Commit Graph

34759 Commits

Author SHA1 Message Date
Paul E. McKenney
4ab00bdd99 rcutorture: Suppress boottime bad-sequence warnings
In normal production, an excessively long wait on a grace period
(synchronize_rcu(), for example) at boottime is often just as bad
as at any other time.  In fact, given the desire for fast boot, any
sort of long wait at boot is a bad idea.  However, heavy rcutorture
testing on large hyperthreaded systems can generate such long waits
during boot as a matter of course.  This commit therefore causes
the rcupdate.rcu_cpu_stall_suppress_at_boot kernel boot parameter to
suppress reporting of bootime bad-sequence warning due to excessively
long grace-period waits.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:03:30 -08:00
Paul E. McKenney
58c53360b3 rcutorture: Allow boottime stall warnings to be suppressed
In normal production, an RCU CPU stall warning at boottime is often
just as bad as at any other time.  In fact, given the desire for fast
boot, any sort of long-term stall at boot is a bad idea.  However,
heavy rcutorture testing on large hyperthreaded systems can generate
boottime RCU CPU stalls as a matter of course.  This commit therefore
provides a kernel boot parameter that suppresses reporting of boottime
RCU CPU stall warnings and similarly of rcutorture writer stalls.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:03:30 -08:00
Paul E. McKenney
a59ee765a6 torture: Forgive -EBUSY from boottime CPU-hotplug operations
During boot, CPU hotplug is often disabled, for example by PCI probing.
On large systems that take substantial time to boot, this can result
in spurious RCU_HOTPLUG errors.  This commit therefore forgives any
boottime -EBUSY CPU-hotplug failures by adjusting counters to pretend
that the corresponding attempt never happened.  A non-splat record
of the failed attempt is emitted to the console with the added string
"(-EBUSY forgiven during boot)".

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:03:30 -08:00
Paul E. McKenney
435508095a rcutorture: Refrain from callback flooding during boot
Additional rcutorture aggression can result in, believe it or not,
boot times in excess of three minutes on large hyperthreaded systems.
This is long enough for rcutorture to decide to do some callback flooding,
which seems a bit excessive given that userspace cannot have started
until long after boot, and it is userspace that does the real-world
callback flooding.  Worse yet, because Tiny RCU lacks forward-progress
functionality, the looping-in-the-kernel tests can also be problematic
during early boot.

This commit therefore causes rcutorture to hold off on callback
flooding until about the time that init is spawned, and the same
for looping-in-the-kernel tests for Tiny RCU.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:03:30 -08:00
Paul E. McKenney
59ee0326cc rcutorture: Suppress forward-progress complaints during early boot
Some larger systems can take in excess of 50 seconds to complete their
early boot initcalls prior to spawing init.  This does not in any way
help the forward-progress judgments of built-in rcutorture (when
rcutorture is built as a module, the insmod or modprobe command normally
cannot happen until some time after boot completes).  This commit
therefore suppresses such complaints until about the time that init
is spawned.

This also includes a fix to a stupid error located by kbuild test robot.

[ paulmck: Apply kbuild test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fix to nohz_full slow-expediting recovery logic, per bpetkov. ]
[ paulmck: Restrict splat to CONFIG_PREEMPT_RT=y kernels and simplify. ]
Tested-by: Borislav Petkov <bp@alien8.de>
2020-02-20 16:03:30 -08:00
Paul E. McKenney
710426068d srcu: Hold srcu_struct ->lock when updating ->srcu_gp_seq
A read of the srcu_struct structure's ->srcu_gp_seq field should not
need READ_ONCE() when that structure's ->lock is held.  Except that this
lock is not always held when updating this field.  This commit therefore
acquires the lock around updates and removes a now-unneeded READ_ONCE().

This data race was reported by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Switch from READ_ONCE() to lock per Peter Zilstra question. ]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-02-20 16:01:11 -08:00
Paul E. McKenney
39f91504a0 srcu: Fix process_srcu()/srcu_batches_completed() datarace
The srcu_struct structure's ->srcu_idx field is accessed locklessly,
so reads must use READ_ONCE().  This commit therefore adds the needed
READ_ONCE() invocation where it was missed.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:01:11 -08:00
Paul E. McKenney
8c9e0cb323 srcu: Fix __call_srcu()/srcu_get_delay() datarace
The srcu_struct structure's ->srcu_gp_seq_needed_exp field is accessed
locklessly, so updates must use WRITE_ONCE().  This commit therefore
adds the needed WRITE_ONCE() invocations.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:01:11 -08:00
Paul E. McKenney
7ff8b4502b srcu: Fix __call_srcu()/process_srcu() datarace
The srcu_node structure's ->srcu_gp_seq_needed_exp field is accessed
locklessly, so updates must use WRITE_ONCE().  This commit therefore
adds the needed WRITE_ONCE() invocations.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:01:11 -08:00
Jules Irenge
90ba11ba99 rcu: Add missing annotation for exit_tasks_rcu_finish()
Sparse reports a warning at exit_tasks_rcu_finish(void)

|warning: context imbalance in exit_tasks_rcu_finish()
|- wrong count at exit

To fix this, this commit adds a __releases(&tasks_rcu_exit_srcu).
Given that exit_tasks_rcu_finish() does actually call __srcu_read_lock(),
this not only fixes the warning but also improves on the readability of
the code.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2020-02-20 16:00:45 -08:00
Jules Irenge
e1e9bdc00a rcu: Add missing annotation for exit_tasks_rcu_start()
Sparse reports a warning at exit_tasks_rcu_start(void)

|warning: context imbalance in exit_tasks_rcu_start() - wrong count at exit

To fix this, this commit adds an __acquires(&tasks_rcu_exit_srcu).
Given that exit_tasks_rcu_start() does actually call __srcu_read_lock(),
this not only fixes the warning but also improves on the readability of
the code.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2020-02-20 16:00:45 -08:00
Paul E. McKenney
fcb7381265 rcu-tasks: *_ONCE() for rcu_tasks_cbs_head
The RCU tasks list of callbacks, rcu_tasks_cbs_head, is sampled locklessly
by rcu_tasks_kthread() when waiting for work to do.  This commit therefore
applies READ_ONCE() to that lockless sampling and WRITE_ONCE() to the
single potential store outside of rcu_tasks_kthread.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:00:45 -08:00
Paul E. McKenney
b692dc4adf rcu: Update __call_rcu() comments
The __call_rcu() function's header comment refers to a cpu argument
that no longer exists, and the comment of the return path from
rcu_nocb_try_bypass() ignores the non-no-CBs CPU case.  This commit
therefore update both comments.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:00:20 -08:00
Colin Ian King
aa96a93ba2 rcu: Fix spelling mistake "leval" -> "level"
This commit fixes a spelling mistake in a pr_info() message.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:00:20 -08:00
Paul E. McKenney
8c14263d35 rcu: React to callback overload by boosting RCU readers
RCU priority boosting currently is not applied until the grace period
is at least 250 milliseconds old (or the number of milliseconds specified
by the CONFIG_RCU_BOOST_DELAY Kconfig option).  Although this has worked
well, it can result in OOM under conditions of RCU callback flooding.
One can argue that the real-time systems using RCU priority boosting
should carefully avoid RCU callback flooding, but one can just as well
argue that an OOM is a rather obnoxious error message.

This commit therefore disables the RCU priority boosting delay when
there are excessive numbers of callbacks queued.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:00:20 -08:00
Paul E. McKenney
b2b00ddf19 rcu: React to callback overload by aggressively seeking quiescent states
In default configutions, RCU currently waits at least 100 milliseconds
before asking cond_resched() and/or resched_rcu() for help seeking
quiescent states to end a grace period.  But 100 milliseconds can be
one good long time during an RCU callback flood, for example, as can
happen when user processes repeatedly open and close files in a tight
loop.  These 100-millisecond gaps in successive grace periods during a
callback flood can result in excessive numbers of callbacks piling up,
unnecessarily increasing memory footprint.

This commit therefore asks cond_resched() and/or resched_rcu() for help
as early as the first FQS scan when at least one of the CPUs has more
than 20,000 callbacks queued, a number that can be changed using the new
rcutree.qovld kernel boot parameter.  An auxiliary qovld_calc variable
is used to avoid acquisition of locks that have not yet been initialized.
Early tests indicate that this reduces the RCU-callback memory footprint
during rcutorture floods by from 50% to 4x, depending on configuration.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Tejun Heo <tj@kernel.org>
[ paulmck: Fix bug located by Qian Cai. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Qian Cai <cai@lca.pw>
2020-02-20 16:00:20 -08:00
Paul E. McKenney
b5ea03709d rcu: Clear ->core_needs_qs at GP end or self-reported QS
The rcu_data structure's ->core_needs_qs field does not necessarily get
cleared in a timely fashion after the corresponding CPUs' quiescent state
has been reported.  From a functional viewpoint, no harm done, but this
can result in excessive invocation of RCU core processing, as witnessed
by the kernel test robot, which saw greatly increased softirq overhead.

This commit therefore restores the rcu_report_qs_rdp() function's
clearing of this field, but only when running on the corresponding CPU.
Cases where some other CPU reports the quiescent state (for example, on
behalf of an idle CPU) are handled by setting this field appropriately
within the __note_gp_changes() function's end-of-grace-period checks.
This handling is carried out regardless of whether the end of a grace
period actually happened, thus handling the case where a CPU goes non-idle
after a quiescent state is reported on its behalf, but before the grace
period ends.  This fix also avoids cross-CPU updates to ->core_needs_qs,

While in the area, this commit changes the __note_gp_changes() need_gp
variable's name to need_qs because it is a quiescent state that is needed
from the CPU in question.

Fixes: ed93dfc6bc ("rcu: Confine ->core_needs_qs accesses to the corresponding CPU")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 16:00:20 -08:00
Paul E. McKenney
28e09a2e48 locktorture: Forgive apparent unfairness if CPU hotplug
If CPU hotplug testing is enabled, a lock might appear to be maximally
unfair just because one of the CPUs was offline almost all the time.
This commit therefore forgives unfairness if CPU hotplug testing was
enabled.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:59:59 -08:00
Paul E. McKenney
c0e1472d80 locktorture: Use private random-number generators
Both lock_torture_writer() and lock_torture_reader() use the "static"
keyword on their DEFINE_TORTURE_RANDOM(rand) declarations, which means
that a single instance of a random-number generator are shared among all
the writers and another is shared among all the readers.  Unfortunately,
this random-number generator was not designed for concurrent access.
This commit therefore removes both "static" keywords so that each reader
and each writer gets its own random-number generator.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:59:59 -08:00
Paul E. McKenney
80c503e0e6 locktorture: Print ratio of acquisitions, not failures
The __torture_print_stats() function in locktorture.c carefully
initializes local variable "min" to statp[0].n_lock_acquired, but
then compares it to statp[i].n_lock_fail.  Given that the .n_lock_fail
field should normally be zero, and given the initialization, it seems
reasonable to display the maximum and minimum number acquisitions
instead of miscomputing the maximum and minimum number of failures.
This commit therefore switches from failures to acquisitions.

And this turns out to be not only a day-zero bug, but entirely my
own fault.  I hate it when that happens!

Fixes: 0af3fe1efa ("locktorture: Add a lock-torture kernel module")
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2020-02-20 15:59:59 -08:00
Uladzislau Rezki (Sony)
613707929b rcu: Add a trace event for kfree_rcu() use of kfree_bulk()
The event is given three parameters, first one is the name
of RCU flavour, second one is the number of elements in array
for free and last one is an address of the array holding
pointers to be freed by the kfree_bulk() function.

To enable the trace event your kernel has to be build with
CONFIG_RCU_TRACE=y, after that it is possible to track the
events using ftrace subsystem.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:51 -08:00
Uladzislau Rezki (Sony)
34c8817455 rcu: Support kfree_bulk() interface in kfree_rcu()
The kfree_rcu() logic can be improved further by using kfree_bulk()
interface along with "basic batching support" introduced earlier.

The are at least two advantages of using "bulk" interface:
- in case of large number of kfree_rcu() requests kfree_bulk()
  reduces the per-object overhead caused by calling kfree()
  per-object.

- reduces the number of cache-misses due to "pointer chasing"
  between objects which can be far spread between each other.

This approach defines a new kfree_rcu_bulk_data structure that
stores pointers in an array with a specific size. Number of entries
in that array depends on PAGE_SIZE making kfree_rcu_bulk_data
structure to be exactly one page.

Since it deals with "block-chain" technique there is an extra
need in dynamic allocation when a new block is required. Memory
is allocated with GFP_NOWAIT | __GFP_NOWARN flags, i.e. that
allows to skip direct reclaim under low memory condition to
prevent stalling and fails silently under high memory pressure.

The "emergency path" gets maintained when a system is run out of
memory. In that case objects are linked into regular list.

The "rcuperf" was run to analyze this change in terms of memory
consumption and kfree_bulk() throughput.

1) Testing on the Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs
with following parameters:

kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1 kfree_vary_obj_size=1
dev.2020.01.10a branch

Default / CONFIG_SLAB
53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB
53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB
53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB

Patch / CONFIG_SLAB
23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB
23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB
24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB

Default / CONFIG_SLUB
51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB
51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB
51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB

Patch / CONFIG_SLUB
50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB
50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB
50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB

in case of CONFIG_SLAB there is double increase in performance and
slightly higher memory usage. As for CONFIG_SLUB, the performance
figures are better together with lower memory usage.

2) Testing on the HiKey-960, arm64, 8xCPUs with below parameters:

CONFIG_SLAB=y
kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1

102898760401 ns, loops: 200000, batches: 5822, memory footprint: 158MB
89947009882  ns, loops: 200000, batches: 6715, memory footprint: 115MB

rcuperf shows approximately ~12% better throughput in case of
using "bulk" interface. The "drain logic" or its RCU callback
does the work faster that leads to better throughput.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:51 -08:00
Paul E. McKenney
3d05031ae6 rcu: Make nocb_gp_wait() double-check unexpected-callback warning
Currently, nocb_gp_wait() unconditionally complains if there is a
callback not already associated with a grace period.  This assumes that
either there was no such callback initially on the one hand, or that
the rcu_advance_cbs() function assigned all such callbacks to a grace
period on the other.  However, in theory there are some situations that
would prevent rcu_advance_cbs() from assigning all of the callbacks.

This commit therefore checks for unassociated callbacks immediately after
rcu_advance_cbs() returns, while the corresponding rcu_node structure's
->lock is still held.  If there are unassociated callbacks at that point,
the subsequent WARN_ON_ONCE() is disabled.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Paul E. McKenney
13817dd589 rcu: Tighten rcu_lockdep_assert_cblist_protected() check
The ->nocb_lock lockdep assertion is currently guarded by cpu_online(),
which is incorrect for no-CBs CPUs, whose callback lists must be
protected by ->nocb_lock regardless of whether or not the corresponding
CPU is online.  This situation could result in failure to detect bugs
resulting from failing to hold ->nocb_lock for offline CPUs.

This commit therefore removes the cpu_online() guard.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Paul E. McKenney
faa059c397 rcu: Optimize and protect atomic_cmpxchg() loop
This commit reworks the atomic_cmpxchg() loop in rcu_eqs_special_set()
to do only the initial read from the current CPU's rcu_data structure's
->dynticks field explicitly.  On subsequent passes, this value is instead
retained from the failing atomic_cmpxchg() operation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Jules Irenge
92c0b889f2 rcu/nocb: Add missing annotation for rcu_nocb_bypass_unlock()
Sparse reports warning at rcu_nocb_bypass_unlock()

warning: context imbalance in rcu_nocb_bypass_unlock() - unexpected unlock

The root cause is a missing annotation of rcu_nocb_bypass_unlock()
which causes the warning.

This commit therefore adds the missing __releases(&rdp->nocb_bypass_lock)
annotation.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Boqun Feng <boqun.feng@gmail.com>
2020-02-20 15:58:23 -08:00
Jules Irenge
9ced454807 rcu: Add missing annotation for rcu_nocb_bypass_lock()
Sparse reports warning at rcu_nocb_bypass_lock()

|warning: context imbalance in rcu_nocb_bypass_lock() - wrong count at exit

To fix this, this commit adds an __acquires(&rdp->nocb_bypass_lock).
Given that rcu_nocb_bypass_lock() does actually call raw_spin_lock()
when raw_spin_trylock() fails, this not only fixes the warning but also
improves on the readability of the code.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Paul E. McKenney
5648d65912 rcu: Don't flag non-starting GPs before GP kthread is running
Currently rcu_check_gp_start_stall() complains if a grace period takes
too long to start, where "too long" is roughly one RCU CPU stall-warning
interval.  This has worked well, but there are some debugging Kconfig
options (such as CONFIG_EFI_PGT_DUMP=y) that can make booting take a
very long time, so much so that the stall-warning interval has expired
before RCU's grace-period kthread has even been spawned.

This commit therefore resets the rcu_state.gp_req_activity and
rcu_state.gp_activity timestamps just before the grace-period kthread
is spawned, and modifies the checks and adds ordering to ensure that
if rcu_check_gp_start_stall() sees that the grace-period kthread
has been spawned, that it will also see the resets applied to the
rcu_state.gp_req_activity and rcu_state.gp_activity timestamps.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fix whitespace issues reported by Qian Cai. ]
Tested-by: Qian Cai <cai@lca.pw>
[ paulmck: Simplify grace-period wakeup check per Steve Rostedt feedback. ]
2020-02-20 15:58:23 -08:00
Paul E. McKenney
aa24f93753 rcu: Fix rcu_barrier_callback() race condition
The rcu_barrier_callback() function does an atomic_dec_and_test(), and
if it is the last CPU to check in, does the required wakeup.  Either way,
it does an event trace.  Unfortunately, this is susceptible to the
following sequence of events:

o	CPU 0 invokes rcu_barrier_callback(), but atomic_dec_and_test()
	says that it is not last.  But at this point, CPU 0 is delayed,
	perhaps due to an NMI, SMI, or vCPU preemption.

o	CPU 1 invokes rcu_barrier_callback(), and atomic_dec_and_test()
	says that it is last.  So CPU 1 traces completion and does
	the needed wakeup.

o	The awakened rcu_barrier() function does cleanup and releases
	rcu_state.barrier_mutex.

o	Another CPU now acquires rcu_state.barrier_mutex and starts
	another round of rcu_barrier() processing, including updating
	rcu_state.barrier_sequence.

o	CPU 0 gets its act back together and does its tracing.  Except
	that rcu_state.barrier_sequence has already been updated, so
	its tracing is incorrect and probably quite confusing.
	(Wait!  Why did this CPU check in twice for one rcu_barrier()
	invocation???)

This commit therefore causes rcu_barrier_callback() to take a
snapshot of the value of rcu_state.barrier_sequence before invoking
atomic_dec_and_test(), thus guaranteeing that the event-trace output
is sensible, even if the timing of the event-trace output might still
be confusing.  (Wait!  Why did the old rcu_barrier() complete before
all of its CPUs checked in???)  But being that this is RCU, only so much
confusion can reasonably be eliminated.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely and due to the mild consequences of the
failure, namely a confusing event trace.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Paul E. McKenney
59881bcd85 rcu: Add WRITE_ONCE() to rcu_state ->gp_start
The rcu_state structure's ->gp_start field is read locklessly, so this
commit adds the WRITE_ONCE() to an update in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Paul E. McKenney
57721fd15a rcu: Remove dead code from rcu_segcblist_insert_pend_cbs()
The rcu_segcblist_insert_pend_cbs() function currently (partially)
initializes the rcu_cblist that it pulls callbacks from.  However, all
the resulting stores are dead because all callers pass in the address of
an on-stack cblist that is not used afterwards.  This commit therefore
removes this pointless initialization.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:23 -08:00
Eric Dumazet
90c018942c timer: Use hlist_unhashed_lockless() in timer_pending()
The timer_pending() function is mostly used in lockless contexts, so
Without proper annotations, KCSAN might detect a data-race [1].

Using hlist_unhashed_lockless() instead of hand-coding it seems
appropriate (as suggested by Paul E. McKenney).

[1]

BUG: KCSAN: data-race in del_timer / detach_if_pending

write to 0xffff88808697d870 of 8 bytes by task 10 on cpu 0:
 __hlist_del include/linux/list.h:764 [inline]
 detach_timer kernel/time/timer.c:815 [inline]
 detach_if_pending+0xcd/0x2d0 kernel/time/timer.c:832
 try_to_del_timer_sync+0x60/0xb0 kernel/time/timer.c:1226
 del_timer_sync+0x6b/0xa0 kernel/time/timer.c:1365
 schedule_timeout+0x2d2/0x6e0 kernel/time/timer.c:1896
 rcu_gp_fqs_loop+0x37c/0x580 kernel/rcu/tree.c:1639
 rcu_gp_kthread+0x143/0x230 kernel/rcu/tree.c:1799
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffff88808697d870 of 8 bytes by task 12060 on cpu 1:
 del_timer+0x3b/0xb0 kernel/time/timer.c:1198
 sk_stop_timer+0x25/0x60 net/core/sock.c:2845
 inet_csk_clear_xmit_timers+0x69/0xa0 net/ipv4/inet_connection_sock.c:523
 tcp_clear_xmit_timers include/net/tcp.h:606 [inline]
 tcp_v4_destroy_sock+0xa3/0x3f0 net/ipv4/tcp_ipv4.c:2096
 inet_csk_destroy_sock+0xf4/0x250 net/ipv4/inet_connection_sock.c:836
 tcp_close+0x6f3/0x970 net/ipv4/tcp.c:2497
 inet_release+0x86/0x100 net/ipv4/af_inet.c:427
 __sock_release+0x85/0x160 net/socket.c:590
 sock_close+0x24/0x30 net/socket.c:1268
 __fput+0x1e1/0x520 fs/file_table.c:280
 ____fput+0x1f/0x30 fs/file_table.c:313
 task_work_run+0xf6/0x130 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_usermode_loop+0x2b4/0x2c0 arch/x86/entry/common.c:163

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 12060 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine,

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ paulmck: Pulled in Eric's later amendments. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
3ca3b0e2cb rcu: Add *_ONCE() to rcu_node ->boost_kthread_status
The rcu_node structure's ->boost_kthread_status field is accessed
locklessly, so this commit causes all updates to use WRITE_ONCE() and
all reads to use READ_ONCE().

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
2a2ae872ef rcu: Add *_ONCE() to rcu_data ->rcu_forced_tick
The rcu_data structure's ->rcu_forced_tick field is read locklessly, so
this commit adds WRITE_ONCE() to all updates and READ_ONCE() to all
lockless reads.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
a5b8950180 rcu: Add READ_ONCE() to rcu_data ->gpwrap
The rcu_data structure's ->gpwrap field is read locklessly, and so
this commit adds the required READ_ONCE() to a pair of laods in order
to avoid destructive compiler optimizations.

This data race was reported by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
SeongJae Park
65bb0dc437 rcu: Fix typos in file-header comments
Convert to plural and add a note that this is for Tree RCU.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
8ff37290d6 rcu: Add *_ONCE() for grace-period progress indicators
The various RCU structures' ->gp_seq, ->gp_seq_needed, ->gp_req_activity,
and ->gp_activity fields are read locklessly, so they must be updated with
WRITE_ONCE() and, when read locklessly, with READ_ONCE().  This commit makes
these changes.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
bfeebe2421 rcu: Add READ_ONCE() to rcu_segcblist ->tails[]
The rcu_segcblist structure's ->tails[] array entries are read
locklessly, so this commit adds the READ_ONCE() to a load in order to
avoid destructive compiler optimizations.

This data race was reported by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
0050c7b2d2 locking/rtmutex: rcu: Add WRITE_ONCE() to rt_mutex ->owner
The rt_mutex structure's ->owner field is read locklessly, so this
commit adds the WRITE_ONCE() to an update in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
105abf82b0 rcu: Add WRITE_ONCE() to rcu_node ->qsmaskinitnext
The rcu_state structure's ->qsmaskinitnext field is read locklessly,
so this commit adds the WRITE_ONCE() to an update in order to provide
proper documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely for systems not doing incessant CPU-hotplug
operations.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
2906d2154c rcu: Add WRITE_ONCE() to rcu_state ->gp_req_activity
The rcu_state structure's ->gp_req_activity field is read locklessly,
so this commit adds the WRITE_ONCE() to an update in order to provide
proper documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
0937d04573 rcu: Add READ_ONCE() to rcu_node ->gp_seq
The rcu_node structure's ->gp_seq field is read locklessly, so this
commit adds the READ_ONCE() to several loads in order to avoid
destructive compiler optimizations.

This data race was reported by KCSAN.  Not appropriate for backporting
because this affects only tracing and warnings.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
b0c18c8773 rcu: Add WRITE_ONCE to rcu_node ->exp_seq_rq store
The rcu_node structure's ->exp_seq_rq field is read locklessly, so
this commit adds the WRITE_ONCE() to a load in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
7672d647dd rcu: Add WRITE_ONCE() to rcu_node ->qsmask update
The rcu_node structure's ->qsmask field is read locklessly, so this
commit adds the WRITE_ONCE() to an update in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:22 -08:00
Paul E. McKenney
8a7e8f5171 rcu: Provide debug symbols and line numbers in KCSAN runs
This commit adds "-g -fno-omit-frame-pointer" to ease interpretation
of KCSAN output, but only for CONFIG_KCSAN=y kerrnels.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:21 -08:00
Paul E. McKenney
24bb9eccf7 rcu: Fix exp_funnel_lock()/rcu_exp_wait_wake() datarace
The rcu_node structure's ->exp_seq_rq field is accessed locklessly, so
updates must use WRITE_ONCE().  This commit therefore adds the needed
WRITE_ONCE() invocation where it was missed.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:21 -08:00
Paul E. McKenney
82dd8419e2 rcu: Warn on for_each_leaf_node_cpu_mask() from non-leaf
The for_each_leaf_node_cpu_mask() and for_each_leaf_node_possible_cpu()
macros must be invoked only on leaf rcu_node structures.  Failing to
abide by this restriction can result in infinite loops on systems with
more than 64 CPUs (or for more than 32 CPUs on 32-bit systems).  This
commit therefore adds WARN_ON_ONCE() calls to make misuse of these two
macros easier to debug.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-02-20 15:58:21 -08:00
Masami Hiramatsu
d8a953ddde bootconfig: Set CONFIG_BOOT_CONFIG=n by default
Set CONFIG_BOOT_CONFIG=n by default. This also warns
user if CONFIG_BOOT_CONFIG=n but "bootconfig" is given
in the kernel command line.

Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 17:52:12 -05:00
Masami Hiramatsu
7ab215f22d tracing: Clear trace_state when starting trace
Clear trace_state data structure when starting trace
in __synth_event_trace_start() internal function.

Currently trace_state is initialized only in the
synth_event_trace_start() API, but the trace_state
in synth_event_trace() and synth_event_trace_array()
are on the stack without initialization.
This means those APIs will see wrong parameters and
wil skip closing process in __synth_event_trace_end()
because trace_state->disabled may be !0.

Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 17:48:59 -05:00
Steven Rostedt (VMware)
78041c0c9e tracing: Disable trace_printk() on post poned tests
The tracing seftests checks various aspects of the tracing infrastructure,
and one is filtering. If trace_printk() is active during a self test, it can
cause the filtering to fail, which will disable that part of the trace.

To keep the selftests from failing because of trace_printk() calls,
trace_printk() checks the variable tracing_selftest_running, and if set, it
does not write to the tracing buffer.

As some tracers were registered earlier in boot, the selftest they triggered
would fail because not all the infrastructure was set up for the full
selftest. Thus, some of the tests were post poned to when their
infrastructure was ready (namely file system code). The postpone code did
not set the tracing_seftest_running variable, and could fail if a
trace_printk() was added and executed during their run.

Cc: stable@vger.kernel.org
Fixes: 9afecfbb95 ("tracing: Postpone tracer start-up tests till the system is more robust")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 17:43:58 -05:00
Steven Rostedt (VMware)
3c18a9be7c tracing: Have synthetic event test use raw_smp_processor_id()
The test code that tests synthetic event creation pushes in as one of its
test fields the current CPU using "smp_processor_id()". As this is just
something to see if the value is correctly passed in, and the actual CPU
used does not matter, use raw_smp_processor_id(), otherwise with debug
preemption enabled, a warning happens as the smp_processor_id() is called
without preemption enabled.

Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 17:43:41 -05:00
Tom Zanussi
784bd0847e tracing: Fix number printing bug in print_synth_event()
Fix a varargs-related bug in print_synth_event() which resulted in
strange output and oopses on 32-bit x86 systems. The problem is that
trace_seq_printf() expects the varargs to match the format string, but
print_synth_event() was always passing u64 values regardless.  This
results in unspecified behavior when unpacking with va_arg() in
trace_seq_printf().

Add a function that takes the size into account when calling
trace_seq_printf().

Before:

  modprobe-1731  [003] ....   919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null)

After:

 insmod-1136  [001] ....    36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598

Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155.git.zanussi@kernel.org

Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 16:24:17 -05:00
Tom Zanussi
3843083772 tracing: Check that number of vals matches number of synth event fields
Commit 7276531d4036('tracing: Consolidate trace() functions')
inadvertently dropped the synth_event_trace() and
synth_event_trace_array() checks that verify the number of values
passed in matches the number of fields in the synthetic event being
traced, so add them back.

Link: http://lkml.kernel.org/r/32819cac708714693669e0dfe10fe9d935e94a16.1581720155.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 16:24:17 -05:00
Tom Zanussi
1d9d4c9019 tracing: Make synth_event trace functions endian-correct
synth_event_trace(), synth_event_trace_array() and
__synth_event_add_val() write directly into the trace buffer and need
to take endianness into account, like trace_event_raw_event_synth()
does.

Link: http://lkml.kernel.org/r/2011354355e405af9c9d28abba430d1f5ff7771a.1581720155.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 16:24:17 -05:00
Tom Zanussi
279eef0531 tracing: Make sure synth_event_trace() example always uses u64
synth_event_trace() is the varargs version of synth_event_trace_array(),
which takes an array of u64, as do synth_event_add_val() et al.

To not only be consistent with those, but also to address the fact
that synth_event_trace() expects every arg to be of the same type
since it doesn't also pass in e.g. a format string, the caller needs
to make sure all args are of the same type, u64.  u64 is used because
it needs to accomodate the largest type available in synthetic events,
which is u64.

This fixes the bug reported by the kernel test robot/Rong Chen.

Link: https://lore.kernel.org/lkml/20200212113444.GS12867@shao2-debian/
Link: http://lkml.kernel.org/r/894c4e955558b521210ee0642ba194a9e603354c.1581720155.git.zanussi@kernel.org

Fixes: 9fe41efaca ("tracing: Add synth event generation test module")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-20 16:24:17 -05:00
Morten Rasmussen
000619680c sched/fair: Remove wake_cap()
Capacity-awareness in the wake-up path previously involved disabling
wake_affine in certain scenarios. We have just made select_idle_sibling()
capacity-aware, so this isn't needed anymore.

Remove wake_cap() entirely.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Changelog tweaks]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200206191957.12325-5-valentin.schneider@arm.com
2020-02-20 21:03:15 +01:00
Valentin Schneider
f8459197e7 sched/core: Remove for_each_lower_domain()
The last remaining user of this macro has just been removed, get rid of it.

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-4-valentin.schneider@arm.com
2020-02-20 21:03:15 +01:00
Morten Rasmussen
a526d46679 sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems
SD_BALANCE_WAKE was previously added to lower sched_domain levels on
asymmetric CPU capacity systems by commit:

  9ee1cda5ee ("sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems")

to enable the use of find_idlest_cpu() and friends to find an appropriate
CPU for tasks.

That responsibility has now been shifted to select_idle_sibling() and
friends, and hence the flag can be removed. Note that this causes
asymmetric CPU capacity systems to no longer enter the slow wakeup path
(find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned
with all other mainline topologies).

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
[Changelog tweaks]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-3-valentin.schneider@arm.com
2020-02-20 21:03:14 +01:00
Morten Rasmussen
b7a331615d sched/fair: Add asymmetric CPU capacity wakeup scan
Issue
=====

On asymmetric CPU capacity topologies, we currently rely on wake_cap() to
drive select_task_rq_fair() towards either:

- its slow-path (find_idlest_cpu()) if either the previous or
  current (waking) CPU has too little capacity for the waking task
- its fast-path (select_idle_sibling()) otherwise

Commit:

  3273163c67 ("sched/fair: Let asymmetric CPU configurations balance at wake-up")

points out that this relies on the assumption that "[...]the CPU capacities
within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous".

This assumption no longer holds on newer generations of big.LITTLE
systems (DynamIQ), which can accommodate CPUs of different compute capacity
within a single LLC domain. To hopefully paint a better picture, a regular
big.LITTLE topology would look like this:

  +---------+ +---------+
  |   L2    | |   L2    |
  +----+----+ +----+----+
  |CPU0|CPU1| |CPU2|CPU3|
  +----+----+ +----+----+
      ^^^         ^^^
    LITTLEs      bigs

which would result in the following scheduler topology:

  DIE [         ] <- sd_asym_cpucapacity
  MC  [   ] [   ] <- sd_llc
       0 1   2 3

Conversely, a DynamIQ topology could look like:

  +-------------------+
  |        L3         |
  +----+----+----+----+
  | L2 | L2 | L2 | L2 |
  +----+----+----+----+
  |CPU0|CPU1|CPU2|CPU3|
  +----+----+----+----+
     ^^^^^     ^^^^^
    LITTLEs    bigs

which would result in the following scheduler topology:

  MC [       ] <- sd_llc, sd_asym_cpucapacity
      0 1 2 3

What this means is that, on DynamIQ systems, we could pass the wake_cap()
test (IOW presume the waking task fits on the CPU capacities of some LLC
domain), thus go through select_idle_sibling().
This function operates on an LLC domain, which here spans both bigs and
LITTLEs, so it could very well pick a CPU of too small capacity for the
task, despite there being fitting idle CPUs - it very much depends on the
CPU iteration order, on which we have absolutely no guarantees
capacity-wise.

Implementation
==============

Introduce yet another select_idle_sibling() helper function that takes CPU
capacity into account. The policy is to pick the first idle CPU which is
big enough for the task (task_util * margin < cpu_capacity). If no
idle CPU is big enough, we pick the idle one with the highest capacity.

Unlike other select_idle_sibling() helpers, this one operates on the
sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all
known CPU capacities in the system. As such, this will work for both
"legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for
newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain).

Note that this limits the scope of select_idle_sibling() to
select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain
will not be scanned, and no further heuristic will be applied.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://lkml.kernel.org/r/20200206191957.12325-2-valentin.schneider@arm.com
2020-02-20 21:03:14 +01:00
Scott Wood
82e0516ce3 sched/core: Remove duplicate assignment in sched_tick_remote()
A redundant "curr = rq->curr" was added; remove it.

Fixes: ebc0f83c78 ("timers/nohz: Update NOHZ load in remote tick")
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1580776558-12882-1-git-send-email-swood@redhat.com
2020-02-20 21:03:13 +01:00
Alexandre Belloni
b0c609ab20 PM / hibernate: fix typo "reserverd_size" -> "reserved_size"
Fix a mistake in a variable name in a comment.

Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-02-20 11:58:01 +01:00
David S. Miller
41f57cfde1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-02-19

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 10 day(s) which contain
a total of 10 files changed, 93 insertions(+), 31 deletions(-).

The main changes are:

1) batched bpf hashtab fixes from Brian and Yonghong.

2) various selftests and libbpf fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 16:42:35 -08:00
Yonghong Song
b9aff38de2 bpf: Fix a potential deadlock with bpf_map_do_batch
Commit 057996380a ("bpf: Add batch ops to all htab bpf map")
added lookup_and_delete batch operation for hash table.
The current implementation has bpf_lru_push_free() inside
the bucket lock, which may cause a deadlock.

syzbot reports:
   -> #2 (&htab->buckets[i].lock#2){....}:
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
       htab_lru_map_delete_node+0xce/0x2f0 kernel/bpf/hashtab.c:593
       __bpf_lru_list_shrink_inactive kernel/bpf/bpf_lru_list.c:220 [inline]
       __bpf_lru_list_shrink+0xf9/0x470 kernel/bpf/bpf_lru_list.c:266
       bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:340 [inline]
       bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
       bpf_lru_pop_free+0x87c/0x1670 kernel/bpf/bpf_lru_list.c:499
       prealloc_lru_pop+0x2c/0xa0 kernel/bpf/hashtab.c:132
       __htab_lru_percpu_map_update_elem+0x67e/0xa90 kernel/bpf/hashtab.c:1069
       bpf_percpu_hash_update+0x16e/0x210 kernel/bpf/hashtab.c:1585
       bpf_map_update_value.isra.0+0x2d7/0x8e0 kernel/bpf/syscall.c:181
       generic_map_update_batch+0x41f/0x610 kernel/bpf/syscall.c:1319
       bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
       __do_sys_bpf+0x9b7/0x41e0 kernel/bpf/syscall.c:3460
       __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
       __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

   -> #0 (&loc_l->lock){....}:
       check_prev_add kernel/locking/lockdep.c:2475 [inline]
       check_prevs_add kernel/locking/lockdep.c:2580 [inline]
       validate_chain kernel/locking/lockdep.c:2970 [inline]
       __lock_acquire+0x2596/0x4a00 kernel/locking/lockdep.c:3954
       lock_acquire+0x190/0x410 kernel/locking/lockdep.c:4484
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x95/0xcd kernel/locking/spinlock.c:159
       bpf_common_lru_push_free kernel/bpf/bpf_lru_list.c:516 [inline]
       bpf_lru_push_free+0x250/0x5b0 kernel/bpf/bpf_lru_list.c:555
       __htab_map_lookup_and_delete_batch+0x8d4/0x1540 kernel/bpf/hashtab.c:1374
       htab_lru_map_lookup_and_delete_batch+0x34/0x40 kernel/bpf/hashtab.c:1491
       bpf_map_do_batch+0x3f5/0x510 kernel/bpf/syscall.c:3348
       __do_sys_bpf+0x1f7d/0x41e0 kernel/bpf/syscall.c:3456
       __se_sys_bpf kernel/bpf/syscall.c:3355 [inline]
       __x64_sys_bpf+0x73/0xb0 kernel/bpf/syscall.c:3355
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Possible unsafe locking scenario:

          CPU0                    CPU2
          ----                    ----
     lock(&htab->buckets[i].lock#2);
                                  lock(&l->lock);
                                  lock(&htab->buckets[i].lock#2);
     lock(&loc_l->lock);

    *** DEADLOCK ***

To fix the issue, for htab_lru_map_lookup_and_delete_batch() in CPU0,
let us do bpf_lru_push_free() out of the htab bucket lock. This can
avoid the above deadlock scenario.

Fixes: 057996380a ("bpf: Add batch ops to all htab bpf map")
Reported-by: syzbot+a38ff3d9356388f2fb83@syzkaller.appspotmail.com
Reported-by: syzbot+122b5421d14e68f29cd1@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Brian Vazquez <brianvv@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200219234757.3544014-1-yhs@fb.com
2020-02-19 16:01:25 -08:00
Brian Vazquez
492e0d0d6f bpf: Do not grab the bucket spinlock by default on htab batch ops
Grabbing the spinlock for every bucket even if it's empty, was causing
significant perfomance cost when traversing htab maps that have only a
few entries. This patch addresses the issue by checking first the
bucket_cnt, if the bucket has some entries then we go and grab the
spinlock and proceed with the batching.

Tested with a htab of size 50K and different value of populated entries.

Before:
  Benchmark             Time(ns)        CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1       2759655        2752033
  BM_DumpHashMap/10      2933722        2930825
  BM_DumpHashMap/200     3171680        3170265
  BM_DumpHashMap/500     3639607        3635511
  BM_DumpHashMap/1000    4369008        4364981
  BM_DumpHashMap/5k     11171919       11134028
  BM_DumpHashMap/20k    69150080       69033496
  BM_DumpHashMap/39k   190501036      190226162

After:
  Benchmark             Time(ns)        CPU(ns)
  ---------------------------------------------
  BM_DumpHashMap/1        202707         200109
  BM_DumpHashMap/10       213441         210569
  BM_DumpHashMap/200      478641         472350
  BM_DumpHashMap/500      980061         967102
  BM_DumpHashMap/1000    1863835        1839575
  BM_DumpHashMap/5k      8961836        8902540
  BM_DumpHashMap/20k    69761497       69322756
  BM_DumpHashMap/39k   187437830      186551111

Fixes: 057996380a ("bpf: Add batch ops to all htab bpf map")
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200218172552.215077-1-brianvv@google.com
2020-02-19 15:59:30 -08:00
Daniel Xu
fff7b64355 bpf: Add bpf_read_branch_records() helper
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz
2020-02-19 14:37:36 -08:00
Stephen Kitt
140588bfed s390: remove obsolete ieee_emulation_warnings
s390 math emulation was removed with commit 5a79859ae0 ("s390:
remove 31 bit support"), rendering ieee_emulation_warnings useless.
The code still built because it was protected by CONFIG_MATHEMU, which
was no longer selectable.

This patch removes the sysctl_ieee_emulation_warnings declaration and
the sysctl entry declaration.

Link: https://lkml.kernel.org/r/20200214172628.3598516-1-steve@sk2.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Stephen Kitt <steve@sk2.org>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-02-19 13:51:46 +01:00
Linus Torvalds
0a44cac810 dma-mapping fixes for 5.6
- give command line cma= precedence over the CONFIG_ option
    (Nicolas Saenz Julienne)
  - always allow 32-bit DMA, even for weirdly placed ZONE_DMA
  - improve the debug printks when memory is not addressable, to help
    find problems with swiotlb initialization
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl5MU2MLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMITg/6A7iEBPKTR5ugKKHoYiKfg7xrslOHZHHdWh2qXKMf
 C+vjeZzIeIWWFWVvCYAfbueSTAPeyfE1qMTwHjBakH1DOIpl2TgtLmBqZT+hb4WJ
 Me4a48dNyX5Ngk5e6l4iNfkaa0P0SsbjRyZQyWpXCRDOAdudqmUOrgjA1MJPZrdS
 dkzB6RdcKUdUJZ/METQ0+i2knbznMcfHn0WsqPjD8A1OMvgslUg3ky4t1dpfCjC9
 hfmwkaWQmfPbW1SKmkuuOXc4im2WI8TeVNJ5N1+qAr91AtEVELTSBToZkEP4FoFj
 iTCfwimt77qF3qeD7kIU6WGbFrnEcV9VTWW3YilMFgVUF6f0bW0HZBq4fQdI9HvS
 O/aZteZTz5ukp9tJ8gILN1CNuuayJtaRFMeZv+A2+lsI31ITC2GKYsoNAepdwtHq
 w1d4OrJciUy92h4VOrGtWFDIyr+tFrwqUjUR1WElz3tvPE4fkaLaK4lzEof7Z5B9
 trzUHBmiiFnjAVpEIADEjRHXcNJQPBuWi07UT7ZuJ3bMa+neKUcsSwm3RZf46Kx+
 yh45HsmjjHcBmHbbpTS3BGkog44cXB/+hRxtntBOwsvAaP4Ip8SHeyDrMH1Ay6xr
 4xZmt9c3kQySgrywjIP//WmQ601HMjPmddkrJj8kZNNE7HxQYQlIyyf9IpEeSxVK
 Seg=
 =NfNc
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.6' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - give command line cma= precedence over the CONFIG_ option (Nicolas
   Saenz Julienne)

 - always allow 32-bit DMA, even for weirdly placed ZONE_DMA

 - improve the debug printks when memory is not addressable, to help
   find problems with swiotlb initialization

* tag 'dma-mapping-5.6' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: improve DMA mask overflow reporting
  dma-direct: improve swiotlb error reporting
  dma-direct: relax addressability checks in dma_direct_supported
  dma-contiguous: CMA: give precedence to cmdline
2020-02-18 15:06:38 -08:00
Song Liu
b80b033bed bpf: Allow bpf_perf_event_read_value in all BPF programs
bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs.
This can be used in fentry/fexit to profile BPF program and individual
kernel function with hardware counters.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com
2020-02-18 16:08:27 +01:00
Benjamin Herrenschmidt
33225d7b0a printk: Correctly set CON_CONSDEV even when preferred console was not registered
CON_CONSDEV flag was historically used to put/keep the preferred console
first in console_drivers list. Where the preferred console is the last
on the command line.

The ordering is important only when opening /dev/console:

  + tty_kopen()
    + tty_lookup_driver()
      + console_device()

The flag was originally an implementation detail. But it was later
made accessible from userspace via /proc/consoles. It was used,
for example, by the tool "showconsole" to show the real tty
accessible via /dev/console, see
https://github.com/bitstreamout/showconsole

Now, the current code sets CON_CONSDEV only for the preferred
console or when a fallback console is added. The flag is not
set when the preferred console is defined on the command line
but it is not registered from some reasons.

Simple solution is to set CON_CONSDEV flag for the first
registered console. It will work most of the time because:

  + Most real consoles have console->device defined.

  + Boot consoles are removed in printk_late_init().

  + unregister_console() moves CON_CONSDEV flag to the next
    console.

Clean solution would require checking con->device when the
preferred console is registered and in unregister_console().

Conclusion:

Use the simple solution for now. It is better than the current
state and good enough.

The clean solution is not worth it. It would complicate the already
complicated code without too much gain. Instead the code would deserve
a complete rewrite.

Link: https://lore.kernel.org/r/20200213095133.23176-4-pmladek@suse.com
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[pmladek@suse.com: Correct reasoning in the commit message, comment update.]
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-18 09:35:24 +01:00
Benjamin Herrenschmidt
e369d8227f printk: Fix preferred console selection with multiple matches
In the following circumstances, the rule of selecting the console
corresponding to the last "console=" entry on the command line as
the preferred console (CON_CONSDEV, ie, /dev/console) fails. This
is a specific example, but it could happen with different consoles
that have a similar name aliasing mechanism.

  - The kernel command line has both console=tty0 and console=ttyS0
    in that order (the latter with speed etc... arguments).
    This is common with some cloud setups such as Amazon Linux.

  - add_preferred_console is called early to register "uart0". In
    our case that happens from acpi_parse_spcr() on arm64 since the
    "enable_console" argument is true on that architecture. This causes
    "uart0" to become entry 0 of the console_cmdline array.

Now, because of the above, what happens is:

  - add_preferred_console is called by the cmdline parsing for tty0
    and ttyS0 respectively, thus occupying entries 1 and 2 of the
    console_cmdline array (since this happens after ACPI SPCR parsing).
    At that point preferred_console is set to 2 as expected.

  - When the tty layer kicks in, it will call register_console for tty0.
    This will match entry 1 in console_cmdline array. It isn't our
    preferred console but because it's our only console at this point,
    it will end up "first" in the consoles list.

  - When 8250 probes the actual serial port later on, it calls
    register_console for ttyS0. At that point the loop in register_console
    tries to match it with the entries in the console_cmdline array.
    Ideally this should match ttyS0 in entry 2, which is preferred, causing
    it to be inserted first and to replace tty0 as CONSDEV. However, 8250
    provides a "match" hook in its struct console, and that hook will match
    "uart" as an alias to "ttyS". So we match uart0 at entry 0 in the array
    which is not the preferred console and will not match entry 2 which is
    since we break out of the loop on the first match. As a result,
    we don't set CONSDEV and don't insert it first, but second in
    the console list.

    As a result, we end up with tty0 remaining first in the array, and thus
    /dev/console going there instead of the last user specified one which
    is ttyS0.

This tentative fix register_console() to scan first for consoles
specified on the command line, and only if none is found, to then
scan for consoles specified by the architecture.

Link: https://lore.kernel.org/r/20200213095133.23176-3-pmladek@suse.com
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-18 09:34:42 +01:00
Benjamin Herrenschmidt
ad8cd1db80 printk: Move console matching logic into a separate function
This moves the loop that tries to match a newly registered console
with the command line or add_preferred_console list into a separate
helper, in order to be able to call it multiple times in subsequent
patches.

Link: https://lore.kernel.org/r/20200213095133.23176-2-pmladek@suse.com
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-18 09:33:48 +01:00
Gustavo A. R. Silva
0f74226649 kernel: module: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-02-17 21:40:55 +01:00
Amol Grover
5fb1c2a5bb posix-timers: Pass lockdep expression to RCU lists
head is traversed using hlist_for_each_entry_rcu outside an RCU read-side
critical section but under the protection of hash_lock.

Hence, add corresponding lockdep expression to silence false-positive
lockdep warnings, and harden RCU lists.

[ tglx: Removed the macro and put the condition right where it's used ]

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200216074330.GA14025@workstation-portable
2020-02-17 20:12:19 +01:00
Alexander Popov
6e317c32fd timer: Improve the comment describing schedule_timeout()
When working commit 6dcd5d7a7a, a mistake was noticed by Linus:
schedule_timeout() was called without setting the task state to anything
particular.

It calls the scheduler, but doesn't delay anything, because the task stays
runnable. That happens because sched_submit_work() does nothing for tasks
in TASK_RUNNING state.

That turned out to be the intended behavior. Adding a WARN() is not useful
as the task could be woken up right after setting the state and before
reaching schedule_timeout().

Improve the comment about schedule_timeout() and describe that more
explicitly.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200117225900.16340-1-alex.popov@linux.com
2020-02-17 20:12:19 +01:00
Thomas Gleixner
2d6b01bd88 lib/vdso: Move VCLOCK_TIMENS to vdso_clock_modes
Move the time namespace indicator clock mode to the other ones for
consistency sake.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/20200207124403.656097274@linutronix.de
2020-02-17 20:12:17 +01:00
Thomas Gleixner
c7a18100bd lib/vdso: Avoid highres update if clocksource is not VDSO capable
If the current clocksource is not VDSO capable there is no point in
updating the high resolution parts of the VDSO data.

Replace the architecture specific check with a check for a VDSO capable
clocksource and skip the update if there is none.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/20200207124403.563379423@linutronix.de
2020-02-17 20:12:17 +01:00
Thomas Gleixner
f86fd32db7 lib/vdso: Cleanup clock mode storage leftovers
Now that all architectures are converted to use the generic storage the
helpers and conditionals can be removed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/20200207124403.470699892@linutronix.de
2020-02-17 20:12:17 +01:00
Johannes Krude
e20d3a055a bpf, offload: Replace bitwise AND by logical AND in bpf_prog_offload_info_fill
This if guards whether user-space wants a copy of the offload-jited
bytecode and whether this bytecode exists. By erroneously doing a bitwise
AND instead of a logical AND on user- and kernel-space buffer-size can lead
to no data being copied to user-space especially when user-space size is a
power of two and bigger then the kernel-space buffer.

Fixes: fcfb126def ("bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info")
Signed-off-by: Johannes Krude <johannes@krude.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20200212193227.GA3769@phlox.h.transitiv.net
2020-02-17 16:53:49 +01:00
Thomas Gleixner
5d51bee725 clocksource: Add common vdso clock mode storage
All architectures which use the generic VDSO code have their own storage
for the VDSO clock mode. That's pointless and just requires duplicate code.

Provide generic storage for it. The new Kconfig symbol is intermediate and
will be removed once all architectures are converted over.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/20200207124403.028046322@linutronix.de
2020-02-17 14:40:23 +01:00
Linus Torvalds
ef78e5b7de Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes all over the place:

   - Fix NUMA over-balancing between lightly loaded nodes. This is
     fallout of the big load-balancer rewrite.

   - Fix the NOHZ remote loadavg update logic, which fixes anomalies
     like reported 150 loadavg on mostly idle CPUs.

   - Fix XFS performance/scalability

   - Fix throttled groups unbound task-execution bug

   - Fix PSI procfs boundary condition

   - Fix the cpu.uclamp.{min,max} cgroup configuration write checks

   - Fix DocBook annotations

   - Fix RCU annotations

   - Fix overly CPU-intensive housekeeper CPU logic loop on large CPU
     counts"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
  sched/core: Annotate curr pointer in rq with __rcu
  sched/psi: Fix OOB write when writing 0 bytes to PSI files
  sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
  sched/fair: Prevent unlimited runtime on throttled group
  sched/nohz: Optimize get_nohz_timer_target()
  sched/uclamp: Reject negative values in cpu_uclamp_write()
  sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
  timers/nohz: Update NOHZ load in remote tick
  sched/core: Don't skip remote tick for idle CPUs
2020-02-15 12:51:22 -08:00
Linus Torvalds
4e03e4e6d2 Power management fixes for 5.6-rc2
Fix three issues related to the handling of wakeup events signaled
 through the ACPI SCI while suspended to idle (Rafael Wysocki) and
 unexport an internal cpufreq variable (Yangtao Li).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl5GbaoSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxwHYP/iINPkW3IErX3TulA/2U+JjAK8TBsjl6
 TY602el2noTddiUTvwf6AZ0N9B8QyPDLd66iC2/hGLIrM+Tg1Jd3/cZ5pwI79CYg
 NV7+kzW/39mscPpH/aPd5KRDXssQ2HRjBDv+GgwPsJxoycKDdppuOXDHzu87/57f
 oteVtP8VGwvMayZIi9d4tYO1x2vYtqd7QJioj7Ob3xMMWABjeMsputRkOzGaQfOB
 jOfgpWmPIePSjg6F4gGPRKgVzQmoeZvibtRLyaBe+s9go4vhKW+/X8XCoBM8ZNAM
 edKp8Z0FSv005GtbCcq2bukZTO4eE1KN/vFTOSzqmjuojsjN7hpu9jyUk3gga/si
 J7xmgo6w3L8VWGVY2Yiqo+iM/qVXvoNN06dJ8KRm0ZRq7m2iNcRvL3YksXgbACx4
 a2JGzfEoQ4DTO/Vy8nHR0GCXXByddQap1uCHmLeery4BiIBDMxll76hsSZyxsaPh
 fqYwoHCy4NXn+Hp4zDB5y7NpUNra5xqqUVVtnPlc1RbwL9tuurBPG4QKnxB51lkR
 iUBV340abeqqeJEE7/txsFSB/AEzA+Ff/f/JER/WBfsYL3bwzRxadPqiQA5+QZnt
 TQD7vnTk5ILKBxw9dBV+HjwDXf5meGOJDuPKM1/tspT572U3K5rrmlI1ZKpnU5fB
 /2ZwjQviBNhq
 =uE2l
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "Fix three issues related to the handling of wakeup events signaled
  through the ACPI SCI while suspended to idle (Rafael Wysocki) and
  unexport an internal cpufreq variable (Yangtao Li)"

* tag 'pm-5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PM: s2idle: Prevent spurious SCIs from waking up the system
  ACPICA: Introduce acpi_any_gpe_status_set()
  ACPI: PM: s2idle: Avoid possible race related to the EC GPE
  ACPI: EC: Fix flushing of pending work
  cpufreq: Make cpufreq_global_kobject static
2020-02-14 12:34:30 -08:00
Frederic Weisbecker
490f561b78 context-tracking: Introduce CONFIG_HAVE_TIF_NOHZ
A few archs (x86, arm, arm64) don't rely anymore on TIF_NOHZ to call
into context tracking on user entry/exit but instead use static keys
(or not) to optimize those calls. Ideally every arch should migrate to
that behaviour in the long run.

Settle a config option to let those archs remove their TIF_NOHZ
definitions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
2020-02-14 16:05:04 +01:00
Rafael J. Wysocki
814d51f888 PM: QoS: Make CPU latency QoS depend on CONFIG_CPU_IDLE
Because cpuidle is the only user of the effective constraint coming
from the CPU latency QoS, add #ifdef CONFIG_CPU_IDLE around that code
to avoid building it unnecessarily.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-14 10:37:27 +01:00
Rafael J. Wysocki
fe52de36dc PM: QoS: Update file information comments
Update the file information comments in include/linux/pm_qos.h
and kernel/power/qos.c by adding titles along with copyright and
authors information to them and changing the qos.c description to
better reflect its contents (outdated information is dropped from
it in particular).

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-14 10:37:26 +01:00
Rafael J. Wysocki
67b06ba018 PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY and rename related functions
Drop the PM QoS classes enum including PM_QOS_CPU_DMA_LATENCY,
drop the wrappers around pm_qos_request(), pm_qos_request_active(),
and pm_qos_add/update/remove_request() introduced previously, rename
these functions, respectively, to cpu_latency_qos_limit(),
cpu_latency_qos_request_active(), and
cpu_latency_qos_add/update/remove_request(), and update their
kerneldoc comments.  [While at it, drop some useless comments from
these functions.]

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-14 10:37:26 +01:00
Thomas Gleixner
cba6437a18 genirq/proc: Reject invalid affinity masks (again)
Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code,
which catches cases where the affinity setting is not done on the CPU which
is the current target of the interrupt, triggers during CPU hotplug stress
testing.

It turns out that the warning which was added with the commit addressing
the MSI affinity race unearthed yet another long standing bug.

If user space writes a bogus affinity mask, i.e. it contains no online CPUs,
then it calls irq_select_affinity_usr(). This was introduced for ALPHA in

  eee45269b0 ("[PATCH] Alpha: convert to generic irq framework (generic part)")

and subsequently made available for all architectures in

  1840475676 ("genirq: Expose default irq affinity mask (take 3)")

which introduced the circumvention of the affinity setting restrictions for
interrupt which cannot be moved in process context.

The whole exercise is bogus in various aspects:

  1) If the interrupt is already started up then there is absolutely
     no point to honour a bogus interrupt affinity setting from user
     space. The interrupt is already assigned to an online CPU and it
     does not make any sense to reassign it to some other randomly
     chosen online CPU.

  2) If the interupt is not yet started up then there is no point
     either. A subsequent startup of the interrupt will invoke
     irq_setup_affinity() anyway which will chose a valid target CPU.

So the only correct solution is to just return -EINVAL in case user space
wrote an affinity mask which does not contain any online CPUs, except for
ALPHA which has it's own magic sauce for this.

Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Qian Cai <cai@lca.pw>
Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de
2020-02-14 09:43:17 +01:00
Rafael J. Wysocki
e033b6c175 PM: QoS: Adjust pm_qos_request() signature and reorder pm_qos.h
Change the return type of pm_qos_request() to be the same as the
one of pm_qos_read_value() called by it internally and stop exporting
it to modules (because its only caller, cpuidle, is not modular).

Also move the pm_qos_read_value() header away from the CPU latency
QoS API function headers in pm_qos.h (because it technically does
not belong to that API).

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:46 +01:00
Rafael J. Wysocki
333eed7d20 PM: QoS: Simplify definitions of CPU latency QoS trace events
Modify the definitions of the CPU latency QoS trace events to take
one argument (since PM_QOS_CPU_DMA_LATENCY is always passed as the
pm_qos_class argument to them) and update the documentation of them
accordingly (while at it, make it explicitly mention CPU latency QoS
and relocate it after the device PM QoS trace events documentation).

The names and output format of the trace events do not change to
preserve user space compatibility.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:39 +01:00
Rafael J. Wysocki
2552d35201 PM: QoS: Rename things related to the CPU latency QoS
First, rename PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE to
PM_QOS_CPU_LATENCY_DEFAULT_VALUE and update all of the code
referring to it accordingly.

Next, rename cpu_dma_constraints to cpu_latency_constraints, move
the definition of it closer to the functions referring to it and
update all of them accordingly.  [While at it, add a comment to mark
the start of the code related to the CPU latency QoS.]

Finally, rename the pm_qos_power_*() family of functions and
pm_qos_power_fops to cpu_latency_qos_*() and cpu_latency_qos_fops,
respectively, and update the definition of cpu_latency_qos_miscdev.
[While at it, update the miscdev interface code start comment.]

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:33 +01:00
Rafael J. Wysocki
3a4a004222 PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY notifier chain
Notice that pm_qos_remove_notifier() is not used at all and the only
caller of pm_qos_add_notifier() is the cpuidle core, which only needs
the PM_QOS_CPU_DMA_LATENCY notifier to invoke wake_up_all_idle_cpus()
upon changes of the PM_QOS_CPU_DMA_LATENCY target value.

First, to ensure that wake_up_all_idle_cpus() will be called
whenever the PM_QOS_CPU_DMA_LATENCY target value changes, modify the
pm_qos_add/update/remove_request() family of functions to check if
the effective constraint for the PM_QOS_CPU_DMA_LATENCY has changed
and call wake_up_all_idle_cpus() directly in that case.

Next, drop the PM_QOS_CPU_DMA_LATENCY notifier from cpuidle as it is
not necessary any more.

Finally, drop both pm_qos_add_notifier() and pm_qos_remove_notifier(),
as they have no callers now, along with cpu_dma_lat_notifier which is
only used by them.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:27 +01:00
Rafael J. Wysocki
02c92a3789 PM: QoS: Redefine struct pm_qos_request and drop struct pm_qos_object
First, change the definition of struct pm_qos_request so that it
contains a struct pm_qos_constraints pointer (called "qos") instead
of a PM QoS class number (in preparation for dropping the PM QoS
classes concept altogether going forward) and move its definition
(along with the definition of struct pm_qos_flags_request that does
not change) after the definition of struct pm_qos_constraints.

Next, drop the definition of struct pm_qos_object and the null_pm_qos
and cpu_dma_pm_qos variables of that type along with pm_qos_array[]
holding pointers to them and change the code to refer to the
pm_qos_constraints structure directly or to use the new qos pointer
in struct pm_qos_request for that instead of going through
pm_qos_array[] to access it.  Also update kerneldoc comments that
mention pm_qos_class to refer to PM_QOS_CPU_DMA_LATENCY directly
instead.

Finally, drop register_pm_qos_misc(), introduce cpu_latency_qos_miscdev
(with the name field set to "cpu_dma_latency") to implement the
CPU latency QoS interface in /dev/ and register it directly from
pm_qos_power_init().

After these changes the notion of PM QoS classes remains only in the
API (in the form of redundant function parameters that are ignored)
and in the definitions of PM QoS trace events.

While at it, some redundant local variables are dropped etc.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:19 +01:00
Rafael J. Wysocki
299a229830 PM: QoS: Clean up misc device file operations
Reorder the code to avoid using extra function header declarations
for the pm_qos_power_*() family of functions and drop those
declarations.

Also clean up the internals of those functions to consolidate checks,
avoid using redundant local variables and similar.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:13 +01:00
Rafael J. Wysocki
63cffc0534 PM: QoS: Drop iterations over global QoS classes
After commit c3082a674f ("PM: QoS: Get rid of unused flags") the
only global PM QoS class in use is PM_QOS_CPU_DMA_LATENCY, so it
does not really make sense to iterate over global QoS classes
anywhere, since there is only one.

Remove iterations over global QoS classes from the code and use
PM_QOS_CPU_DMA_LATENCY as the target class directly where needed.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:07 +01:00
Rafael J. Wysocki
dcd70ca1a3 PM: QoS: Clean up pm_qos_read_value() and pm_qos_get/set_value()
Move the definition of pm_qos_read_value() before the one of
pm_qos_get_value() and add a kerneldoc comment to it (as it is
not static).

Also replace the BUG() in pm_qos_get_value() with WARN() (to
prevent the kernel from crashing if an unknown PM QoS type is
used by mistake) and drop the comment next to it that is not
necessary any more.

Additionally, drop the unnecessary inline modifier from the header
of pm_qos_set_value().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:26:02 +01:00
Rafael J. Wysocki
7b35370b2e PM: QoS: Clean up pm_qos_update_target() and pm_qos_update_flags()
Clean up the pm_qos_update_target() function:
 * Update its kerneldoc comment.
 * Drop the redundant ret local variable from it.
 * Reorder definitions of local variables in it.
 * Update a comment in it.

Also update the kerneldoc comment of pm_qos_update_flags() (e.g.
notifiers are not called by it any more) and add one emtpy line
to its body (for more visual clarity).

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:25:55 +01:00
Rafael J. Wysocki
87ad735679 PM: QoS: Drop the PM_QOS_SUM QoS type
The PM_QOS_SUM QoS type is not used, so drop it along with the
code referring to it in pm_qos_get_value() and the related local
variables in there.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:25:48 +01:00
Rafael J. Wysocki
5a7ea52b6f PM: QoS: Drop pm_qos_update_request_timeout()
The pm_qos_update_request_timeout() function is not called from
anywhere, so drop it along with the work member in struct
pm_qos_request needed by it.

Also drop the useless pm_qos_update_request_timeout trace event
that is only triggered by that function (so it never triggers at
all) and update the trace events documentation accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
2020-02-13 11:25:40 +01:00
Christian Brauner
ef2c41cf38 clone3: allow spawning processes into cgroups
This adds support for creating a process in a different cgroup than its
parent. Callers can limit and account processes and threads right from
the moment they are spawned:
- A service manager can directly spawn new services into dedicated
  cgroups.
- A process can be directly created in a frozen cgroup and will be
  frozen as well.
- The initial accounting jitter experienced by process supervisors and
  daemons is eliminated with this.
- Threaded applications or even thread implementations can choose to
  create a specific cgroup layout where each thread is spawned
  directly into a dedicated cgroup.

This feature is limited to the unified hierarchy. Callers need to pass
a directory file descriptor for the target cgroup. The caller can
choose to pass an O_PATH file descriptor. All usual migration
restrictions apply, i.e. there can be no processes in inner nodes. In
general, creating a process directly in a target cgroup adheres to all
migration restrictions.

One of the biggest advantages of this feature is that CLONE_INTO_GROUP does
not need to grab the write side of the cgroup cgroup_threadgroup_rwsem.
This global lock makes moving tasks/threads around super expensive. With
clone3() this lock is avoided.

Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:57:51 -05:00
Christian Brauner
f3553220d4 cgroup: add cgroup_may_write() helper
Add a cgroup_may_write() helper which we can use in the
CLONE_INTO_CGROUP patch series to verify that we can write to the
destination cgroup.

Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:57:51 -05:00
Christian Brauner
5a5cf5cb30 cgroup: refactor fork helpers
This refactors the fork helpers so they can be easily modified in the
next patches. The patch just moves the cgroup threadgroup rwsem grab and
release into the helpers. They don't need to be directly exposed in fork.c.

Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:57:51 -05:00
Christian Brauner
17703097f3 cgroup: add cgroup_get_from_file() helper
Add a helper cgroup_get_from_file(). The helper will be used in
subsequent patches to retrieve a cgroup while holding a reference to the
struct file it was taken from.

Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:57:51 -05:00
Christian Brauner
6df970e4f5 cgroup: unify attach permission checking
The core codepaths to check whether a process can be attached to a
cgroup are the same for threads and thread-group leaders. Only a small
piece of code verifying that source and destination cgroup are in the
same domain differentiates the thread permission checking from
thread-group leader permission checking.
Since cgroup_migrate_vet_dst() only matters cgroup2 - it is a noop on
cgroup1 - we can move it out of cgroup_attach_task().
All checks can now be consolidated into a new helper
cgroup_attach_permissions() callable from both cgroup_procs_write() and
cgroup_threads_write().

Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:57:51 -05:00
Prateek Sood
a49e4629b5 cpuset: Make cpuset hotplug synchronous
Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug
path. For memory hotplug path it still gets queued as a work item.

Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug
path, it is not required to wait for cpuset hotplug while thawing
processes.

Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:13:47 -05:00
Madhuparna Bhowmik
3010c5b9f5 cgroup.c: Use built-in RCU list checking
list_for_each_entry_rcu has built-in RCU and lock checking.
Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when  CONFIG_PROVE_RCU_LIST is enabled
by default.

Even though the function css_next_child() already checks if
cgroup_mutex or rcu_read_lock() is held using
cgroup_assert_mutex_or_rcu_locked(), there is a need to pass
cond to list_for_each_entry_rcu() to avoid false positive
lockdep warning.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:12:22 -05:00
Michal Koutný
f43caa2adc cgroup: Clean up css_set task traversal
css_task_iter stores pointer to head of each iterable list, this dates
back to commit 0f0a2b4fa6 ("cgroup: reorganize css_task_iter") when we
did not store cur_cset. Let us utilize list heads directly in cur_cset
and streamline css_task_iter_advance_css_set a bit. This is no
intentional function change.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:11:52 -05:00
Michal Koutný
9c974c7724 cgroup: Iterate tasks that did not finish do_exit()
PF_EXITING is set earlier than actual removal from css_set when a task
is exitting. This can confuse cgroup.procs readers who see no PF_EXITING
tasks, however, rmdir is checking against css_set membership so it can
transitionally fail with EBUSY.

Fix this by listing tasks that weren't unlinked from css_set active
lists.
It may happen that other users of the task iterator (without
CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
is equal to the state before commit c03cd7738a ("cgroup: Include dying
leaders with live threads in PROCS iterations") but it may be reviewed
later.

Reported-by: Suren Baghdasaryan <surenb@google.com>
Fixes: c03cd7738a ("cgroup: Include dying leaders with live threads in PROCS iterations")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:02:53 -05:00
Vasily Averin
2d4ecb030d cgroup: cgroup_procs_next should increase position index
If seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output:

1) dd bs=1 skip output of each 2nd elements
$ dd if=/sys/fs/cgroup/cgroup.procs bs=8 count=1
2
3
4
5
1+0 records in
1+0 records out
8 bytes copied, 0,000267297 s, 29,9 kB/s
[test@localhost ~]$ dd if=/sys/fs/cgroup/cgroup.procs bs=1 count=8
2
4 <<< NB! 3 was skipped
6 <<<    ... and 5 too
8 <<<    ... and 7
8+0 records in
8+0 records out
8 bytes copied, 5,2123e-05 s, 153 kB/s

 This happen because __cgroup_procs_start() makes an extra
 extra cgroup_procs_next() call

2) read after lseek beyond end of file generates whole last line.
3) read after lseek into middle of last line generates
expected rest of last line and unexpected whole line once again.

Additionally patch removes an extra position index changes in
__cgroup_procs_start()

Cc: stable@vger.kernel.org
https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 17:00:04 -05:00
Vasily Averin
db8dd96972 cgroup-v1: cgroup_pidlist_next should update position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

 # mount | grep cgroup
 # dd if=/mnt/cgroup.procs bs=1  # normal output
...
1294
1295
1296
1304
1382
584+0 records in
584+0 records out
584 bytes copied

dd: /mnt/cgroup.procs: cannot skip to specified offset
83  <<< generates end of last line
1383  <<< ... and whole last line once again
0+1 records in
0+1 records out
8 bytes copied

dd: /mnt/cgroup.procs: cannot skip to specified offset
1386  <<< generates last line anyway
0+1 records in
0+1 records out
5 bytes copied

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-02-12 16:53:35 -05:00
Rafael J. Wysocki
3489662042 PM: QoS: Drop debugfs interface
After commit c3082a674f ("PM: QoS: Get rid of unused flags") the
only global PM QoS class in use is PM_QOS_CPU_DMA_LATENCY and the
existing PM QoS debugfs interface has become overly complicated (as
it takes other potentially possible PM QoS classes that are not there
any more into account).  It is also not particularly useful (the
"type" of the PM_QOS_CPU_DMA_LATENCY is known, its aggregate value
can be read from /dev/cpu_dma_latency and the number of requests in
the queue does not really matter) and there are no known users
depending on it.  Moreover, there are dedicated trace events that
can be used for tracking PM QoS usage with much higher precision.

For these reasons, drop the PM QoS debugfs interface altogether.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
2020-02-12 10:28:42 +01:00
Linus Torvalds
61a7595403 Various fixes:
- Fix an uninitialized variable
 
  - Fix compile bug to bootconfig userspace tool (in tools directory)
 
  - Suppress some error messages of bootconfig userspace tool
 
  - Remove unneded CONFIG_LIBXBC from bootconfig
 
  - Allocate bootconfig xbc_nodes dynamically.
    To ease complaints about taking up static memory at boot up
 
  - Use of parse_args() to parse bootconfig instead of strstr() usage
    Prevents issues of double quotes containing the interested string
 
  - Fix missing ring_buffer_nest_end() on synthetic event error path
 
  - Return zero not -EINVAL on soft disabled synthetic event
    (soft disabling must be the same as hard disabling, which returns zero)
 
  - Consolidate synthetic event code (remove duplicate code)
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXkMTwRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnJQAQD5aKiD6jx+zGtLsFTuZMcEGvhhUuJ6
 oUaSvhKO3UqezwD/V5avKuuC0wWt//gOWDY+0+4QNjmvn1GLnaNYohs6fg0=
 =NLT9
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various fixes:

   - Fix an uninitialized variable

   - Fix compile bug to bootconfig userspace tool (in tools directory)

   - Suppress some error messages of bootconfig userspace tool

   - Remove unneded CONFIG_LIBXBC from bootconfig

   - Allocate bootconfig xbc_nodes dynamically. To ease complaints about
     taking up static memory at boot up

   - Use of parse_args() to parse bootconfig instead of strstr() usage
     Prevents issues of double quotes containing the interested string

   - Fix missing ring_buffer_nest_end() on synthetic event error path

   - Return zero not -EINVAL on soft disabled synthetic event (soft
     disabling must be the same as hard disabling, which returns zero)

   - Consolidate synthetic event code (remove duplicate code)"

* tag 'trace-v5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Consolidate trace() functions
  tracing: Don't return -EINVAL when tracing soft disabled synth events
  tracing: Add missing nest end to synth_event_trace_start() error case
  tools/bootconfig: Suppress non-error messages
  bootconfig: Allocate xbc_nodes array dynamically
  bootconfig: Use parse_args() to find bootconfig and '--'
  tracing/kprobe: Fix uninitialized variable bug
  bootconfig: Remove unneeded CONFIG_LIBXBC
  tools/bootconfig: Fix wrong __VA_ARGS__ usage
2020-02-11 16:39:18 -08:00
Kan Liang
bbfd5e4fab perf/core: Add new branch sample type for HW index of raw branch records
The low level index is the index in the underlying hardware buffer of
the most recently captured taken branch which is always saved in
branch_entries[0]. It is very useful for reconstructing the call stack.
For example, in Intel LBR call stack mode, the depth of reconstructed
LBR call stack limits to the number of LBR registers. With the low level
index information, perf tool may stitch the stacks of two samples. The
reconstructed LBR call stack can break the HW limitation.

Add a new branch sample type to retrieve low level index of raw branch
records. The low level index is between -1 (unknown) and max depth which
can be retrieved in /sys/devices/cpu/caps/branches.

Only when the new branch sample type is set, the low level index
information is dumped into the PERF_SAMPLE_BRANCH_STACK output.
Perf tool should check the attr.branch_sample_type, and apply the
corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
Otherwise, some user case may be broken. For example, users may parse a
perf.data, which include the new branch sample type, with an old version
perf tool (without the check). Users probably get incorrect information
without any warning.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com
2020-02-11 13:23:49 +01:00
Davidlohr Bueso
41f0e29190 locking/percpu-rwsem: Add might_sleep() for writer locking
We are missing this annotation in percpu_down_write(). Correct
this.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20200108013305.7732-1-dave@stgolabs.net
2020-02-11 13:11:02 +01:00
Davidlohr Bueso
ac8dec4209 locking/percpu-rwsem: Fold __percpu_up_read()
Now that __percpu_up_read() is only ever used from percpu_up_read()
merge them, it's a small function.

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20200131151540.212415454@infradead.org
2020-02-11 13:10:58 +01:00
Peter Zijlstra
bcba67cd80 locking/rwsem: Remove RWSEM_OWNER_UNKNOWN
Remove the now unused RWSEM_OWNER_UNKNOWN hack. This hack breaks
PREEMPT_RT and getting rid of it was the entire motivation for
re-writing the percpu rwsem.

The biggest problem is that it is fundamentally incompatible with any
form of Priority Inheritance, any exclusively held lock must have a
distinct owner.

Requested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200204092228.GP14946@hirez.programming.kicks-ass.net
2020-02-11 13:10:57 +01:00
Peter Zijlstra
7f26482a87 locking/percpu-rwsem: Remove the embedded rwsem
The filesystem freezer uses percpu-rwsem in a way that is effectively
write_non_owner() and achieves this with a few horrible hacks that
rely on the rwsem (!percpu) implementation.

When PREEMPT_RT replaces the rwsem implementation with a PI aware
variant this comes apart.

Remove the embedded rwsem and implement it using a waitqueue and an
atomic_t.

 - make readers_block an atomic, and use it, with the waitqueue
   for a blocking test-and-set write-side.

 - have the read-side wait for the 'lock' state to clear.

Have the waiters use FIFO queueing and mark them (reader/writer) with
a new WQ_FLAG. Use a custom wake_function to wake either a single
writer or all readers until a writer.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200204092403.GB14879@hirez.programming.kicks-ass.net
2020-02-11 13:10:56 +01:00
Peter Zijlstra
75ff64572e locking/percpu-rwsem: Extract __percpu_down_read_trylock()
In preparation for removing the embedded rwsem and building a custom
lock, extract the read-trylock primitive.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200131151540.098485539@infradead.org
2020-02-11 13:10:55 +01:00
Peter Zijlstra
71365d4023 locking/percpu-rwsem: Move __this_cpu_inc() into the slowpath
As preparation to rework __percpu_down_read() move the
__this_cpu_inc() into it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200131151540.041600199@infradead.org
2020-02-11 13:10:54 +01:00
Peter Zijlstra
206c98ffbe locking/percpu-rwsem: Convert to bool
Use bool where possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200131151539.984626569@infradead.org
2020-02-11 13:10:54 +01:00
Peter Zijlstra
1751060e25 locking/percpu-rwsem, lockdep: Make percpu-rwsem use its own lockdep_map
As preparation for replacing the embedded rwsem, give percpu-rwsem its
own lockdep_map.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200131151539.927625541@infradead.org
2020-02-11 13:10:53 +01:00
Waiman Long
810507fe6f locking/lockdep: Reuse freed chain_hlocks entries
Once a lock class is zapped, all the lock chains that include the zapped
class are essentially useless. The lock_chain structure itself can be
reused, but not the corresponding chain_hlocks[] entries. Over time,
we will run out of chain_hlocks entries while there are still plenty
of other lockdep array entries available.

To fix this imbalance, we have to make chain_hlocks entries reusable
just like the others. As the freed chain_hlocks entries are in blocks of
various lengths. A simple bitmap like the one used in the other reusable
lockdep arrays isn't applicable. Instead the chain_hlocks entries are
put into bucketed lists (MAX_CHAIN_BUCKETS) of chain blocks.  Bucket 0
is the variable size bucket which houses chain blocks of size larger than
MAX_CHAIN_BUCKETS sorted in decreasing size order.  Initially, the whole
array is in one chain block (the primordial chain block) in bucket 0.

The minimum size of a chain block is 2 chain_hlocks entries. That will
be the minimum allocation size. In other word, allocation requests
for one chain_hlocks entry will cause 2-entry block to be returned and
hence 1 entry will be wasted.

Allocation requests for the chain_hlocks are fulfilled first by looking
for chain block of matching size. If not found, the first chain block
from bucket[0] (the largest one) is split. That can cause hlock entries
fragmentation and reduce allocation efficiency if a chain block of size >
MAX_CHAIN_BUCKETS is ever zapped and put back to after the primordial
chain block. So the MAX_CHAIN_BUCKETS must be large enough that this
should seldom happen.

By reusing the chain_hlocks entries, we are able to handle workloads
that add and zap a lot of lock classes without the risk of running out
of chain_hlocks entries as long as the total number of outstanding lock
classes at any time remain within a reasonable limit.

Two new tracking counters, nr_free_chain_hlocks & nr_large_chain_blocks,
are added to track the total number of chain_hlocks entries in the
free bucketed lists and the number of large chain blocks in buckets[0]
respectively. The nr_free_chain_hlocks replaces nr_chain_hlocks.

The nr_large_chain_blocks counter enables to see if we should increase
the number of buckets (MAX_CHAIN_BUCKETS) available so as to avoid to
avoid the fragmentation problem in bucket[0].

An internal nfsd test that ran for more than an hour and kept on
loading and unloading kernel modules could cause the following message
to be displayed.

  [ 4318.443670] BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!

The patched kernel was able to complete the test with a lot of free
chain_hlocks entries to spare:

  # cat /proc/lockdep_stats
     :
   dependency chains:                   18867 [max: 65536]
   dependency chain hlocks:             74926 [max: 327680]
   dependency chain hlocks lost:            0
     :
   zapped classes:                       1541
   zapped lock chains:                  56765
   large chain blocks:                      1

By changing MAX_CHAIN_BUCKETS to 3 and add a counter for the size of the
largest chain block. The system still worked and We got the following
lockdep_stats data:

   dependency chains:                   18601 [max: 65536]
   dependency chain hlocks used:        73133 [max: 327680]
   dependency chain hlocks lost:            0
     :
   zapped classes:                       1541
   zapped lock chains:                  56702
   large chain blocks:                  45165
   large chain block size:              20165

By running the test again, I was indeed able to cause chain_hlocks
entries to get lost:

   dependency chain hlocks used:        74806 [max: 327680]
   dependency chain hlocks lost:          575
     :
   large chain blocks:                  48737
   large chain block size:                  7

Due to the fragmentation, it is possible that the
"MAX_LOCKDEP_CHAIN_HLOCKS too low!" error can happen even if a lot of
of chain_hlocks entries appear to be free.

Fortunately, a MAX_CHAIN_BUCKETS value of 16 should be big enough that
few variable sized chain blocks, other than the initial one, should
ever be present in bucket 0.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200206152408.24165-7-longman@redhat.com
2020-02-11 13:10:52 +01:00
Waiman Long
797b82eb90 locking/lockdep: Track number of zapped lock chains
Add a new counter nr_zapped_lock_chains to track the number lock chains
that have been removed.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200206152408.24165-6-longman@redhat.com
2020-02-11 13:10:51 +01:00
Waiman Long
836bd74b59 locking/lockdep: Throw away all lock chains with zapped class
If a lock chain contains a class that is zapped, the whole lock chain is
likely to be invalid. If the zapped class is at the end of the chain,
the partial chain without the zapped class should have been stored
already as the current code will store all its predecessor chains. If
the zapped class is somewhere in the middle, there is no guarantee that
the partial chain will actually happen. It may just clutter up the hash
and make searching slower. I would rather prefer storing the chain only
when it actually happens.

So just dump the corresponding chain_hlocks entries for now. A latter
patch will try to reuse the freed chain_hlocks entries.

This patch also changes the type of nr_chain_hlocks to unsigned integer
to be consistent with the other counters.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200206152408.24165-5-longman@redhat.com
2020-02-11 13:10:50 +01:00
Waiman Long
1d44bcb4fd locking/lockdep: Track number of zapped classes
The whole point of the lockdep dynamic key patch is to allow unused
locks to be removed from the lockdep data buffers so that existing
buffer space can be reused. However, there is no way to find out how
many unused locks are zapped and so we don't know if the zapping process
is working properly.

Add a new nr_zapped_classes counter to track that and show it in
/proc/lockdep_stats.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200206152408.24165-4-longman@redhat.com
2020-02-11 13:10:49 +01:00
Waiman Long
b9875e9882 locking/lockdep: Display irq_context names in /proc/lockdep_chains
Currently, the irq_context field of a lock chains displayed in
/proc/lockdep_chains is just a number. It is likely that many people
may not know what a non-zero number means. To make the information more
useful, print the actual irq names ("softirq" and "hardirq") instead.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200206152408.24165-3-longman@redhat.com
2020-02-11 13:10:48 +01:00
Waiman Long
b3b9c187dc locking/lockdep: Decrement IRQ context counters when removing lock chain
There are currently three counters to track the IRQ context of a lock
chain - nr_hardirq_chains, nr_softirq_chains and nr_process_chains.
They are incremented when a new lock chain is added, but they are
not decremented when a lock chain is removed. That causes some of the
statistic counts reported by /proc/lockdep_stats to be incorrect.
IRQ
Fix that by decrementing the right counter when a lock chain is removed.

Since inc_chains() no longer accesses hardirq_context and softirq_context
directly, it is moved out from the CONFIG_TRACE_IRQFLAGS conditional
compilation block.

Fixes: a0b0fd53e1 ("locking/lockdep: Free lock classes that are no longer in use")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200206152408.24165-2-longman@redhat.com
2020-02-11 13:10:48 +01:00
Randy Dunlap
e9f5490c35 sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
Fix kernel-doc warning in kernel/sched/fair.c, caused by a recent
function parameter removal:

  ../kernel/sched/fair.c:3526: warning: Excess function parameter 'flags' description in 'attach_entity_load_avg'

Fixes: a4f9a0e51b ("sched/fair: Remove redundant call to cpufreq_update_util()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/cbe964e4-6879-fd08-41c9-ef1917414af4@infradead.org
2020-02-11 13:05:10 +01:00
Nathan Chancellor
5661dd95a2 printk: Convert a use of sprintf to snprintf in console_unlock
When CONFIG_PRINTK is disabled (e.g. when building allnoconfig), clang
warns:

../kernel/printk/printk.c:2416:10: warning: 'sprintf' will always
overflow; destination buffer has size 0, but format string expands to at
least 33 [-Wfortify-source]
			len = sprintf(text,
			      ^
1 warning generated.

It is not wrong; text has a zero size when CONFIG_PRINTK is disabled
because LOG_LINE_MAX and PREFIX_MAX are both zero. Change to snprintf so
that this case is explicitly handled without any risk of overflow.

Link: https://github.com/ClangBuiltLinux/linux/issues/846
Link: 6d485ff455
Link: http://lkml.kernel.org/r/20200130221644.2273-1-natechancellor@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Cc: clang-built-linux@googlegroups.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-11 13:01:00 +01:00
Madhuparna Bhowmik
4104a562e0 sched/core: Annotate curr pointer in rq with __rcu
This patch fixes the following sparse warnings in sched/core.c
and sched/membarrier.c:

  kernel/sched/core.c:2372:27: error: incompatible types in comparison expression
  kernel/sched/core.c:4061:17: error: incompatible types in comparison expression
  kernel/sched/core.c:6067:9: error: incompatible types in comparison expression
  kernel/sched/membarrier.c:108:21: error: incompatible types in comparison expression
  kernel/sched/membarrier.c:177:21: error: incompatible types in comparison expression
  kernel/sched/membarrier.c:243:21: error: incompatible types in comparison expression

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200201125803.20245-1-madhuparnabhowmik10@gmail.com
2020-02-11 13:00:37 +01:00
Suren Baghdasaryan
6fcca0fa48 sched/psi: Fix OOB write when writing 0 bytes to PSI files
Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
2020-02-11 13:00:02 +01:00
Andy Shevchenko
ed31685c96 console: Introduce ->exit() callback
Some consoles might require special operations on unregistering.
For instance, serial console, when registered in the kernel,
keeps power on for entire time, until it gets unregistered.
Example of use:

	->setup(console):
		pm_runtime_get(...);

	->exit(console):
		pm_runtime_put(...);

For such cases to have a balance we would provide ->exit() callback.

Link: http://lkml.kernel.org/r/20200203133130.11591-7-andriy.shevchenko@linux.intel.com
To: linux-kernel@vger.kernel.org
To: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-11 10:44:22 +01:00
Andy Shevchenko
e78bedbd42 console: Don't notify user space when unregister non-listed console
If console is not on the list then there is nothing for us to do
and sysfs notify is pointless.

Note, that nr_ext_console_drivers is being changed only for listed
consoles.

Suggested-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Link: http://lkml.kernel.org/r/20200203133130.11591-6-andriy.shevchenko@linux.intel.com
To: linux-kernel@vger.kernel.org
To: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-11 10:44:17 +01:00
Andy Shevchenko
bb72e3981d console: Avoid positive return code from unregister_console()
There are only two callers that use the returned code from
unregister_console():

  - unregister_early_console() in arch/m68k/kernel/early_printk.c
  - kgdb_unregister_nmi_console() in drivers/tty/serial/kgdb_nmi.c

They both expect to get "0" on success and a non-zero value on error.
But the current behavior is confusing and buggy:

  - _braille_unregister_console() returns "1" on success
  - unregister_console() returns "1" on error

Fix and clean up the behavior:

  - Return success when _braille_unregister_console() succeeded
  - Return a meaningful error code when the console was
    not registered before

Link: http://lkml.kernel.org/r/20200203133130.11591-5-andriy.shevchenko@linux.intel.com
To: linux-kernel@vger.kernel.org
To: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-11 10:44:12 +01:00
Andy Shevchenko
d58ad10122 console: Drop misleading comment
/* find the last or real console */

This comment is misleading. The purpose of the loop is to check
if we are trying to register boot console after a real one has
already been registered. This is already mentioned in a comment
above.

Link: http://lkml.kernel.org/r/20200203133130.11591-4-andriy.shevchenko@linux.intel.com
To: linux-kernel@vger.kernel.org
To: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[pmladek@suse.com: Updated commit message.]
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-11 10:44:02 +01:00
Andy Shevchenko
12825e6ba8 console: Use for_each_console() helper in unregister_console()
We have rather open coded single linked list manipulations where we may
simple use for_each_console() helper with properly set exit conditions.

Replace open coded single-linked list handling with for_each_console()
helper in use.

Link: http://lkml.kernel.org/r/20200203133130.11591-3-andriy.shevchenko@linux.intel.com
To: linux-kernel@vger.kernel.org
To: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-11 10:43:56 +01:00
Andy Shevchenko
caa72c3bc5 console: Drop double check for console_drivers being non-NULL
There is no need to explicitly check for console_drivers to be non-NULL
since for_each_console() does this.

Link: http://lkml.kernel.org/r/20200203133130.11591-2-andriy.shevchenko@linux.intel.com
To: linux-kernel@vger.kernel.org
To: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-02-11 10:43:42 +01:00
Rafael J. Wysocki
e3728b50cd ACPI: PM: s2idle: Avoid possible race related to the EC GPE
It is theoretically possible for the ACPI EC GPE to be set after the
s2idle_ops->wake() called from s2idle_loop() has returned and before
the subsequent pm_wakeup_pending() check is carried out.  If that
happens, the resulting wakeup event will cause the system to resume
even though it may be a spurious one.

To avoid that race, first make the ->wake() callback in struct
platform_s2idle_ops return a bool value indicating whether or not
to let the system resume and rearrange s2idle_loop() to use that
value instad of the direct pm_wakeup_pending() call if ->wake() is
present.

Next, rework acpi_s2idle_wake() to process EC events and check
pm_wakeup_pending() before re-arming the SCI for system wakeup
to prevent it from triggering prematurely and add comments to
that function to explain the rationale for the new code flow.

Fixes: 56b9918490 ("PM: sleep: Simplify suspend-to-idle control flow")
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-02-11 10:11:02 +01:00
Tom Zanussi
7276531d40 tracing: Consolidate trace() functions
Move the checking, buffer reserve and buffer commit code in
synth_event_trace_start/end() into inline functions
__synth_event_trace_start/end() so they can also be used by
synth_event_trace() and synth_event_trace_array(), and then have all
those functions use them.

Also, change synth_event_trace_state.enabled to disabled so it only
needs to be set if the event is disabled, which is not normally the
case.

Link: http://lkml.kernel.org/r/b1f3108d0f450e58192955a300e31d0405ab4149.1581374549.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10 22:00:21 -05:00
Tom Zanussi
0c62f6cd9e tracing: Don't return -EINVAL when tracing soft disabled synth events
There's no reason to return -EINVAL when tracing a synthetic event if
it's soft disabled - treat it the same as if it were hard disabled and
return normally.

Have synth_event_trace() and synth_event_trace_array() just return
normally, and have synth_event_trace_start set the trace state to
disabled and return.

Link: http://lkml.kernel.org/r/df5d02a1625aff97c9866506c5bada6a069982ba.1581374549.git.zanussi@kernel.org

Fixes: 8dcc53ad95 ("tracing: Add synth_event_trace() and related functions")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10 22:00:13 -05:00
Tom Zanussi
d090409abb tracing: Add missing nest end to synth_event_trace_start() error case
If the ring_buffer reserve in synth_event_trace_start() fails, the
matching ring_buffer_nest_end() should be called in the error code,
since nothing else will ever call it in this case.

Link: http://lkml.kernel.org/r/20abc444b3eeff76425f895815380abe7aa53ff8.1581374549.git.zanussi@kernel.org

Fixes: 8dcc53ad95 ("tracing: Add synth_event_trace() and related functions")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10 21:58:19 -05:00
Linus Torvalds
0a679e13ea Merge branch 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "I made a mistake while removing cgroup task list lazy init
  optimization making the root cgroup.procs show entries for the
  init_tasks. The zero entries doesn't cause critical failures but does
  make systemd print out warning messages during boot.

  Fix it by omitting init_tasks as they should be"

* 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: init_tasks shouldn't be linked to the root cgroup
2020-02-10 17:07:05 -08:00
Hongbo Yao
2bf0eb9b3b bpf: Make btf_check_func_type_match() static
Fix the following sparse warning:

kernel/bpf/btf.c:4131:5: warning: symbol 'btf_check_func_type_match' was
not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200210011441.147102-1-yaohongbo@huawei.com
2020-02-11 00:22:47 +01:00
Gustavo A. R. Silva
10f129cb59 tracing/kprobe: Fix uninitialized variable bug
There is a potential execution path in which variable *ret* is returned
without being properly initialized, previously.

Fix this by initializing variable *ret* to 0.

Link: http://lkml.kernel.org/r/20200205223404.GA3379@embeddedor

Addresses-Coverity-ID: 1491142 ("Uninitialized scalar variable")
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-10 12:07:42 -05:00
Steve Grubb
70b3eeed49 audit: CONFIG_CHANGE don't log internal bookkeeping as an event
Common Criteria calls out for any action that modifies the audit trail to
be recorded. That usually is interpreted to mean insertion or removal of
rules. It is not required to log modification of the inode information
since the watch is still in effect. Additionally, if the rule is a never
rule and the underlying file is one they do not want events for, they
get an event for this bookkeeping update against their wishes.

Since no device/inode info is logged at insertion and no device/inode
information is logged on update, there is nothing meaningful being
communicated to the admin by the CONFIG_CHANGE updated_rules event. One
can assume that the rule was not "modified" because it is still watching
the intended target. If the device or inode cannot be resolved, then
audit_panic is called which is sufficient.

The correct resolution is to drop logging config_update events since
the watch is still in effect but just on another unknown inode.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-02-10 10:46:35 -05:00
Mel Gorman
52262ee567 sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
The following XFS commit:

  8ab39f11d9 ("xfs: prevent CIL push holdoff in log recovery")

changed the logic from using bound workqueues to using unbound
workqueues. Functionally this makes sense but it was observed at the
time that the dbench performance dropped quite a lot and CPU migrations
were increased.

The current pattern of the task migration is straight-forward. With XFS,
an IO issuer delegates work to xlog_cil_push_work ()on an unbound kworker.
This runs on a nearby CPU and on completion, dbench wakes up on its old CPU
as it is still idle and no migration occurs. dbench then queues the real
IO on the blk_mq_requeue_work() work item which runs on a bound kworker
which is forced to run on the same CPU as dbench. When IO completes,
the bound kworker wakes dbench but as the kworker is a bound but,
real task, the CPU is not considered idle and dbench gets migrated by
select_idle_sibling() to a new CPU. dbench may ping-pong between two CPUs
for a while but ultimately it starts a round-robin of all CPUs sharing
the same LLC. High-frequency migration on each IO completion has poor
performance overall. It has negative implications both in commication
costs and power management. mpstat confirmed that at low thread counts
that all CPUs sharing an LLC has low level of activity.

Note that even if the CIL patch was reverted, there still would
be migrations but the impact is less noticeable. It turns out that
individually the scheduler, XFS, blk-mq and workqueues all made sensible
decisions but in combination, the overall effect was sub-optimal.

This patch special cases the IO issue/completion pattern and allows
a bound kworker waker and a task wakee to stack on the same CPU if
there is a strong chance they are directly related. The expectation
is that the kworker is likely going back to sleep shortly. This is not
guaranteed as the IO could be queued asynchronously but there is a very
strong relationship between the task and kworker in this case that would
justify stacking on the same CPU instead of migrating. There should be
few concerns about kworker starvation given that the special casing is
only when the kworker is the waker.

DBench on XFS
MMTests config: io-dbench4-async modified to run on a fresh XFS filesystem

UMA machine with 8 cores sharing LLC
                          5.5.0-rc7              5.5.0-rc7
                  tipsched-20200124           kworkerstack
Amean     1        22.63 (   0.00%)       20.54 *   9.23%*
Amean     2        25.56 (   0.00%)       23.40 *   8.44%*
Amean     4        28.63 (   0.00%)       27.85 *   2.70%*
Amean     8        37.66 (   0.00%)       37.68 (  -0.05%)
Amean     64      469.47 (   0.00%)      468.26 (   0.26%)
Stddev    1         1.00 (   0.00%)        0.72 (  28.12%)
Stddev    2         1.62 (   0.00%)        1.97 ( -21.54%)
Stddev    4         2.53 (   0.00%)        3.58 ( -41.19%)
Stddev    8         5.30 (   0.00%)        5.20 (   1.92%)
Stddev    64       86.36 (   0.00%)       94.53 (  -9.46%)

NUMA machine, 48 CPUs total, 24 CPUs share cache
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Amean     1         58.69 (   0.00%)       30.21 *  48.53%*
Amean     2         60.90 (   0.00%)       35.29 *  42.05%*
Amean     4         66.77 (   0.00%)       46.55 *  30.28%*
Amean     8         81.41 (   0.00%)       68.46 *  15.91%*
Amean     16       113.29 (   0.00%)      107.79 *   4.85%*
Amean     32       199.10 (   0.00%)      198.22 *   0.44%*
Amean     64       478.99 (   0.00%)      477.06 *   0.40%*
Amean     128     1345.26 (   0.00%)     1372.64 *  -2.04%*
Stddev    1          2.64 (   0.00%)        4.17 ( -58.08%)
Stddev    2          4.35 (   0.00%)        5.38 ( -23.73%)
Stddev    4          6.77 (   0.00%)        6.56 (   3.00%)
Stddev    8         11.61 (   0.00%)       10.91 (   6.04%)
Stddev    16        18.63 (   0.00%)       19.19 (  -3.01%)
Stddev    32        38.71 (   0.00%)       38.30 (   1.06%)
Stddev    64       100.28 (   0.00%)       91.24 (   9.02%)
Stddev    128      186.87 (   0.00%)      160.34 (  14.20%)

Dbench has been modified to report the time to complete a single "load
file". This is a more meaningful metric for dbench that a throughput
metric as the benchmark makes many different system calls that are not
throughput-related

Patch shows a 9.23% and 48.53% reduction in the time to process a load
file with the difference partially explained by the number of CPUs sharing
a LLC. In a separate run, task migrations were almost eliminated by the
patch for low client counts. In case people have issue with the metric
used for the benchmark, this is a comparison of the throughputs as
reported by dbench on the NUMA machine.

dbench4 Throughput (misleading but traditional)
                           5.5.0-rc7              5.5.0-rc7
                   tipsched-20200124      kworkerstack-v1r2
Hmean     1        321.41 (   0.00%)      617.82 *  92.22%*
Hmean     2        622.87 (   0.00%)     1066.80 *  71.27%*
Hmean     4       1134.56 (   0.00%)     1623.74 *  43.12%*
Hmean     8       1869.96 (   0.00%)     2212.67 *  18.33%*
Hmean     16      2673.11 (   0.00%)     2806.13 *   4.98%*
Hmean     32      3032.74 (   0.00%)     3039.54 (   0.22%)
Hmean     64      2514.25 (   0.00%)     2498.96 *  -0.61%*
Hmean     128     1778.49 (   0.00%)     1746.05 *  -1.82%*

Note that this is somewhat specific to XFS and ext4 shows no performance
difference as it does not rely on kworkers in the same way. No major
problem was observed running other workloads on different machines although
not all tests have completed yet.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200128154006.GD3466@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-02-10 11:24:37 +01:00
Linus Torvalds
89a47dd1af Kbuild updates for v5.6 (2nd)
- fix randconfig to generate a sane .config
 
  - rename hostprogs-y / always to hostprogs / always-y, which are
    more natual syntax.
 
  - optimize scripts/kallsyms
 
  - fix yes2modconfig and mod2yesconfig
 
  - make multiple directory targets ('make foo/ bar/') work
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl47NfMVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGRGwP/3AHO8P0wGEeFKs3ziSMjs2W7/Pj
 lN08Kuxm0u3LnyEEcHVUveoi+xBYqvrw0RsGgYf5S8q0Mpep7MPqbfkDUxV/0Zkj
 QP2CsvOTbjdBjH7q3ojkwLcDl0Pxu9mg3eZMRXZ2WQeNXuMRw6Bicoh7ElvB1Bv/
 HC+j30i2Me3cf/riQGSAsstvlXyIR8RaerR8PfRGESTysiiN76+JcHTatJHhOJL9
 O6XKkzo8/CXMYKKVF4Ae4NP+WFg6E96/pAPx0Rf47RbPX9UG35L9rkzTDnk70Ms6
 OhKiu3hXsRX7mkqApuoTqjge4+iiQcKZxYmMXU1vGlIRzjwg19/4YFP6pDSCcnIu
 kKb8KN4o4N41N7MFS3OLZWwISA8Vw6RbtwDZ3AghDWb7EHb9oNW42mGfcAPr1+wZ
 /KH6RHTzaz+5q2MgyMY1NhADFrhIT9CvDM+UJECgbokblnw7PHAnPmbsuVak9ZOH
 u9ojO1HpTTuIYO6N6v4K5zQBZF1N+RvkmBnhHd8j6SksppsCoC/G62QxgXhF2YK3
 FQMpATCpuyengLxWAmPEjsyyPOlrrdu9UxqNsXVy5ol40+7zpxuHwKcQKCa9urJR
 rcpbIwLaBcLhHU4BmvBxUk5aZxxGV2F0O0gXTOAbT2xhd6BipZSMhUmN49SErhQm
 NC/coUmQX7McxMXh
 =sv4U
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - fix randconfig to generate a sane .config

 - rename hostprogs-y / always to hostprogs / always-y, which are more
   natual syntax.

 - optimize scripts/kallsyms

 - fix yes2modconfig and mod2yesconfig

 - make multiple directory targets ('make foo/ bar/') work

* tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: make multiple directory targets work
  kconfig: Invalidate all symbols after changing to y or m.
  kallsyms: fix type of kallsyms_token_table[]
  scripts/kallsyms: change table to store (strcut sym_entry *)
  scripts/kallsyms: rename local variables in read_symbol()
  kbuild: rename hostprogs-y/always to hostprogs/always-y
  kbuild: fix the document to use extra-y for vmlinux.lds
  kconfig: fix broken dependency in randconfig-generated .config
2020-02-09 16:05:50 -08:00
Linus Torvalds
1a2a76c268 A set of fixes for X86:
- Ensure that the PIT is set up when the local APIC is disable or
    configured in legacy mode. This is caused by an ordering issue
    introduced in the recent changes which skip PIT initialization when the
    TSC and APIC frequencies are already known.
 
  - Handle malformed SRAT tables during early ACPI parsing which caused an
    infinite loop anda boot hang.
 
  - Fix a long standing race in the affinity setting code which affects PCI
    devices with non-maskable MSI interrupts. The problem is caused by the
    non-atomic writes of the MSI address (destination APIC id) and data
    (vector) fields which the device uses to construct the MSI message. The
    non-atomic writes are mandated by PCI.
 
    If both fields change and the device raises an interrupt after writing
    address and before writing data, then the MSI block constructs a
    inconsistent message which causes interrupts to be lost and subsequent
    malfunction of the device.
 
    The fix is to redirect the interrupt to the new vector on the current
    CPU first and then switch it over to the new target CPU. This allows to
    observe an eventually raised interrupt in the transitional stage (old
    CPU, new vector) to be observed in the APIC IRR and retriggered on the
    new target CPU and the new vector. The potential spurious interrupts
    caused by this are harmless and can in the worst case expose a buggy
    driver (all handlers have to be able to deal with spurious interrupts as
    they can and do happen for various reasons).
 
  - Add the missing suspend/resume mechanism for the HYPERV hypercall page
    which prevents resume hibernation on HYPERV guests. This change got
    lost before the merge window.
 
  - Mask the IOAPIC before disabling the local APIC to prevent potentially
    stale IOAPIC remote IRR bits which cause stale interrupt lines after
    resume.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5AEJwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWY2D/47ur9gsVQGryKzneVAr0SCsq4Un11e
 uifX4ldu4gCEBRTYhpgcpiFKeLvY/QJ6uOD+gQUHyy/s+lCf6yzE6UhXEqSCtcT7
 LkSxD8jAFf6KhMA6iqYBfyxUsPMXBetLjjHWsyc/kf15O/vbYm7qf05timmNZkDS
 S7C+yr3KRqRjLR7G7t4twlgC9aLcNUQihUdsH2qyTvjnlkYHJLDa0/Js7bFYYKVx
 9GdUDLvPFB1mZ76g012De4R3kJsWitiyLlQ38DP5VysKulnszUCdiXlgCEFrgxvQ
 OQhLafQzOAzvxQmP+1alODR0dmJZA8k0zsDeeTB/vTpRvv6+Pe2qUswLSpauBzuq
 TpDsrv8/5pwZh28+91f/Unk+tH8NaVNtGe/Uf+ePxIkn1nbqL84o4NHGplM6R97d
 HAWdZQZ1cGRLf6YRRJ+57oM/5xE3vBbF1Wn0+QDTFwdsk2vcxuQ4eB3M/8E1V7Zk
 upp8ty50bZ5+rxQ8XTq/eb8epSRnfLoBYpi4ux6MIOWRdmKDl40cDeZCzA2kNP7m
 qY1haaRN3ksqvhzc0Yf6cL+CgvC4ur8gRHezfOqmBzVoaLyVEFIVjgjR/ojf0bq8
 /v+L9D5+IdIv4jEZruRRs0gOXNDzoBbvf0qKGaO0tUTWiDsv7c5AGixp8aozniHS
 HXsv1lIpRuC7WQ==
 =WxKD
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of fixes for X86:

   - Ensure that the PIT is set up when the local APIC is disable or
     configured in legacy mode. This is caused by an ordering issue
     introduced in the recent changes which skip PIT initialization when
     the TSC and APIC frequencies are already known.

   - Handle malformed SRAT tables during early ACPI parsing which caused
     an infinite loop anda boot hang.

   - Fix a long standing race in the affinity setting code which affects
     PCI devices with non-maskable MSI interrupts. The problem is caused
     by the non-atomic writes of the MSI address (destination APIC id)
     and data (vector) fields which the device uses to construct the MSI
     message. The non-atomic writes are mandated by PCI.

     If both fields change and the device raises an interrupt after
     writing address and before writing data, then the MSI block
     constructs a inconsistent message which causes interrupts to be
     lost and subsequent malfunction of the device.

     The fix is to redirect the interrupt to the new vector on the
     current CPU first and then switch it over to the new target CPU.
     This allows to observe an eventually raised interrupt in the
     transitional stage (old CPU, new vector) to be observed in the APIC
     IRR and retriggered on the new target CPU and the new vector.

     The potential spurious interrupts caused by this are harmless and
     can in the worst case expose a buggy driver (all handlers have to
     be able to deal with spurious interrupts as they can and do happen
     for various reasons).

   - Add the missing suspend/resume mechanism for the HYPERV hypercall
     page which prevents resume hibernation on HYPERV guests. This
     change got lost before the merge window.

   - Mask the IOAPIC before disabling the local APIC to prevent
     potentially stale IOAPIC remote IRR bits which cause stale
     interrupt lines after resume"

* tag 'x86-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic: Mask IOAPIC entries when disabling the local APIC
  x86/hyperv: Suspend/resume the hypercall page for hibernation
  x86/apic/msi: Plug non-maskable MSI affinity race
  x86/boot: Handle malformed SRAT tables during early ACPI parsing
  x86/timer: Don't skip PIT setup when APIC is disabled or in legacy mode
2020-02-09 12:11:12 -08:00
Linus Torvalds
f41377609a Two fixes for the SMP related functionality:
- Make the UP version of smp_call_function_single() match SMP semantics
    when called for a not available CPU.  Instead of emitting a warning and
    assuming that the function call target is CPU0, return a proper error
    code like the SMP version does.
 
  - Remove a superfluous check in smp_call_function_many_cond()
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5ADFYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZdoD/9I+zp3QWj0/xxpNEOZj5yKhaxDLMqK
 OjgIvKGOfd5kWk+y57iuvPZKiacnHPlixni9H0KlAz6FW8V40jQz5iOCpnw6OTSO
 mqET7dn7ei7+9fpKmivfAybf8Jz9dy4ouZUwxEo0L+AizI6JE1eR6tpCbt09I9Dk
 uOCIcolTFX32JF4p7IHSlk1ViU8jC/L1zO7In7aqizpwQt9uVxLRUZeIb3nSftBY
 iJs8Kubfb+Fuc7+k5CXYmXLC6toqhWvsN1546ngC+sYz4nbgGUcnYc+U9slstGkr
 fHiP2uuTyYdIwKYrF7KYRv0NdjrEt4w+ZpK49AHoc0ZNs8MnVhcGq4riWWjmfuGk
 ZTbmEIQu9cxEaecQcIsIVpi7xpap0LfFTTze0YUshYlHtONQ4xAsFo2vjbBTMDqU
 P31aVv95bgHDaDv12pQu3DV/ztW4Xi5/6KynDkeCBo9VdaUwGbpV9Ro0SCWSt5qH
 OymyN1x+JIozd2LGNA8Vat7FxpktgqTCe2TMLLwqL4fX4GtTHbwC9zFyfvcey2Kn
 KKgP1c0rcHYmdyYvFd6mumjhzusBGCUVL/h9SSKboNQWZ0/fL1KyFZseM5Sqwexy
 76qVA2zY0ZJ0QN77vZqZgWWf2UUlxF7++Vi0cfuCtt2+V4SiYN7RzeO8SpAKSD0M
 4ycr8w8uXQk45Q==
 =XfHp
 -----END PGP SIGNATURE-----

Merge tag 'smp-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP fixes from Thomas Gleixner:
 "Two fixes for the SMP related functionality:

   - Make the UP version of smp_call_function_single() match SMP
     semantics when called for a not available CPU. Instead of emitting
     a warning and assuming that the function call target is CPU0,
     return a proper error code like the SMP version does.

   - Remove a superfluous check in smp_call_function_many_cond()"

* tag 'smp-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp/up: Make smp_call_function_single() match SMP semantics
  smp: Remove superfluous cond_func check in smp_call_function_many_cond()
2020-02-09 12:09:43 -08:00
Linus Torvalds
ca21b9b370 A set of fixes and improvements for the perf subsystem:
- Kernel fixes:
 
    - Install cgroup events to the correct CPU context to prevent a
      potential list double add
 
    - Prevent am intgeer underflow in the perf mlock acounting
 
    - Add a missing prototyp for arch_perf_update_userpage()
 
  - Tooling:
 
    - Add a missing unlock in the error path of maps__insert() in perf maps.
 
    - Fix the build with the latest libbfd
 
    - Fix the perf parser so it does not delete parse event terms, which
      caused a regression for using perf with the ARM CoreSight as the sink
      confuguration was missing due to the deletion.
 
    - Fix the double free in the perf CPU map merging test case
 
    - Add the missing ustring support for the perf probe command
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5AC0ITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaJtD/4jEdN6KNGVJIQ5jOYdchXK/zb68plS
 3By6CegbaNq1SU5UPIdMX4BkznVGaVtJU/0hWuvD/ycpBTAMgKjwalYJtAC+anVi
 JhG7NiPRV1Nhm+7eZ/78mUpW4CUimTlvZVzU/yneYdFm2klvcxUHblJYSqEGp0AS
 r2aZRsqQnWSoI/+z+0THO8tI+HLSpkmKy2slLxaZphI0VjSrjWPDHfF6eAOyl/dq
 lTCz+tjd6EytELL+lhWFsGXYAi6HPKP3T4yPRH+eDYKQmByYaEYbK3E8wg/0XB/J
 2AHgSBf9pSPDBIkLOWOidmkmWgZD9ykCTyOPu4N0S70+NeaCm2nXLTOQ7dnyLE7t
 WCx8mvnIS2hshNUoXMkarG5LYexPupDMMEfHyUT5+T2rKxacKWLaRoIV+JCsUpQb
 m6eU3+n/YsN1C05V75Fuztt4irGhltlQxcG8F3gH/vqSy6VDdZb8lMU6+iyE2VKG
 ezsI7AMQkT6LrTGa2hXHHnnluaxHHSA32GPe4W1QTwMCMWMtRTwQHBBLoJ4mC0wk
 iujB9DVuh7ljmr7QSG9ZYV91eplpzJDUC54P6Qs/p7ouG4YzkIO6glt6BOgBmbp7
 YkrJtGpV6npjJmLckktcSd9rtnCzot6yGxeaIVfLPhhtf2KECSCckCyddwkakt0A
 wwVVBe8RNxXf2A==
 =xu7D
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A set of fixes and improvements for the perf subsystem:

  Kernel fixes:

   - Install cgroup events to the correct CPU context to prevent a
     potential list double add

   - Prevent an integer underflow in the perf mlock accounting

   - Add a missing prototype for arch_perf_update_userpage()

  Tooling:

   - Add a missing unlock in the error path of maps__insert() in perf
     maps.

   - Fix the build with the latest libbfd

   - Fix the perf parser so it does not delete parse event terms, which
     caused a regression for using perf with the ARM CoreSight as the
     sink configuration was missing due to the deletion.

   - Fix the double free in the perf CPU map merging test case

   - Add the missing ustring support for the perf probe command"

* tag 'perf-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf maps: Add missing unlock to maps__insert() error case
  perf probe: Add ustring support for perf probe command
  perf: Make perf able to build with latest libbfd
  perf test: Fix test case Merge cpu map
  perf parse: Copy string to perf_evsel_config_term
  perf parse: Refactor 'struct perf_evsel_config_term'
  kernel/events: Add a missing prototype for arch_perf_update_userpage()
  perf/cgroups: Install cgroup events to correct cpuctx
  perf/core: Fix mlock accounting in perf_mmap()
2020-02-09 12:04:09 -08:00
Linus Torvalds
2fbc23c738 Two small fixes for the time(r) subsystem:
- Handle a subtle race between the clocksource watchdog and a concurrent
     clocksource watchdog stop/start sequence correctly to prevent a timer
     double add bug.
 
   - Fix the file path for the core time namespace file.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5ADSUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXajD/9BiKKMQu11ExpG/VddjCM6M9eHqCAp
 6tFWtjN1u49mw2rqH88WlwcOqQpMHvASPEQ5SekYhD1vLX4OSk1E58No9UNKQANE
 xQjQals4MmuvPtBZe6Lp5ORSKKvFRfZCt/4TZ5NcrUXLGyWaRRhHbuSjKtJZ8tko
 NRYcNSYnDOABL6LhKnLwAVsI9faeymKsrwwxW+FQerclCj1QaJLbFC4uenpCwKjF
 rz5qdg9wk7NTQ6KfX2qQrQgnNGUywBTvL0pGtGV+l3VPZMMYyaqSWpPaqZ+McogS
 FP60sDOFy8XlyVkqD/FdKnZwss1akXmkhnh2t/41mDrFE6kpsOBR0q5ZpAExI6N2
 uUN692kb2mVGpC+VLEED/R3I4cixC0Ux1UE+x/4qnG+CkQDoFU5QVgTzOTCSUfE3
 yiDTVOniAz998uoKJID8F7JjQH5g8NJoNODYZ8mT/ctntOl7Q7EXEL5nBOLH36KA
 sl1gTX0hPoyHFmV5VJRmyAnzF3NkVmQ3FI9Sya93NJluOnhSwma01wcan9Dlnq6I
 5HUn71+TCSR18pr7adIWqIB9gJuVu6ssZtZD8nxUH1pG1gv/Odp6WFEVnmhtaNVG
 cOmugi0DALndqLiTACTCQqnwb3wIeQ5QRd81HdMmjV1DgqE21U76s6JAR1tXO9eq
 eNDQ00Cb7dcBYQ==
 =0zFj
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two small fixes for the time(r) subsystem:

   - Handle a subtle race between the clocksource watchdog and a
     concurrent clocksource watchdog stop/start sequence correctly to
     prevent a timer double add bug.

   - Fix the file path for the core time namespace file"

* tag 'timers-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Prevent double add_timer_on() for watchdog_timer
  MAINTAINERS: Correct path to time namespace source file
2020-02-09 12:00:12 -08:00
Linus Torvalds
f06bed87d7 A set of fixes for the interrupt subsystem:
- Provision only ACPI enabled redistributors on GICv3
 
  - Use the proper command colums when building the INVALL command for the
    GICv3-ITS
 
  - Ensure the allocation of the L2 vPE table for GICv4.1
 
  - Correct the GICv4.1 VPROBASER programming so it uses the proper size
 
  - A set of small GICv4.1 tidy up patches
 
  - Configuration cleanup for C-SKY interrupt chip
 
  - Clarify the function documentation for irq_set_wake() to document that
    the wakeup functionality is orthogonal to the irq disable/enable
    mechanism.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5ACB4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodrNEAC22Nu3gGHKE/AUPZP8q53wl5axVZ4M
 reM3Wnw7LcUqmXHApbG/kJMbkGkN8sQhieyuTY2UBea+K06nox6aulBjLZ2U6UGE
 /5vFD+qB8a8AvSjyVGi0BU04h4RXJEZ9MxM34VDBiabQ74yiEIQvEYhyGVrMVRoM
 HC2UP2Y3SgYbBzRPL/sXUjNtPB6QAxABm41PK/2b7y36eULHv3LszqrEcNyuJ7qm
 2wEppOmB8+4j6d12zxOJh2hE4RLvNwKgWpcbEofVsI0FdCTcJ/0wVhdTPJmzLz2m
 kNFhLQ6qEhCj3ca0tF3sPwl+g0lHKVBtWMkIjKbC4N8g7pBvzj46Ys0/umuTnY9T
 pQvJ+N7Jcnbm2IkxYL707X8GewJjcGdYqVklXOJDyfCKm9G1h2lrCQmEjJaVHGVi
 f5eQVg401ndqu3L4sSctQM9Qwd3RnVZwanwbPBSD4sbTRdQseRTezIM61bvzvppF
 mIwflkfHB/CsrszfFrXHDy22GnsrpR+TTJWgPFahczZCAIxvdv8s+lsMpkZ1oXfg
 21cT0Bpj9JT6MIU9K7nalWmAO2Ylb0qDofLNlD1tb9pLWQDSHdR/hEm9o+4Msa/6
 /cvrVLVwwM1P0hU1lI7VRKlbsZ0sYWLY1uro05lvckt4QO9WFAZsafnmAVOzN/g5
 l7voNi/F8sww2Q==
 =a9t/
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt fixes from Thomas Gleixner:
 "A set of fixes for the interrupt subsystem:

   - Provision only ACPI enabled redistributors on GICv3

   - Use the proper command colums when building the INVALL command for
     the GICv3-ITS

   - Ensure the allocation of the L2 vPE table for GICv4.1

   - Correct the GICv4.1 VPROBASER programming so it uses the proper
     size

   - A set of small GICv4.1 tidy up patches

   - Configuration cleanup for C-SKY interrupt chip

   - Clarify the function documentation for irq_set_wake() to document
     that the wakeup functionality is orthogonal to the irq
     disable/enable mechanism"

* tag 'irq-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Rename VPENDBASER/VPROPBASER accessors
  irqchip/gic-v3-its: Remove superfluous WARN_ON
  irqchip/gic-v4.1: Drop 'tmp' in inherit_vpe_l1_table_from_rd()
  irqchip/gic-v4.1: Ensure L2 vPE table is allocated at RD level
  irqchip/gic-v4.1: Set vpe_l1_base for all redistributors
  irqchip/gic-v4.1: Fix programming of GICR_VPROPBASER_4_1_SIZE
  genirq: Clarify that irq wake state is orthogonal to enable/disable
  irqchip/gic-v3-its: Reference to its_invall_cmd descriptor when building INVALL
  irqchip: Some Kconfig cleanup for C-SKY
  irqchip/gic-v3: Only provision redistributors that are enabled in ACPI
2020-02-09 11:56:41 -08:00
Linus Torvalds
291abfea47 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Unbalanced locking in mwifiex_process_country_ie, from Brian Norris.

 2) Fix thermal zone registration in iwlwifi, from Andrei
    Otcheretianski.

 3) Fix double free_irq in sgi ioc3 eth, from Thomas Bogendoerfer.

 4) Use after free in mptcp, from Florian Westphal.

 5) Use after free in wireguard's root_remove_peer_lists, from Eric
    Dumazet.

 6) Properly access packets heads in bonding alb code, from Eric
    Dumazet.

 7) Fix data race in skb_queue_len(), from Qian Cai.

 8) Fix regression in r8169 on some chips, from Heiner Kallweit.

 9) Fix XDP program ref counting in hv_netvsc, from Haiyang Zhang.

10) Certain kinds of set link netlink operations can cause a NULL deref
    in the ipv6 addrconf code. Fix from Eric Dumazet.

11) Don't cancel uninitialized work queue in drop monitor, from Ido
    Schimmel.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
  net: thunderx: use proper interface type for RGMII
  mt76: mt7615: fix max_nss in mt7615_eeprom_parse_hw_cap
  bpf: Improve bucket_log calculation logic
  selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  bpf, sockhash: Synchronize_rcu before free'ing map
  bpf, sockmap: Don't sleep while holding RCU lock on tear-down
  bpftool: Don't crash on missing xlated program instructions
  bpf, sockmap: Check update requirements after locking
  drop_monitor: Do not cancel uninitialized work item
  mlxsw: spectrum_dpipe: Add missing error path
  mlxsw: core: Add validation of hardware device types for MGPIR register
  mlxsw: spectrum_router: Clear offload indication from IPv6 nexthops on abort
  selftests: mlxsw: Add test cases for local table route replacement
  mlxsw: spectrum_router: Prevent incorrect replacement of local table routes
  net: dsa: microchip: enable module autoprobe
  ipv6/addrconf: fix potential NULL deref in inet6_set_link_af()
  dpaa_eth: support all modes with rate adapting PHYs
  net: stmmac: update pci platform data to use phy_interface
  net: stmmac: xgmac: fix missing IFF_MULTICAST checki in dwxgmac2_set_filter
  net: stmmac: fix missing IFF_MULTICAST check in dwmac4_set_filter
  ...
2020-02-08 17:15:08 -08:00
Linus Torvalds
c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errofc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
David S. Miller
2696e1146d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-02-07

The following pull-request contains BPF updates for your *net* tree.

We've added 15 non-merge commits during the last 10 day(s) which contain
a total of 12 files changed, 114 insertions(+), 31 deletions(-).

The main changes are:

1) Various BPF sockmap fixes related to RCU handling in the map's tear-
   down code, from Jakub Sitnicki.

2) Fix macro state explosion in BPF sk_storage map when calculating its
   bucket_log on allocation, from Martin KaFai Lau.

3) Fix potential BPF sockmap update race by rechecking socket's established
   state under lock, from Lorenz Bauer.

4) Fix crash in bpftool on missing xlated instructions when kptr_restrict
   sysctl is set, from Toke Høiland-Jørgensen.

5) Fix i40e's XSK wakeup code to return proper error in busy state and
   various misc fixes in xdpsock BPF sample code, from Maciej Fijalkowski.

6) Fix the way modifiers are skipped in BTF in the verifier while walking
   pointers to avoid program rejection, from Alexei Starovoitov.

7) Fix Makefile for runqslower BPF tool to i) rebuild on libbpf changes and
   ii) to fix undefined reference linker errors for older gcc version due to
   order of passed gcc parameters, from Yulia Kartseva and Song Liu.

8) Fix a trampoline_count BPF kselftest warning about missing braces around
   initializer, from Andrii Nakryiko.

9) Fix up redundant "HAVE" prefix from large INSN limit kernel probe in
   bpftool, from Michal Rostecki.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-08 15:01:03 +01:00
Stephen Boyd
f9f21cea31 genirq: Clarify that irq wake state is orthogonal to enable/disable
There's some confusion around if an irq that's disabled with disable_irq()
can still wake the system from sleep states such as "suspend to RAM".

Clarify this in the kernel documentation for irq_set_irq_wake() so that
it's clear that an irq can be disabled and still wake the system if it has
been marked for wakeup.

Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lkml.kernel.org/r/20200206191521.94559-1-swboyd@chromium.org
2020-02-07 21:37:08 +01:00
Al Viro
58c025f0e8 cgroup1: switch to use of errorfc() et.al.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:43 -05:00
Al Viro
d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen
96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Al Viro
fbc2d1686d get rid of cg_invalf()
pointless alias for invalf()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:31 -05:00
Paul E. McKenney
1e474b28e7 smp/up: Make smp_call_function_single() match SMP semantics
In CONFIG_SMP=y kernels, smp_call_function_single() returns -ENXIO when
invoked for a non-existent CPU.  In contrast, in CONFIG_SMP=n kernels,
a splat is emitted and smp_call_function_single() otherwise silently
ignores its "cpu" argument, instead pretending that the caller intended
to have something happen on CPU 0.  Given that there is now code that
expects smp_call_function_single() to return an error if a bad CPU was
specified, this difference in semantics needs to be addressed.

Bring the semantics of the CONFIG_SMP=n version of
smp_call_function_single() into alignment with its CONFIG_SMP=y
counterpart.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200205143409.GA7021@paulmck-ThinkPad-P72
2020-02-07 15:34:12 +01:00
Linus Torvalds
d854b2d639 kgdb fixes for 5.6-rc1
One of the simplifications added for 5.6-rc1 has caused build
 regressions on some platforms (it was reported for sparc64).
 This pull request fixes it with a direct revert.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl48JNUACgkQfOMlXTn3
 iKF7cA/9G3hl4IB1D2OfbbLb6lXPu+TEWDsQk7Y5CmKgbclMv+JL684TJCU7rnFB
 Xxjho97Cj9bqR3ZGDInxLcJrAnbaD6756UC7k5jVUIcdFFhprjdgWImyJVs6VXoe
 ESahpSna+Ap3ZsMBgQJUfnBbDzY/B3jmPQ8/f7qOIjBb17AdarisClk2CdbY90Gi
 0dfJti65SLqE10on7Clx+9HmNkPp9Rs+1+4O9r2LW5c9b6aNop8iQPdLdl7/B2Me
 V9Vs5uAsLqNKBpyFuq3PG1AXzK53gK2N3wZqmOPAuKAObDM/wUqvVHaAJHqiIWN9
 U+Dn9aZCIW3Mbeat4RijVT8CCoz1WEvCQM1MUk5sbgBnQ90vwSyDmJrA88GMoHtI
 wmP5OJL/EdghGWG9+PcFH2+axD87iEq48hxVXlB9UTtvoyjLkkiXm3WvOx2UC8rX
 8tjEZvivMO3GeJBxYQUXvSFzGUeJAMZhMQm9+0qaw//SJK5GyEG0CfDDp5Ts1q2w
 +F689/TaygpOZGP5m1iFqhNAoR03g+NTX33/CYoa+Yplnn+j+LU4ZTlNq4vl1VY6
 lATR3Eu/vLYhCqdeXaZpRx4cthEFQ3kl8tT53RU2Ip20cNQkNooHPpCpkk0vSuGx
 X0PV9CX/nAFWGrfbJLCW3AQ2ZTxqdl8i2Q7HzOJUJPrddppqN/U=
 =Pb+W
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb fix from Daniel Thompson:
 "One of the simplifications added for 5.6-rc1 has caused build
  regressions on some platforms (it was reported for sparc64).

  This fixes it with a revert"

* tag 'kgdb-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  Revert "kdb: Get rid of confusing diag msg from "rd" if current task has no regs"
2020-02-06 09:05:42 -08:00
Daniel Thompson
fcf2736c82 Revert "kdb: Get rid of confusing diag msg from "rd" if current task has no regs"
This reverts commit bbfceba15f.

When DBG_MAX_REG_NUM is zero then a number of symbols are conditionally
defined. It is therefore not possible to check it using C expressions.

Reported-by: Anatoly Pugachev <matorola@gmail.com>
Acked-by: Doug Anderson <dianders@chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-02-06 11:40:09 +00:00
Linus Torvalds
e310396bb8 Tracing updates:
- Added new "bootconfig".
    Looks for a file appended to initrd to add boot config options.
    This has been discussed thoroughly at Linux Plumbers.
    Very useful for adding kprobes at bootup.
    Only enabled if "bootconfig" is on the real kernel command line.
 
  - Created dynamic event creation.
    Merges common code between creating synthetic events and
      kprobe events.
 
  - Rename perf "ring_buffer" structure to "perf_buffer"
 
  - Rename ftrace "ring_buffer" structure to "trace_buffer"
    Had to rename existing "trace_buffer" to "array_buffer"
 
  - Allow trace_printk() to work withing (some) tracing code.
 
  - Sort of tracing configs to be a little better organized
 
  - Fixed bug where ftrace_graph hash was not being protected properly
 
  - Various other small fixes and clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXjtAURQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qshOAQDzopQmvAVrrI6oogghr8JQA30Z2yqT
 i+Ld7vPWL2MV9wEA1S+zLGDSYrj8f/vsCq6BxRYT1ApO+YtmY6LTXiUejwg=
 =WNds
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Added new "bootconfig".

   This looks for a file appended to initrd to add boot config options,
   and has been discussed thoroughly at Linux Plumbers.

   Very useful for adding kprobes at bootup.

   Only enabled if "bootconfig" is on the real kernel command line.

 - Created dynamic event creation.

   Merges common code between creating synthetic events and kprobe
   events.

 - Rename perf "ring_buffer" structure to "perf_buffer"

 - Rename ftrace "ring_buffer" structure to "trace_buffer"

   Had to rename existing "trace_buffer" to "array_buffer"

 - Allow trace_printk() to work withing (some) tracing code.

 - Sort of tracing configs to be a little better organized

 - Fixed bug where ftrace_graph hash was not being protected properly

 - Various other small fixes and clean ups

* tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits)
  bootconfig: Show the number of nodes on boot message
  tools/bootconfig: Show the number of bootconfig nodes
  bootconfig: Add more parse error messages
  bootconfig: Use bootconfig instead of boot config
  ftrace: Protect ftrace_graph_hash with ftrace_sync
  ftrace: Add comment to why rcu_dereference_sched() is open coded
  tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
  tracing: Annotate ftrace_graph_hash pointer with __rcu
  bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline
  tracing: Use seq_buf for building dynevent_cmd string
  tracing: Remove useless code in dynevent_arg_pair_add()
  tracing: Remove check_arg() callbacks from dynevent args
  tracing: Consolidate some synth_event_trace code
  tracing: Fix now invalid var_ref_vals assumption in trace action
  tracing: Change trace_boot to use synth_event interface
  tracing: Move tracing selftests to bottom of menu
  tracing: Move mmio tracer config up with the other tracers
  tracing: Move tracing test module configs together
  tracing: Move all function tracing configs together
  tracing: Documentation for in-kernel synthetic event API
  ...
2020-02-06 07:12:11 +00:00
Steven Rostedt (VMware)
54a16ff6f2 ftrace: Protect ftrace_graph_hash with ftrace_sync
As function_graph tracer can run when RCU is not "watching", it can not be
protected by synchronize_rcu() it requires running a task on each CPU before
it can be freed. Calling schedule_on_each_cpu(ftrace_sync) needs to be used.

Link: https://lore.kernel.org/r/20200205131110.GT2935@paulmck-ThinkPad-P72

Cc: stable@vger.kernel.org
Fixes: b9b0c831be ("ftrace: Convert graph filter to use hash tables")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05 17:16:42 -05:00
Steven Rostedt (VMware)
16052dd5bd ftrace: Add comment to why rcu_dereference_sched() is open coded
Because the function graph tracer can execute in sections where RCU is not
"watching", the rcu_dereference_sched() for the has needs to be open coded.
This is fine because the RCU "flavor" of the ftrace hash is protected by
its own RCU handling (it does its own little synchronization on every CPU
and does not rely on RCU sched).

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05 17:15:57 -05:00
Amol Grover
fd0e6852c4 tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu
Fix following instances of sparse error
kernel/trace/ftrace.c:5667:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5813:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5868:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5870:25: error: incompatible types in comparison

Use rcu_dereference_protected to dereference the newly annotated pointer.

Link: http://lkml.kernel.org/r/20200205055701.30195-1-frextrite@gmail.com

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05 17:14:37 -05:00
Amol Grover
24a9729f83 tracing: Annotate ftrace_graph_hash pointer with __rcu
Fix following instances of sparse error
kernel/trace/ftrace.c:5664:29: error: incompatible types in comparison
kernel/trace/ftrace.c:5785:21: error: incompatible types in comparison
kernel/trace/ftrace.c:5864:36: error: incompatible types in comparison
kernel/trace/ftrace.c:5866:25: error: incompatible types in comparison

Use rcu_dereference_protected to access the __rcu annotated pointer.

Link: http://lkml.kernel.org/r/20200201072703.17330-1-frextrite@gmail.com

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-05 17:14:26 -05:00
Christoph Hellwig
75467ee48a dma-direct: improve DMA mask overflow reporting
Remove the unset dma_mask case as that won't get into mapping calls
anymore, and also report the other errors unconditonally and with a
slightly improved message.  Remove the now pointless report_addr helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad@darnok.org>
2020-02-05 18:53:41 +01:00
Christoph Hellwig
4a47cbae04 dma-direct: improve swiotlb error reporting
Untangle the way how dma_direct_map_page calls into swiotlb to be able
to properly report errors where the swiotlb DMA address overflows the
mask separately from overflows in the !swiotlb case.  This means that
siotlb_map now has to do a little more work that duplicates
dma_direct_map_page, but doing so greatly simplifies the calling
convention.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-02-05 18:53:05 +01:00
Christoph Hellwig
91ef26f914 dma-direct: relax addressability checks in dma_direct_supported
dma_direct_supported tries to find the minimum addressable bitmask
based on the end pfn and optional magic that architectures can use
to communicate the size of the magic ZONE_DMA that can be used
for bounce buffering.  But between the DMA offsets that can change
per device (or sometimes even region), the fact the ZONE_DMA isn't
even guaranteed to be the lowest addresses and failure of having
proper interfaces to the MM code this fails at least for one
arm subarchitecture.

As all the legacy DMA implementations have supported 32-bit DMA
masks, and 32-bit masks are guranteed to always work by the API
contract (using bounce buffers if needed), we can short cut the
complicated check and always return true without breaking existing
assumptions.  Hopefully we can properly clean up the interaction
with the arch defined zones and the bootmem allocator eventually.

Fixes: ad3c7b18c5 ("arm: use swiotlb for bounce buffering on LPAE configs")
Reported-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
2020-02-05 18:50:55 +01:00
Ingo Molnar
fdff7c21ea Merge branch 'linus' into perf/urgent, to synchronize with upstream
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-02-05 08:44:22 +01:00
Linus Torvalds
72f582ff85 Merge branch 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs recursive removal updates from Al Viro:
 "We have quite a few places where synthetic filesystems do an
  equivalent of 'rm -rf', with varying amounts of code duplication,
  wrong locking, etc. That really ought to be a library helper.

  Only debugfs (and very similar tracefs) are converted here - I have
  more conversions, but they'd never been in -next, so they'll have to
  wait"

* 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems
2020-02-05 05:09:46 +00:00
Masahiro Yamada
cde26a6e17 kallsyms: fix type of kallsyms_token_table[]
kallsyms_token_table[] only contains ASCII characters. It should be
char instead of u8.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
2020-02-05 13:45:37 +09:00
Alexey Dobriyan
97a32539b9 proc: convert everything to "struct proc_ops"
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Alexei Starovoitov
257af63d7f bpf: Fix modifier skipping logic
Fix the way modifiers are skipped while walking pointers. Otherwise second
level dereferences of 'const struct foo *' will be rejected by the verifier.

Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200201000314.261392-1-ast@kernel.org
2020-02-04 00:06:07 +01:00
Linus Torvalds
e17ac02b18 kgdb patches for 5.6-rc1
Everything for kgdb this time around is either simplifications or clean
 ups.
 
 In particular Douglas Anderson's modifications to the backtrace machine
 in the *last* dev cycle have enabled Doug to tidy up some MIPS specific
 backtrace code and stop sharing certain data structures across the
 kernel.  Note that The MIPS folks were on Cc: for the MIPS patch and
 reacted positively (but without an explicit Acked-by).
 
 Doug also got rid of the implicit switching between tasks and register
 sets during some but not of kdb's backtrace actions (because the
 implicit switching was either confusing for users, pointless or both).
 
 Finally there is a coverity fix and patch to replace open coded console
 traversal with the proper helper function.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl44NQ0ACgkQfOMlXTn3
 iKHiXw//d6w5bIuA/HAQ24u/piEDlvYG7TYJ3GJLE1qaQMti9e2Ob48ahgUqQDbH
 K2slFvlhZbrXMHO8BZ1pQt2xaUx9rhmJEBh3GvEudFp4RgwRkebNF2YDuT5yq/Di
 gi3eeB4ZKBvCTsKGI+bNXYQCdTYEJ55gH+vj7jL1Kb2bmrNisnCKhzQhM2RvrkNB
 hRfpuFet3i9WsW9OILyt8aDTHCTKrPkghWiGQZ+9Z3TROI80CbO0Vwmg0xrrYEvh
 //X1Hu+IjoOSfQHNblBm9AMsqeo73HYJ9i5mtDhPL/BVensicY19Q7/bNSdw2yHL
 it3pPpyVGEhMXr/Qdbe2B7oqLUOzawpngdSzzcaa/lUT4zjh0F1tNrIyXjTZ4iCH
 kk2posDN+C/IfcOmZpSGBZQ8Ef57qtSAzvdGpyQPSTChyf8z1ufvCHfIzESpkaPU
 aa5jNwbAZCWmGDR3tGweUAUvgrKNaulbjygTvarNnv5Rt8gNXV7sKCilFF/nFLb4
 Pe9+NUWPSH81cwKyq/r4oG2TGPRUKMg5lo2k/ELHevTtXS5c2P/jtBp7NCstulk2
 RBp4oQhZ+lZNt8kz4l0yRXbaA5kqk3JRd8K76Bkm6E4ceXeX07d7rySkJPmzAGeA
 ZyLPUNGgn9k4XDMlkTUbFVocFtm+gxfelHcR1raDRg3MfYYzVAM=
 =igIA
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "Everything for kgdb this time around is either simplifications or
  clean ups.

  In particular Douglas Anderson's modifications to the backtrace
  machine in the *last* dev cycle have enabled Doug to tidy up some MIPS
  specific backtrace code and stop sharing certain data structures
  across the kernel. Note that The MIPS folks were on Cc: for the MIPS
  patch and reacted positively (but without an explicit Acked-by).

  Doug also got rid of the implicit switching between tasks and register
  sets during some but not of kdb's backtrace actions (because the
  implicit switching was either confusing for users, pointless or both).

  Finally there is a coverity fix and patch to replace open coded
  console traversal with the proper helper function"

* tag 'kgdb-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Use for_each_console() helper
  kdb: remove redundant assignment to pointer bp
  kdb: Get rid of confusing diag msg from "rd" if current task has no regs
  kdb: Gid rid of implicit setting of the current task / regs
  kdb: kdb_current_task shouldn't be exported
  kdb: kdb_current_regs should be private
  MIPS: kdb: Remove old workaround for backtracing on other CPUs
2020-02-03 16:59:51 +00:00
Tom Zanussi
2b90927c77 tracing: Use seq_buf for building dynevent_cmd string
The dynevent_cmd commands that build up the command string don't need
to do that themselves - there's a seq_buf facility that does pretty
much the same thing those command are doing manually, so use it
instead.

Link: http://lkml.kernel.org/r/eb8a6e835c964d0ab8a38cbf5ffa60746b54a465.1580506712.git.zanussi@kernel.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-01 13:10:15 -05:00
Tom Zanussi
e9260f6257 tracing: Remove useless code in dynevent_arg_pair_add()
The final addition to q is unnecessary, since q isn't ever used
afterwards.

Link: http://lkml.kernel.org/r/7880a1268217886cdba7035526650195668da856.1580506712.git.zanussi@kernel.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-01 13:09:42 -05:00
Tom Zanussi
74403b6c50 tracing: Remove check_arg() callbacks from dynevent args
It's kind of strange to have check_arg() callbacks as part of the arg
objects themselves; it makes more sense to just pass these in when the
args are added instead.

Remove the check_arg() callbacks from those objects which also means
removing the check_arg() args from the init functions, adding them to
the add functions and fixing up existing callers.

Link: http://lkml.kernel.org/r/c7708d6f177fcbe1a36b6e4e8e150907df0fa5d2.1580506712.git.zanussi@kernel.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-02-01 13:09:23 -05:00
Konstantin Khlebnikov
febac332a8 clocksource: Prevent double add_timer_on() for watchdog_timer
Kernel crashes inside QEMU/KVM are observed:

  kernel BUG at kernel/time/timer.c:1154!
  BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().

At the same time another cpu got:

  general protection fault: 0000 [#1] SMP PTI of poinson pointer 0xdead000000000200 in:

  __hlist_del at include/linux/list.h:681
  (inlined by) detach_timer at kernel/time/timer.c:818
  (inlined by) expire_timers at kernel/time/timer.c:1355
  (inlined by) __run_timers at kernel/time/timer.c:1686
  (inlined by) run_timer_softirq at kernel/time/timer.c:1699

Unfortunately kernel logs are badly scrambled, stacktraces are lost.

Printing the timer->function before the BUG_ON() pointed to
clocksource_watchdog().

The execution of clocksource_watchdog() can race with a sequence of
clocksource_stop_watchdog() .. clocksource_start_watchdog():

expire_timers()
 detach_timer(timer, true);
  timer->entry.pprev = NULL;
 raw_spin_unlock_irq(&base->lock);
 call_timer_fn
  clocksource_watchdog()

					clocksource_watchdog_kthread() or
					clocksource_unbind()

					spin_lock_irqsave(&watchdog_lock, flags);
					clocksource_stop_watchdog();
					 del_timer(&watchdog_timer);
					 watchdog_running = 0;
					spin_unlock_irqrestore(&watchdog_lock, flags);

					spin_lock_irqsave(&watchdog_lock, flags);
					clocksource_start_watchdog();
					 add_timer_on(&watchdog_timer, ...);
					 watchdog_running = 1;
					spin_unlock_irqrestore(&watchdog_lock, flags);

  spin_lock(&watchdog_lock);
  add_timer_on(&watchdog_timer, ...);
   BUG_ON(timer_pending(timer) || !timer->function);
    timer_pending() -> true
    BUG()

I.e. inside clocksource_watchdog() watchdog_timer could be already armed.

Check timer_pending() before calling add_timer_on(). This is sufficient as
all operations are synchronized by watchdog_lock.

Fixes: 75c5158f70 ("timekeeping: Update clocksource with stop_machine")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
2020-02-01 11:07:56 +01:00
Thomas Gleixner
6f1a4891a5 x86/apic/msi: Plug non-maskable MSI affinity race
Evan tracked down a subtle race between the update of the MSI message and
the device raising an interrupt internally on PCI devices which do not
support MSI masking. The update of the MSI message is non-atomic and
consists of either 2 or 3 sequential 32bit wide writes to the PCI config
space.

   - Write address low 32bits
   - Write address high 32bits (If supported by device)
   - Write data

When an interrupt is migrated then both address and data might change, so
the kernel attempts to mask the MSI interrupt first. But for MSI masking is
optional, so there exist devices which do not provide it. That means that
if the device raises an interrupt internally between the writes then a MSI
message is sent built from half updated state.

On x86 this can lead to spurious interrupts on the wrong interrupt
vector when the affinity setting changes both address and data. As a
consequence the device interrupt can be lost causing the device to
become stuck or malfunctioning.

Evan tried to handle that by disabling MSI accross an MSI message
update. That's not feasible because disabling MSI has issues on its own:

 If MSI is disabled the PCI device is routing an interrupt to the legacy
 INTx mechanism. The INTx delivery can be disabled, but the disablement is
 not working on all devices.

 Some devices lose interrupts when both MSI and INTx delivery are disabled.

Another way to solve this would be to enforce the allocation of the same
vector on all CPUs in the system for this kind of screwed devices. That
could be done, but it would bring back the vector space exhaustion problems
which got solved a few years ago.

Fortunately the high address (if supported by the device) is only relevant
when X2APIC is enabled which implies interrupt remapping. In the interrupt
remapping case the affinity setting is happening at the interrupt remapping
unit and the PCI MSI message is programmed only once when the PCI device is
initialized.

That makes it possible to solve it with a two step update:

  1) Target the MSI msg to the new vector on the current target CPU

  2) Target the MSI msg to the new vector on the new target CPU

In both cases writing the MSI message is only changing a single 32bit word
which prevents the issue of inconsistency.

After writing the final destination it is necessary to check whether the
device issued an interrupt while the intermediate state #1 (new vector,
current CPU) was in effect.

This is possible because the affinity change is always happening on the
current target CPU. The code runs with interrupts disabled, so the
interrupt can be detected by checking the IRR of the local APIC. If the
vector is pending in the IRR then the interrupt is retriggered on the new
target CPU by sending an IPI for the associated vector on the target CPU.

This can cause spurious interrupts on both the local and the new target
CPU.

 1) If the new vector is not in use on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then interrupt entry code will
    ignore that spurious interrupt. The vector is marked so that the
    'No irq handler for vector' warning is supressed once.

 2) If the new vector is in use already on the local CPU then the IRR check
    might see an pending interrupt from the device which is using this
    vector. The IPI to the new target CPU will then invoke the handler of
    the device, which got the affinity change, even if that device did not
    issue an interrupt

 3) If the new vector is in use already on the local CPU and the device
    affected by the affinity change raised an interrupt during the
    transitional state (step #1 above) then the handler of the device which
    uses that vector on the local CPU will be invoked.

expose issues in device driver interrupt handlers which are not prepared to
handle a spurious interrupt correctly. This not a regression, it's just
exposing something which was already broken as spurious interrupts can
happen for a lot of reasons and all driver handlers need to be able to deal
with them.

Reported-by: Evan Green <evgreen@chromium.org>
Debugged-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
2020-02-01 09:31:47 +01:00
Tom Zanussi
249d7b2ef6 tracing: Consolidate some synth_event_trace code
The synth_event trace code contains some almost identical functions
and some small functions that are called only once - consolidate the
common code into single functions and fold in the small functions to
simplify the code overall.

Link: http://lkml.kernel.org/r/d1c8d8ad124a653b7543afe801d38c199ca5c20e.1580506712.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-31 18:35:17 -05:00
Linus Torvalds
7eec11d3a7 Merge branch 'akpm' (patches from Andrew)
Pull updates from Andrew Morton:
 "Most of -mm and quite a number of other subsystems: hotfixes, scripts,
  ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

  MM is fairly quiet this time.  Holidays, I assume"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  kcov: ignore fault-inject and stacktrace
  include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
  execve: warn if process starts with executable stack
  reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
  init/main.c: fix misleading "This architecture does not have kernel memory protection" message
  init/main.c: fix quoted value handling in unknown_bootoption
  init/main.c: remove unnecessary repair_env_string in do_initcall_level
  init/main.c: log arguments and environment passed to init
  fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
  fs/binfmt_elf.c: coredump: delete duplicated overflow check
  fs/binfmt_elf.c: coredump: allocate core ELF header on stack
  fs/binfmt_elf.c: make BAD_ADDR() unlikely
  fs/binfmt_elf.c: better codegen around current->mm
  fs/binfmt_elf.c: don't copy ELF header around
  fs/binfmt_elf.c: fix ->start_code calculation
  fs/binfmt_elf.c: smaller code generation around auxv vector fill
  lib/find_bit.c: uninline helper _find_next_bit()
  lib/find_bit.c: join _find_next_bit{_le}
  uapi: rename ext2_swab() to swab() and share globally in swab.h
  lib/scatterlist.c: adjust indentation in __sg_alloc_table
  ...
2020-01-31 12:16:36 -08:00
Linus Torvalds
ddaefe8947 Modules updates for v5.6
Summary of modules changes for the 5.6 merge window:
 
 - Add "MS" (SHF_MERGE|SHF_STRINGS) section flags to __ksymtab_strings to
   indicate to the linker that it can perform string deduplication (i.e.,
   duplicate strings are reduced to a single copy in the string table).
   This means any repeated namespace string would be merged to just one
   entry in __ksymtab_strings.
 
 - Various code cleanups and small fixes (fix small memleak in error path,
   improve moduleparam docs, silence rcu warnings, improve error logging)
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEVrp26glSWYuDNrCUwEV+OM47wXIFAl40TvwQHGpleXVAa2Vy
 bmVsLm9yZwAKCRDARX44zjvBcigxD/4/ksGeXvf3tcsRc5M5S33Tws25vcHeByz/
 WEX1f7ZnXukCApFdnpUbVkjiH7EM0+T6lGumv4NPJht+ggP8JoY9hMkBqMmd0js/
 +R9U6o0vB4LW8zU68RwE0TS4qphpmpJz16HlhTPtIk4Vo0GBxnEYMMMcVWIeqq1W
 m3KcEUudv9/Y7IFawDNRJcUWI1jD2vcfaavbU6XbTw82ARiiScZFrWYzf1PGYJ6L
 XvJNwCVh8TDbS4C5kaNWp2LiGXegjKClosdisCIjkQr/3e+Rg1jOGHpa6B2+Vow2
 ttq6lmcikNpcCkCV1tFz+ex2LLsLBMAO939c2C0LIhnnIxVgSkDU0pWn3psAxiOl
 lRqHtQN42dRlOtBwZ9JoKTT9Wi3H/Lx0FCxg5OdblrSlOqH+GxQjBLkgtvmn/ZAh
 /dReehUoqbL55GieZuPPyostg3upCDE27IQZdFrZLWbE0VGiIyU9p6GYo7Tssuo2
 Tr8kmhYUF9o1AnlzVQgGgZF73PpM6vhmEnn/dipZrgFI//2A3xkAfi5JdhGLKsFi
 UsaeTX3q/AmnC8dqaNayiftSgaK/4hdSboW1hgWLLD98H608s7Bl1reTmXPxSyWj
 RvBVP0vp5+u9EItfkAG6jbEpM5ZtyFDUc+5KNfJhym6vaplp5H+krIrT2Li+oLUu
 d/eifJ/1vA==
 =boqg
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Summary of modules changes for the 5.6 merge window:

   - Add "MS" (SHF_MERGE|SHF_STRINGS) section flags to __ksymtab_strings
     to indicate to the linker that it can perform string deduplication
     (i.e., duplicate strings are reduced to a single copy in the string
     table). This means any repeated namespace string would be merged to
     just one entry in __ksymtab_strings.

   - Various code cleanups and small fixes (fix small memleak in error
     path, improve moduleparam docs, silence rcu warnings, improve error
     logging)"

* tag 'modules-for-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module.h: Annotate mod_kallsyms with __rcu
  module: avoid setting info->name early in case we can fall back to info->mod->name
  modsign: print module name along with error message
  kernel/module: Fix memleak in module_add_modinfo_attrs()
  export.h: reduce __ksymtab_strings string duplication by using "MS" section flags
  moduleparam: fix kerneldoc
  modules: lockdep: Suppress suspicious RCU usage warning
2020-01-31 11:42:13 -08:00
Dmitry Vyukov
43e76af85f kcov: ignore fault-inject and stacktrace
Don't instrument 3 more files that contain debugging facilities and
produce large amounts of uninteresting coverage for every syscall.

The following snippets are sprinkled all over the place in kcov traces
in a debugging kernel.  We already try to disable instrumentation of
stack unwinding code and of most debug facilities.  I guess we did not
use fault-inject.c at the time, and stacktrace.c was somehow missed (or
something has changed in kernel/configs).  This change both speeds up
kcov (kernel doesn't need to store these PCs, user-space doesn't need to
process them) and frees trace buffer capacity for more useful coverage.

  should_fail
  lib/fault-inject.c:149
  fail_dump
  lib/fault-inject.c:45

  stack_trace_save
  kernel/stacktrace.c:124
  stack_trace_consume_entry
  kernel/stacktrace.c:86
  stack_trace_consume_entry
  kernel/stacktrace.c:89
  ... a hundred frames skipped ...
  stack_trace_consume_entry
  kernel/stacktrace.c:93
  stack_trace_consume_entry
  kernel/stacktrace.c:86

Link: http://lkml.kernel.org/r/20200116111449.217744-1-dvyukov@gmail.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:41 -08:00
Tom Zanussi
d380dcde9a tracing: Fix now invalid var_ref_vals assumption in trace action
The patch 'tracing: Fix histogram code when expression has same var as
value' added code to return an existing variable reference when
creating a new variable reference, which resulted in var_ref_vals
slots being reused instead of being duplicated.

The implementation of the trace action assumes that the end of the
var_ref_vals array starting at action_data.var_ref_idx corresponds to
the values that will be assigned to the trace params. The patch
mentioned above invalidates that assumption, which means that each
param needs to explicitly specify its index into var_ref_vals.

This fix changes action_data.var_ref_idx to an array of var ref
indexes to account for that.

Link: https://lore.kernel.org/r/1580335695.6220.8.camel@kernel.org

Fixes: 8bcebc77e8 ("tracing: Fix histogram code when expression has same var as value")
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-31 12:59:26 -05:00
Tom Zanussi
fdeb1aca28 tracing: Change trace_boot to use synth_event interface
Have trace_boot_add_synth_event() use the synth_event interface.

Also, rename synth_event_run_cmd() to synth_event_run_command() now
that trace_boot's version is gone.

Link: http://lkml.kernel.org/r/94f1fa0e31846d0bddca916b8663404b20559e34.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-31 12:59:26 -05:00
Andy Shevchenko
dc2c733e65 kdb: Use for_each_console() helper
Replace open coded single-linked list iteration loop with for_each_console()
helper in use.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-01-31 17:34:54 +00:00
Colin Ian King
a4f8a7fb19 kdb: remove redundant assignment to pointer bp
The point bp is assigned a value that is never read, it is being
re-assigned later to bp = &kdb_breakpoints[lowbp] in a for-loop.
Remove the redundant assignment.

Addresses-Coverity ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20191128130753.181246-1-colin.king@canonical.com
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-01-31 17:34:06 +00:00
Douglas Anderson
bbfceba15f kdb: Get rid of confusing diag msg from "rd" if current task has no regs
If you switch to a sleeping task with the "pid" command and then type
"rd", kdb tells you this:

  No current kdb registers.  You may need to select another task
  diag: -17: Invalid register name

The first message makes sense, but not the second.  Fix it by just
returning 0 after commands accessing the current registers finish if
we've already printed the "No current kdb registers" error.

While fixing kdb_rd(), change the function to use "if" rather than
"ifdef".  It cleans the function up a bit and any modern compiler will
have no trouble handling still producing good code.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111624.5.I121f4c6f0c19266200bf6ef003de78841e5bfc3d@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-01-31 17:34:03 +00:00
Douglas Anderson
9441d5f6b7 kdb: Gid rid of implicit setting of the current task / regs
Some (but not all?) of the kdb backtrace paths would cause the
kdb_current_task and kdb_current_regs to remain changed.  As discussed
in a review of a previous patch [1], this doesn't seem intuitive, so
let's fix that.

...but, it turns out that there's actually no longer any reason to set
the current task / current regs while backtracing anymore anyway.  As
of commit 2277b49258 ("kdb: Fix stack crawling on 'running' CPUs
that aren't the master") if we're backtracing on a task running on a
CPU we ask that CPU to do the backtrace itself.  Linux can do that
without anything fancy.  If we're doing backtrace on a sleeping task
we can also do that fine without updating globals.  So this patch
mostly just turns into deleting a bunch of code.

[1] https://lore.kernel.org/r/20191010150735.dhrj3pbjgmjrdpwr@holly.lan

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111624.4.Ibc3d982bbeb9e46872d43973ba808cd4c79537c7@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-01-31 17:34:00 +00:00
Douglas Anderson
a8649fb0a8 kdb: kdb_current_task shouldn't be exported
The kdb_current_task variable has been declared in
"kernel/debug/kdb/kdb_private.h" since 2010 when kdb was added to the
mainline kernel.  This is not a public header.  There should be no
reason that kdb_current_task should be exported and there are no
in-kernel users that need it.  Remove the export.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111623.3.I14b22b5eb15ca8f3812ab33e96621231304dc1f7@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-01-31 17:33:57 +00:00
Douglas Anderson
c67c10a67f kdb: kdb_current_regs should be private
As of the patch ("MIPS: kdb: Remove old workaround for backtracing on
other CPUs") there is no reason for kdb_current_regs to be in the
public "kdb.h".  Let's move it next to kdb_current_task.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191109111623.2.Iadbfb484e90b557cc4b5ac9890bfca732cd99d77@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-01-31 17:33:54 +00:00
Tejun Heo
0cd9d33ace cgroup: init_tasks shouldn't be linked to the root cgroup
5153faac18 ("cgroup: remove cgroup_enable_task_cg_lists()
optimization") removed lazy initialization of css_sets so that new
tasks are always lniked to its css_set. In the process, it incorrectly
ended up adding init_tasks to root css_set. They show up as PID 0's in
root's cgroup.procs triggering warnings in systemd and generally
confusing people.

Fix it by skip css_set linking for init_tasks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: https://github.com/joanbm
Link: https://github.com/systemd/systemd/issues/14682
Fixes: 5153faac18 ("cgroup: remove cgroup_enable_task_cg_lists() optimization")
Cc: stable@vger.kernel.org # v5.5+
2020-01-30 11:37:33 -05:00
Steven Rostedt (VMware)
1e837945a8 tracing: Move tracing selftests to bottom of menu
Move all the tracing selftest configs to the bottom of the tracing menu.
There's no reason for them to be interspersed throughout.

Also, move the bootconfig menu to the top.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:29 -05:00
Steven Rostedt (VMware)
21b3ce3063 tracing: Move mmio tracer config up with the other tracers
Move the config that enables the mmiotracer with the other tracers such that
all the tracers are together.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:29 -05:00
Steven Rostedt (VMware)
a48fc4f5f1 tracing: Move tracing test module configs together
The MMIO test module was by itself, move it to the other test modules. Also,
add the text "Test module" to PREEMPTIRQ_DELAY_TEST as that create a test
module as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:29 -05:00
Steven Rostedt (VMware)
61778cd70c tracing: Move all function tracing configs together
The features that depend on the function tracer were spread out through the
tracing menu, pull them together as it is easier to manage.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:29 -05:00
Tom Zanussi
64836248dd tracing: Add kprobe event command generation test module
Add a test module that checks the basic functionality of the in-kernel
kprobe event command generation API by creating kprobe events from a
module.

Link: http://lkml.kernel.org/r/97e502b204f9dba948e3fa3a4315448298218787.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
29a1548105 tracing: Change trace_boot to use kprobe_event interface
Have trace_boot_add_kprobe_event() use the kprobe_event interface.

Also, rename kprobe_event_run_cmd() to kprobe_event_run_command() now
that trace_boot's version is gone.

Link: http://lkml.kernel.org/r/af5429d11291ab1e9a85a0ff944af3b2bcf193c7.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
2a588dd1d5 tracing: Add kprobe event command generation functions
Add functions used to generate kprobe event commands, built on top of
the dynevent_cmd interface.

kprobe_event_gen_cmd_start() is used to create a kprobe event command
using a variable arg list, and kretprobe_event_gen_cmd_start() does
the same for kretprobe event commands.  kprobe_event_add_fields() can
be used to add single fields one by one or as a group.  Once all
desired fields are added, kprobe_event_gen_cmd_end() or
kretprobe_event_gen_cmd_end() respectively are used to actually
execute the command and create the event.

Link: http://lkml.kernel.org/r/95cc4696502bb6017f9126f306a45ad19b4cc14f.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
9fe41efaca tracing: Add synth event generation test module
Add a test module that checks the basic functionality of the in-kernel
synthetic event generation API by generating and tracing synthetic
events from a module.

Link: http://lkml.kernel.org/r/fcb4dd9eb9eefb70ab20538d3529d51642389664.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
8dcc53ad95 tracing: Add synth_event_trace() and related functions
Add an exported function named synth_event_trace(), allowing modules
or other kernel code to trace synthetic events.

Also added are several functions that allow the same functionality to
be broken out in a piecewise fashion, which are useful in situations
where tracing an event from a full array of values would be
cumbersome.  Those functions are synth_event_trace_start/end() and
synth_event_add_(next)_val().

Link: http://lkml.kernel.org/r/7a84de5f1854acf4144b57efe835ca645afa764f.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
35ca5207c2 tracing: Add synthetic event command generation functions
Add functions used to generate synthetic event commands, built on top
of the dynevent_cmd interface.

synth_event_gen_cmd_start() is used to create a synthetic event
command using a variable arg list and
synth_event_gen_cmd_array_start() does the same thing but using an
array of field descriptors.  synth_event_add_field(),
synth_event_add_field_str() and synth_event_add_fields() can be used
to add single fields one by one or as a group.  Once all desired
fields are added, synth_event_gen_cmd_end() is used to actually
execute the command and create the event.

synth_event_create() does everything, including creating the event, in
a single call.

Link: http://lkml.kernel.org/r/38fef702fad5ef208009f459552f34a94befd860.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
86c5426bad tracing: Add dynamic event command creation interface
Add an interface used to build up dynamic event creation commands,
such as synthetic and kprobe events.  Interfaces specific to those
particular types of events and others can be built on top of this
interface.

Command creation is started by first using the dynevent_cmd_init()
function to initialize the dynevent_cmd object.  Following that, args
are appended and optionally checked by the dynevent_arg_add() and
dynevent_arg_pair_add() functions, which use objects representing
arguments and pairs of arguments, initialized respectively by
dynevent_arg_init() and dynevent_arg_pair_init().  Finally, once all
args have been successfully added, the command is finalized and
actually created using dynevent_create().

The code here for actually printing into the dyn_event->cmd buffer
using snprintf() etc was adapted from v4 of Masami's 'tracing/boot:
Add synthetic event support' patch.

Link: http://lkml.kernel.org/r/1f65fa44390b6f238f6036777c3784ced1dcc6a0.1580323897.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
f5f6b255a2 tracing: Add synth_event_delete()
create_or_delete_synth_event() contains code to delete a synthetic
event, which would be useful on its own - specifically, it would be
useful to allow event-creating modules to call it separately.

Separate out the delete code from that function and create an exported
function named synth_event_delete().

Link: http://lkml.kernel.org/r/050db3b06df7f0a4b8a2922da602d1d879c7c1c2.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
e3e2a2cc9c tracing: Add trace_get/put_event_file()
Add a function to get an event file and prevent it from going away on
module or instance removal.

trace_get_event_file() will find an event file in a given instance (if
instance is NULL, it assumes the top trace array) and return it,
pinning the instance's trace array as well as the event's module, if
applicable, so they won't go away while in use.

trace_put_event_file() does the matching release.

Link: http://lkml.kernel.org/r/bb31ac4bdda168d5ed3c4b5f5a4c8f633e8d9118.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
[ Moved trace_array_put() to end of trace_put_event_file() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:28 -05:00
Tom Zanussi
89c95fcef1 tracing: Add trace_array_find/_get() to find instance trace arrays
Add a new trace_array_find() function that can be used to find a trace
array given the instance name, and replace existing code that does the
same thing with it.  Also add trace_array_find_get() which does the
same but returns the trace array after upping its refcount.

Also make both available for use outside of trace.c.

Link: http://lkml.kernel.org/r/cb68528c975eba95bee4561ac67dd1499423b2e5.1580323897.git.zanussi@kernel.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:27 -05:00
Vasily Averin
6722b23e7a trigger_next should increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

Without patch:
 # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
 dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
 n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
 # Available triggers:
 # traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
 6+1 records in
 6+1 records out
 206 bytes copied, 0.00027916 s, 738 kB/s

Notice the printing of "# Available triggers:..." after the line.

With the patch:
 # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
 dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
 n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
 2+1 records in
 2+1 records out
 88 bytes copied, 0.000526867 s, 167 kB/s

It only prints the end of the file, and does not restart.

Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.com

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:27 -05:00
Vasily Averin
039958a5f7 tracing: eval_map_next() should always increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

Link: http://lkml.kernel.org/r/7ad85b22-1866-977c-db17-88ac438bc764@virtuozzo.com

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
[ This is not a bug fix, it just makes it "technically correct"
  which is why I applied it. NULL is only returned on an anomaly
  which triggers a WARN_ON ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:27 -05:00
Vasily Averin
e4075e8bdf ftrace: fpid_next() should increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

Without patch:
 # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
 dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
 id
 no pid
 2+1 records in
 2+1 records out
 10 bytes copied, 0.000213285 s, 46.9 kB/s

Notice the "id" followed by "no pid".

With the patch:
 # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
 dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
 id
 0+1 records in
 0+1 records out
 3 bytes copied, 0.000202112 s, 14.8 kB/s

Notice that it only prints "id" and not the "no pid" afterward.

Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.com

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:27 -05:00
Mathieu Desnoyers
64ae572bc7 tracing: Fix sched switch start/stop refcount racy updates
Reading the sched_cmdline_ref and sched_tgid_ref initial state within
tracing_start_sched_switch without holding the sched_register_mutex is
racy against concurrent updates, which can lead to tracepoint probes
being registered more than once (and thus trigger warnings within
tracepoint.c).

[ May be the fix for this bug ]
Link: https://lore.kernel.org/r/000000000000ab6f84056c786b93@google.com

Link: http://lkml.kernel.org/r/20190817141208.15226-1-mathieu.desnoyers@efficios.com

Cc: stable@vger.kernel.org
CC: Steven Rostedt (VMware) <rostedt@goodmis.org>
CC: Joel Fernandes (Google) <joel@joelfernandes.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Paul E. McKenney <paulmck@linux.ibm.com>
Reported-by: syzbot+774fddf07b7ab29a1e55@syzkaller.appspotmail.com
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-30 09:46:10 -05:00
Nicolas Saenz Julienne
8c8c5a4994 dma-contiguous: CMA: give precedence to cmdline
Although the device tree might contain a reserved-memory DT node
dedicated as the default CMA pool, users might want to change CMA's
parameters using the kernel command line for debugging purposes and
whatnot. Honor this by bypassing the reserved memory CMA setup, which
will ultimately end up freeing the memblock and allow the command line
CMA configuration routine to run.

Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Phil Elwell <phil@raspberrypi.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-01-30 14:41:42 +01:00
Linus Torvalds
39bed42de2 hmm related patches for 5.6
This small series revises the names in mmu_notifier to make the code
 clearer and more readable.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl4wf2EACgkQOG33FX4g
 mxqrdw//XIexbXQqP4dUKFCFeI7Um6ZqYE6iVCQi6JEetpKxCR8BSrJsq6EP60Mg
 cVCKolISuudzOccz/liotg9SrwRlcO3mzucd8LJZG0v2FZMzQr0EKjst0RC4/xvK
 U2RxGvwLQ+XVR/3/l6hXyWyw7u28+F1RsfQMMX3kqR3qlcQachQ3k7oUINDIq2XH
 JkQcBV+XK0doXEp6VCCVKwuwEN7O5xSm8lAIHDNFZEEPre0iKxwatgWxdXFIWQek
 tRywwB7bRzFROBlDcoOQ0GDTqScr3bghz6vWU4GGv3avYkystKwy44ha6BzO2xQc
 ZNIo8AN9UFFhcmF531wklsXTCbxbxJAJAwdyIuQnKq5glw64EFnrjo2sxuL6s56h
 C1GHADtxDccv+nr2sKP/rFFeq9K3VqHDtjEdBOhReuB0Vp1YfVr17A4R8yAn8A+1
 vm3IusoOq+g8qMYxRHEb+76/S//joaxAlFQkU5Gjn/0xsykP99YQSQFBjXmkzWlS
 IiHLf0HJiCCL8SHe4Wnyhyl1DUIIl38HQULqbFWZ8hK4ELhTd2KEuDxzT8q+v+v7
 2M9nBVdRaw1kskGiFv+F7mb6c990CTEZO9B5fHpAjPRxeVkLYc06QfJY+hXbbu4c
 6yzIvERRRlAviCmgb7G+3pLyBCKdvlIlCVsVOdxHXSRsl904BnA=
 =hhT0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull mmu_notifier updates from Jason Gunthorpe:
 "This small series revises the names in mmu_notifier to make the code
  clearer and more readable"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier
  mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier
  mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions
2020-01-29 19:56:50 -08:00
Linus Torvalds
83fa805bcb threads-v5.6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXjFo8wAKCRCRxhvAZXjc
 omaGAQDVwCHQekqxp2eC8EJH4Pkt+Bn1BLrA25stlTo93YBPHgEAsPVUCRNcrZAl
 VncYmxCfpt3Yu0S/MTVXu5xrRiIXPQk=
 =uqTN
 -----END PGP SIGNATURE-----

Merge tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread management updates from Christian Brauner:
 "Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
  syscall.

  This syscall allows for the retrieval of file descriptors of a process
  based on its pidfd. A task needs to have ptrace_may_access()
  permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
  Andy) on the target.

  One of the main use-cases is in combination with seccomp's user
  notification feature. As a reminder, seccomp's user notification
  feature was made available in v5.0. It allows a task to retrieve a
  file descriptor for its seccomp filter. The file descriptor is usually
  handed of to a more privileged supervising process. The supervisor can
  then listen for syscall events caught by the seccomp filter of the
  supervisee and perform actions in lieu of the supervisee, usually
  emulating syscalls. pidfd_getfd() is needed to expand its uses.

  There are currently two major users that wait on pidfd_getfd() and one
  future user:

   - Netflix, Sargun said, is working on a service mesh where users
     should be able to connect to a dns-based VIP. When a user connects
     to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
     redirected to an envoy process. This service mesh uses seccomp user
     notifications and pidfd to intercept all connect calls and instead
     of connecting them to 1.2.3.4:80 connects them to e.g.
     127.0.0.1:8080.

   - LXD uses the seccomp notifier heavily to intercept and emulate
     mknod() and mount() syscalls for unprivileged containers/processes.
     With pidfd_getfd() more uses-cases e.g. bridging socket connections
     will be possible.

   - The patchset has also seen some interest from the browser corner.
     Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
     broker process. In the future glibc will start blocking all signals
     during dlopen() rendering this type of sandbox impossible. Hence,
     in the future Firefox will switch to a seccomp-user-nofication
     based sandbox which also makes use of file descriptor retrieval.
     The thread for this can be found at
     https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html

  With pidfd_getfd() it is e.g. possible to bridge socket connections
  for the supervisee (binding to a privileged port) and taking actions
  on file descriptors on behalf of the supervisee in general.

  Sargun's first version was using an ioctl on pidfds but various people
  pushed for it to be a proper syscall which he duely implemented as
  well over various review cycles. Selftests are of course included.
  I've also added instructions how to deal with merge conflicts below.

  There's also a small fix coming from the kernel mentee project to
  correctly annotate struct sighand_struct with __rcu to fix various
  sparse warnings. We've received a few more such fixes and even though
  they are mostly trivial I've decided to postpone them until after -rc1
  since they came in rather late and I don't want to risk introducing
  build warnings.

  Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
  needed to avoid allocation recursions triggerable by storage drivers
  that have userspace parts that run in the IO path (e.g. dm-multipath,
  iscsi, etc). These allocation recursions deadlock the device.

  The new prctl() allows such privileged userspace components to avoid
  allocation recursions by setting the PF_MEMALLOC_NOIO and
  PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
  relevant maintainers and is routed here as part of prctl()
  thread-management."

* tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
  sched.h: Annotate sighand_struct with __rcu
  test: Add test for pidfd getfd
  arch: wire up pidfd_getfd syscall
  pid: Implement pidfd_getfd syscall
  vfs, fdtable: Add fget_task helper
2020-01-29 19:38:34 -08:00
Linus Torvalds
08a3ef8f6b linux-kselftest-5.6-rc1-kunit
This kunit update for Linux 5.6-rc1 consists of:
 
 -- Support for building kunit as a module from Alan Maguire
 -- AppArmor KUnit tests for policy unpack from Mike Salvatore
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl4xz/wACgkQCwJExA0N
 Qxyg2A//X0bnhN82oCchkTRW3GyGi5wTR2wGhoNzMZD0XUtCvn+4BlCSP20ttYdT
 beiLCiewcuEdvXRyEV9Kikvet/67ovbjA/ce6ZrR7TlIHo8esKcy19/nu1OTvtI1
 8eji1q7NSEV9iswz1ZoBAw+MTDHZfOI9qYY2UPcwjy7xWN84z2X1w+8UQ3EamOKd
 6BfbohsYuuTTHhA2k1aUzvQcHqNz0YdH4yvNQpdunJXLUI04TeGZA6Ug66u6kWEd
 1f5SSAu6r1vnU7DADrb1QwEDuIwL4KBuaMg2Rj5GLxTNp3wxmW9M2Dit+iN7+vNH
 TS31kZW6KgxC5XuGVPENJaWlDX5Hm+5W8uiRZLNXsxDy927u53RzwrSZw/FbdbB1
 HuPZZCzE1soWHdPIQz44HCCAg9XddypYlC1o4IYL1JkJknqG12ky4xgM8GRNCZAB
 oUW3Ax3Lcr0EJALO/kFd/uEbl79PdmDk8uPMU1jtLyx5cs70yC3fsT2GB+DbP802
 i/FxTtrOMGjU2OWcYfQcXapvZdgImf9nPsSZe3FJXjHfytNRbVZOZ2rHAMh03Keu
 EBthDs6ejm6OUSGUXjngE9NaQKXsNSQ1Qor+6FrGnT4IxUMzWenudqHH7/dgF7Fr
 fHlZGBilKMc/EYKb/6hj4kvEChrSIXj6TFknmI28I/epPiOr2gU=
 =AFO4
 -----END PGP SIGNATURE-----

Merge tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull Kselftest kunit updates from Shuah Khan:
 "This kunit update consists of:

   - Support for building kunit as a module from Alan Maguire

   - AppArmor KUnit tests for policy unpack from Mike Salvatore"

* tag 'linux-kselftest-5.6-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: building kunit as a module breaks allmodconfig
  kunit: update documentation to describe module-based build
  kunit: allow kunit to be loaded as a module
  kunit: remove timeout dependence on sysctl_hung_task_timeout_seconds
  kunit: allow kunit tests to be loaded as a module
  kunit: hide unexported try-catch interface in try-catch-impl.h
  kunit: move string-stream.h to lib/kunit
  apparmor: add AppArmor KUnit tests for policy unpack
2020-01-29 15:25:34 -08:00
Linus Torvalds
22b17db4ea y2038: core, driver and file system changes
These are updates to device drivers and file systems that for some reason
 or another were not included in the kernel in the previous y2038 series.
 
 I've gone through all users of time_t again to make sure the kernel is
 in a long-term maintainable state, replacing all remaining references
 to time_t with safe alternatives.
 
 Some related parts of the series were picked up into the nfsd, xfs,
 alsa and v4l2 trees. A final set of patches in linux-mm removes the now
 unused time_t/timeval/timespec types and helper functions after all five
 branches are merged for linux-5.6, ensuring that no new users get merged.
 
 As a result, linux-5.6, or my backport of the patches to 5.4 [1], should
 be the first release that can serve as a base for a 32-bit system designed
 to run beyond year 2038, with a few remaining caveats:
 
 - All user space must be compiled with a 64-bit time_t, which will be
   supported in the coming musl-1.2 and glibc-2.32 releases, along with
   installed kernel headers from linux-5.6 or higher.
 
 - Applications that use the system call interfaces directly need to be
   ported to use the time64 syscalls added in linux-5.1 in place of the
   existing system calls. This impacts most users of futex() and seccomp()
   as well as programming languages that have their own runtime environment
   not based on libc.
 
 - Applications that use a private copy of kernel uapi header files or
   their contents may need to update to the linux-5.6 version, in
   particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
   linux/elfcore.h, linux/sockios.h, linux/timex.h and linux/can/bcm.h.
 
 - A few remaining interfaces cannot be changed to pass a 64-bit time_t
   in a compatible way, so they must be configured to use CLOCK_MONOTONIC
   times or (with a y2106 problem) unsigned 32-bit timestamps. Most
   importantly this impacts all users of 'struct input_event'.
 
 - All y2038 problems that are present on 64-bit machines also apply to
   32-bit machines. In particular this affects file systems with on-disk
   timestamps using signed 32-bit seconds: ext4 with ext3-style small
   inodes, ext2, xfs (to be fixed soon) and ufs.
 
 Changes since v1 [2]:
 
 - Add Acks I received
 - Rebase to v5.5-rc1, dropping patches that got merged already
 - Add NFS, XFS and the final three patches from another series
 - Rewrite etnaviv patches
 - Add one late revert to avoid an etnaviv regression
 
 [1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
 [2] https://lore.kernel.org/lkml/20191108213257.3097633-1-arnd@arndb.de/
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJeMYy3AAoJEGCrR//JCVInEGwP/0R+S+ok7vw9OdLVT0lFl07D
 IcVabgOWf24imN7m7L7Mlt3nDfxIT4tMpiAXq7eMO3spcyViG18O2LXdSQ4/7QBp
 +BlhoMjOP9w34Jyd7mnkFr4vqQALvfIqkS8rFObDtDub2Rfj9PC36MRMIu8BPXlv
 RK8bigwJeH/DV38yc5/JeUcD+WuewYLsK9XPWN+4yB4vgGsNU3ZQQ6nnzbR3hMsN
 DN8WZ68Y7IBs0Kyxkf+s2zmRXtCa2RiFg/2TUsk5olVAJVaenvte69hq5RSbg1vW
 vLi6K8cBoPWL59nqCzcNE+TUhSUg3LOj/a/KWyl76yovz7AlJaNjssOf8ZjHw6sL
 MhQqz3hXTxiJDS2Jvbf1yojiYGlzrq/gqcRFGe9jPcZdieMc4/yZCx60G/Exa5Pu
 YdMcqMyDWPFyUAFQNWEF59HPheOdj6tb1KpJ6bwgCo3P7QqhLrU4z9w3Py4/ZfBO
 4sWcWteSsD6MN/ADJ2WQ56nNxzM2AvkeVJKcF6FCkdngXX9T0GExmZz7SqB5Du99
 9lNjIiD5E+LBa/Swo/7n49aYa8x06V1pmHYTZVh9Wkl+CZiO21umezQFrWsfaMTp
 xt3c6pFdMG5xNMGpreTAXOmf2R+T6O8IO2qQq/TYjzqOLH7QC830P7avkmml+cK1
 LjOBE2TfSeO8Ru1dXV4t
 =wx0A
 -----END PGP SIGNATURE-----

Merge tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull y2038 updates from Arnd Bergmann:
 "Core, driver and file system changes

  These are updates to device drivers and file systems that for some
  reason or another were not included in the kernel in the previous
  y2038 series.

  I've gone through all users of time_t again to make sure the kernel is
  in a long-term maintainable state, replacing all remaining references
  to time_t with safe alternatives.

  Some related parts of the series were picked up into the nfsd, xfs,
  alsa and v4l2 trees. A final set of patches in linux-mm removes the
  now unused time_t/timeval/timespec types and helper functions after
  all five branches are merged for linux-5.6, ensuring that no new users
  get merged.

  As a result, linux-5.6, or my backport of the patches to 5.4 [1],
  should be the first release that can serve as a base for a 32-bit
  system designed to run beyond year 2038, with a few remaining caveats:

   - All user space must be compiled with a 64-bit time_t, which will be
     supported in the coming musl-1.2 and glibc-2.32 releases, along
     with installed kernel headers from linux-5.6 or higher.

   - Applications that use the system call interfaces directly need to
     be ported to use the time64 syscalls added in linux-5.1 in place of
     the existing system calls. This impacts most users of futex() and
     seccomp() as well as programming languages that have their own
     runtime environment not based on libc.

   - Applications that use a private copy of kernel uapi header files or
     their contents may need to update to the linux-5.6 version, in
     particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
     linux/elfcore.h, linux/sockios.h, linux/timex.h and
     linux/can/bcm.h.

   - A few remaining interfaces cannot be changed to pass a 64-bit
     time_t in a compatible way, so they must be configured to use
     CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
     timestamps. Most importantly this impacts all users of 'struct
     input_event'.

   - All y2038 problems that are present on 64-bit machines also apply
     to 32-bit machines. In particular this affects file systems with
     on-disk timestamps using signed 32-bit seconds: ext4 with
     ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame

* tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
  Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
  y2038: sh: remove timeval/timespec usage from headers
  y2038: sparc: remove use of struct timex
  y2038: rename itimerval to __kernel_old_itimerval
  y2038: remove obsolete jiffies conversion functions
  nfs: fscache: use timespec64 in inode auxdata
  nfs: fix timstamp debug prints
  nfs: use time64_t internally
  sunrpc: convert to time64_t for expiry
  drm/etnaviv: avoid deprecated timespec
  drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
  drm/msm: avoid using 'timespec'
  hfs/hfsplus: use 64-bit inode timestamps
  hostfs: pass 64-bit timestamps to/from user space
  packet: clarify timestamp overflow
  tsacct: add 64-bit btime field
  acct: stop using get_seconds()
  um: ubd: use 64-bit time_t where possible
  xtensa: ISS: avoid struct timeval
  dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
  ...
2020-01-29 14:55:47 -08:00
Linus Torvalds
a4fe2b4d87 Printk changes for 5.6
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl4xgVoACgkQUqAMR0iA
 lPInOw//XnGCL9WggQQV/Kq8JSlXz96quZcPMoIOQkXQQp56FfGz3Y8NtNFtAOpG
 BiA1VeOkmfdGP08mtUvEjrvZM35JBQxtn3FWbuNMqBmlnVrffFaYTizcCnGG0w6Y
 rLaVSOqml1FqUKq8unxZvBpjactqVLC85L8dmEJD9/SpZwQJZky/fSpDeuMHTgx2
 KZ0tilIc+hJNawgXHJWfl6+EIMa6ZVl9IMFO+i87I4kdOpXzyC2vdqD8r7irYzB6
 j4KakPSTgpm3GdIOMijENEeGWvqxD/1jm41ujbDGeE6+WnKW/UXxhgbYZhGlKzSS
 HLU49Pmk9TtyeSRewue6pZtG2nPj+UwT3qNMRyNK8u53EoN/eFBys2h7tEildRKY
 jHquIYY849YpC1/Db38shHOD0Phx+VpxzMIM0ZjLZmKVJyaAzdg2srcHcXWS8EmU
 ij9Ybe9T+7JKvS/l4rMaw44yoZJ7ePs62fMnCcJF38RojwqJGvwRRcLr8U4X09ap
 PlAPXykcZkIpYge/6dzWSCQfHUeJvoHN5YBoBOH5sx3xlimXaHnmEZA4OVbRknFo
 Ye8xjkUKejFsONWLu8Jh5P78ifcZw99hOpX4Cv+opc4q3nVJuQ4RgWR5PfD9F+U7
 dvEkboTHme0mFbeQCz1WJtKr7xB4NO8O62suqYY0dDvWOyCdcVc=
 =TQ5g
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk update from Petr Mladek:
 "Prevent replaying log on all consoles"

* tag 'printk-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  printk: fix exclusive_console replaying
2020-01-29 14:53:23 -08:00
Linus Torvalds
6aee4badd8 Merge branch 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull openat2 support from Al Viro:
 "This is the openat2() series from Aleksa Sarai.

  I'm afraid that the rest of namei stuff will have to wait - it got
  zero review the last time I'd posted #work.namei, and there had been a
  leak in the posted series I'd caught only last weekend. I was going to
  repost it on Monday, but the window opened and the odds of getting any
  review during that... Oh, well.

  Anyway, openat2 part should be ready; that _did_ get sane amount of
  review and public testing, so here it comes"

From Aleksa's description of the series:
 "For a very long time, extending openat(2) with new features has been
  incredibly frustrating. This stems from the fact that openat(2) is
  possibly the most famous counter-example to the mantra "don't silently
  accept garbage from userspace" -- it doesn't check whether unknown
  flags are present[1].

  This means that (generally) the addition of new flags to openat(2) has
  been fraught with backwards-compatibility issues (O_TMPFILE has to be
  defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
  kernels gave errors, since it's insecure to silently ignore the
  flag[2]). All new security-related flags therefore have a tough road
  to being added to openat(2).

  Furthermore, the need for some sort of control over VFS's path
  resolution (to avoid malicious paths resulting in inadvertent
  breakouts) has been a very long-standing desire of many userspace
  applications.

  This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
  (which was a variant of David Drysdale's O_BENEATH patchset[4] which
  was a spin-off of the Capsicum project[5]) with a few additions and
  changes made based on the previous discussion within [6] as well as
  others I felt were useful.

  In line with the conclusions of the original discussion of
  AT_NO_JUMPS, the flag has been split up into separate flags. However,
  instead of being an openat(2) flag it is provided through a new
  syscall openat2(2) which provides several other improvements to the
  openat(2) interface (see the patch description for more details). The
  following new LOOKUP_* flags are added:

  LOOKUP_NO_XDEV:

     Blocks all mountpoint crossings (upwards, downwards, or through
     absolute links). Absolute pathnames alone in openat(2) do not
     trigger this. Magic-link traversal which implies a vfsmount jump is
     also blocked (though magic-link jumps on the same vfsmount are
     permitted).

  LOOKUP_NO_MAGICLINKS:

     Blocks resolution through /proc/$pid/fd-style links. This is done
     by blocking the usage of nd_jump_link() during resolution in a
     filesystem. The term "magic-links" is used to match with the only
     reference to these links in Documentation/, but I'm happy to change
     the name.

     It should be noted that this is different to the scope of
     ~LOOKUP_FOLLOW in that it applies to all path components. However,
     you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
     will *not* fail (assuming that no parent component was a
     magic-link), and you will have an fd for the magic-link.

     In order to correctly detect magic-links, the introduction of a new
     LOOKUP_MAGICLINK_JUMPED state flag was required.

  LOOKUP_BENEATH:

     Disallows escapes to outside the starting dirfd's
     tree, using techniques such as ".." or absolute links. Absolute
     paths in openat(2) are also disallowed.

     Conceptually this flag is to ensure you "stay below" a certain
     point in the filesystem tree -- but this requires some additional
     to protect against various races that would allow escape using
     "..".

     Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
     can trivially beam you around the filesystem (breaking the
     protection). In future, there might be similar safety checks done
     as in LOOKUP_IN_ROOT, but that requires more discussion.

  In addition, two new flags are added that expand on the above ideas:

  LOOKUP_NO_SYMLINKS:

     Does what it says on the tin. No symlink resolution is allowed at
     all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
     can still be used with NOFOLLOW to open an fd for the symlink as
     long as no parent path had a symlink component.

  LOOKUP_IN_ROOT:

     This is an extension of LOOKUP_BENEATH that, rather than blocking
     attempts to move past the root, forces all such movements to be
     scoped to the starting point. This provides chroot(2)-like
     protection but without the cost of a chroot(2) for each filesystem
     operation, as well as being safe against race attacks that
     chroot(2) is not.

     If a race is detected (as with LOOKUP_BENEATH) then an error is
     generated, and similar to LOOKUP_BENEATH it is not permitted to
     cross magic-links with LOOKUP_IN_ROOT.

     The primary need for this is from container runtimes, which
     currently need to do symlink scoping in userspace[7] when opening
     paths in a potentially malicious container.

     There is a long list of CVEs that could have bene mitigated by
     having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
     CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
     few).

  In order to make all of the above more usable, I'm working on
  libpathrs[8] which is a C-friendly library for safe path resolution.
  It features a userspace-emulated backend if the kernel doesn't support
  openat2(2). Hopefully we can get userspace to switch to using it, and
  thus get openat2(2) support for free once it's ready.

  Future work would include implementing things like
  RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
  programs to be sure they don't hit DoSes though stale NFS handles)"

* 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Documentation: path-lookup: include new LOOKUP flags
  selftests: add openat2(2) selftests
  open: introduce openat2(2) syscall
  namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
  namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
  namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
  namei: LOOKUP_NO_XDEV: block mountpoint crossing
  namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
  namei: LOOKUP_NO_SYMLINKS: block symlink resolution
  namei: allow set_root() to produce errors
  namei: allow nd_jump_link() to produce errors
  nsfs: clean-up ns_get_path() signature to return int
  namei: only return -ECHILD from follow_dotdot_rcu()
2020-01-29 11:20:24 -08:00
Linus Torvalds
15d6632496 Merge branch 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU warning removal from Paul McKenney:
 "A single commit that fixes an embarrassing bug discussed here:

      https://lore.kernel.org/lkml/20200125131425.GB16136@zn.tnic/

  which apparently also affects smaller systems"

[ This was sent to Ingo, but since I see the issue on the laptop I use for
  testing during the merge window, I'm doing the pull directly     - Linus ]

* 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Forgive slow expedited grace periods at boot time
2020-01-29 11:04:49 -08:00
Martin KaFai Lau
d3e42bb0a3 bpf: Reuse log from btf_prase_vmlinux() in btf_struct_ops_init()
Instead of using a locally defined "struct bpf_verifier_log log = {}",
btf_struct_ops_init() should reuse the "log" from its calling
function "btf_parse_vmlinux()".  It should also resolve the
frame-size too large compiler warning in some ARCH.

Fixes: 27ae7997a6 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200127175145.1154438-1-kafai@fb.com
2020-01-29 16:40:54 +01:00
Masami Hiramatsu
5c3469cb89 tracing/boot: Move external function declarations to kernel/trace/trace.h
Move external function declarations into kernel/trace/trace.h
from trace_boot.c for tracing subsystem internal use.

Link: http://lkml.kernel.org/r/158029060405.12381.11944554430359702545.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-29 08:49:04 -05:00
Masami Hiramatsu
76a598ec8c tracing/boot: Include required headers and sort it alphabetically
Include some required (but currently indirectly included)
headers and sort it alphabetically.

Link: http://lkml.kernel.org/r/158029059514.12381.6597832266860248781.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-29 08:48:44 -05:00
Tom Zanussi
d0a497066f tracing: Add 'hist:' to hist trigger error log error string
The 'hist:' prefix gets stripped from the command text during command
processing, but should be added back when displaying the command
during error processing.

Not only because it's what should be displayed but also because not
having it means the test cases fail because the caret is miscalculated
by the length of the prefix string.

Link: http://lkml.kernel.org/r/449df721f560042e22382f67574bcc5b4d830d3d.1561743018.git.zanussi@kernel.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-28 23:17:10 -05:00
Tom Zanussi
4de26c8c96 tracing: Add hist trigger error messages for sort specification
Add error codes and messages for all the error paths leading to sort
specification parsing errors.

Link: http://lkml.kernel.org/r/237830dc05e583fbb53664d817a784297bf961be.1561743018.git.zanussi@kernel.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-28 23:16:44 -05:00
Tom Zanussi
b527b638fd tracing: Simplify assignment parsing for hist triggers
In the process of adding better error messages for sorting, I realized
that strsep was being used incorrectly and some of the error paths I
was expecting to be hit weren't and just fell through to the common
invalid key error case.

It also became obvious that for keyword assignments, it wasn't
necessary to save the full assignment and reparse it later, and having
a common empty-assignment check would also make more sense in terms of
error processing.

Change the code to fix these problems and simplify it for new error
message changes in a subsequent patch.

Link: http://lkml.kernel.org/r/1c3ef0b6655deaf345f6faee2584a0298ac2d743.1561743018.git.zanussi@kernel.org

Fixes: e62347d245 ("tracing: Add hist trigger support for user-defined sorting ('sort=' param)")
Fixes: 7ef224d1d0 ("tracing: Add 'hist' event trigger command")
Fixes: a4072fe85b ("tracing: Add a clock attribute for hist triggers")
Reported-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-28 23:16:27 -05:00
Linus Torvalds
fad7bdc9b0 This pull request contains the following changes for UML:
- Fix for time travel mode
 - Disable CONFIG_CONSTRUCTORS again
 - A new command line option to have an non-raw serial line
 - Preparations to remove obsolete UML network drivers
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl4k2EYWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wTe2EACDEsoWZvvKnocFH/umFfZdxciU
 Ys5noEPElnILVIwV+Gm9SHq/RQWzG8BqSOirfOn1iGhEqWjDTPzwqPuqFGxKtRVp
 VoaYDA506oDH903i4vj1OuGDHxgModEmR/GFqU9uEtXUws2qbeZQcG0COkquJU8X
 URMz4XB+KLqDI2TvOTnbWevjJnslwLIqRuDdZ2q0d685J1XhRhuq/srgZGMiUpGn
 4H/E4k0UxlC082oh9QWRFYYyc6vhyvlguupphzBgICZQmP4P4ck3pe23OT+vOWBl
 +e2ti9MlB9/Tv3dGhzmq2180U0D74RvtHIi7RjUdaTcEoOkgDwXqKsZ1CY4kCV78
 mxrXHCE6YUMvsQcTBxobXYD/zUXeqXtlSHyGQ4MUATCvI6ag8vWKWjGXV/kDVWdf
 FEeL0O6AHjruTrPxi1aSJ3TFG+JerXCGZpSt2DG67sCcWJ/RqYnrs45DF4U6ywf4
 BQ/nA0bpdZouLrhtCS6yBRvPiA5TVXHmrQMpK/LsOpBD4sKCV+MXghbYoWAwcSoM
 H+RSpf1em3zQrlRcuNPW8XGVkqOmUKn9pFzT9ybWv0h2hVhrDiutjJEPgbpJooIr
 yB0G/MVTtk3Xrok2lq8TT+Hp13TWCTFynsmKYvgv4s37p5jA5fvKL0vhdhIlAxHE
 FCyGsZIkAcMLfjvC3Q==
 =yi/o
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Anton Ivanov:
 "I am sending this on behalf of Richard who is traveling.

  This contains the following changes for UML:

   - Fix for time travel mode

   - Disable CONFIG_CONSTRUCTORS again

   - A new command line option to have an non-raw serial line

   - Preparations to remove obsolete UML network drivers"

* tag 'for-linus-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: Fix time-travel=inf-cpu with xor/raid6
  Revert "um: Enable CONFIG_CONSTRUCTORS"
  um: Mark non-vector net transports as obsolete
  um: Add an option to make serial driver non-raw
2020-01-28 18:29:25 -08:00
Linus Torvalds
a78416d974 Kprobe events added "ustring" to distinguish reading strings from kernel space
or user space. But the creating of the event format file only checks for
 "string" to display string formats. "ustring" must also be handled.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXi8JRxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvaiAP943Srl0C1NHuKtJGHpYkgHJRt4mPFO
 569Wx82a2ODH4AEA/D8uda0+p0wJB/uDnd/VyhTeb1nAjqzhx4pfGPNjaw8=
 =lu3h
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Kprobe events added 'ustring' to distinguish reading strings from
  kernel space or user space.

  But the creating of the event format file only checks for 'string' to
  display string formats. 'ustring' must also be handled"

* tag 'trace-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Have uname use __get_str() in print_fmt
2020-01-28 18:26:09 -08:00
Linus Torvalds
bd2463ac7d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Add WireGuard

 2) Add HE and TWT support to ath11k driver, from John Crispin.

 3) Add ESP in TCP encapsulation support, from Sabrina Dubroca.

 4) Add variable window congestion control to TIPC, from Jon Maloy.

 5) Add BCM84881 PHY driver, from Russell King.

 6) Start adding netlink support for ethtool operations, from Michal
    Kubecek.

 7) Add XDP drop and TX action support to ena driver, from Sameeh
    Jubran.

 8) Add new ipv4 route notifications so that mlxsw driver does not have
    to handle identical routes itself. From Ido Schimmel.

 9) Add BPF dynamic program extensions, from Alexei Starovoitov.

10) Support RX and TX timestamping in igc, from Vinicius Costa Gomes.

11) Add support for macsec HW offloading, from Antoine Tenart.

12) Add initial support for MPTCP protocol, from Christoph Paasch,
    Matthieu Baerts, Florian Westphal, Peter Krystad, and many others.

13) Add Octeontx2 PF support, from Sunil Goutham, Geetha sowjanya, Linu
    Cherian, and others.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1469 commits)
  net: phy: add default ARCH_BCM_IPROC for MDIO_BCM_IPROC
  udp: segment looped gso packets correctly
  netem: change mailing list
  qed: FW 8.42.2.0 debug features
  qed: rt init valid initialization changed
  qed: Debug feature: ilt and mdump
  qed: FW 8.42.2.0 Add fw overlay feature
  qed: FW 8.42.2.0 HSI changes
  qed: FW 8.42.2.0 iscsi/fcoe changes
  qed: Add abstraction for different hsi values per chip
  qed: FW 8.42.2.0 Additional ll2 type
  qed: Use dmae to write to widebus registers in fw_funcs
  qed: FW 8.42.2.0 Parser offsets modified
  qed: FW 8.42.2.0 Queue Manager changes
  qed: FW 8.42.2.0 Expose new registers and change windows
  qed: FW 8.42.2.0 Internal ram offsets modifications
  MAINTAINERS: Add entry for Marvell OcteonTX2 Physical Function driver
  Documentation: net: octeontx2: Add RVU HW and drivers overview
  octeontx2-pf: ethtool RSS config support
  octeontx2-pf: Add basic ethtool support
  ...
2020-01-28 16:02:33 -08:00
Linus Torvalds
a78208e243 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Removed CRYPTO_TFM_RES flags
   - Extended spawn grabbing to all algorithm types
   - Moved hash descsize verification into API code

  Algorithms:
   - Fixed recursive pcrypt dead-lock
   - Added new 32 and 64-bit generic versions of poly1305
   - Added cryptogams implementation of x86/poly1305

  Drivers:
   - Added support for i.MX8M Mini in caam
   - Added support for i.MX8M Nano in caam
   - Added support for i.MX8M Plus in caam
   - Added support for A33 variant of SS in sun4i-ss
   - Added TEE support for Raven Ridge in ccp
   - Added in-kernel API to submit TEE commands in ccp
   - Added AMD-TEE driver
   - Added support for BCM2711 in iproc-rng200
   - Added support for AES256-GCM based ciphers for chtls
   - Added aead support on SEC2 in hisilicon"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (244 commits)
  crypto: arm/chacha - fix build failured when kernel mode NEON is disabled
  crypto: caam - add support for i.MX8M Plus
  crypto: x86/poly1305 - emit does base conversion itself
  crypto: hisilicon - fix spelling mistake "disgest" -> "digest"
  crypto: chacha20poly1305 - add back missing test vectors and test chunking
  crypto: x86/poly1305 - fix .gitignore typo
  tee: fix memory allocation failure checks on drv_data and amdtee
  crypto: ccree - erase unneeded inline funcs
  crypto: ccree - make cc_pm_put_suspend() void
  crypto: ccree - split overloaded usage of irq field
  crypto: ccree - fix PM race condition
  crypto: ccree - fix FDE descriptor sequence
  crypto: ccree - cc_do_send_request() is void func
  crypto: ccree - fix pm wrongful error reporting
  crypto: ccree - turn errors to debug msgs
  crypto: ccree - fix AEAD decrypt auth fail
  crypto: ccree - fix typo in comment
  crypto: ccree - fix typos in error msgs
  crypto: atmel-{aes,sha,tdes} - Retire crypto_platform_data
  crypto: x86/sha - Eliminate casts on asm implementations
  ...
2020-01-28 15:38:56 -08:00
Konstantin Khlebnikov
b4fb015eef sched/rt: Optimize checking group RT scheduler constraints
Group RT scheduler contains protection against setting zero runtime for
cgroup with RT tasks. Right now function tg_set_rt_bandwidth() iterates
over all CPU cgroups and calls tg_has_rt_tasks() for any cgroup which
runtime is zero (not only for changed one). Default RT runtime is zero,
thus tg_has_rt_tasks() will is called for almost at CPU cgroups.

This protection already is slightly racy: runtime limit could be changed
between cpu_cgroup_can_attach() and cpu_cgroup_attach() because changing
cgroup attribute does not lock cgroup_mutex while attach does not lock
rt_constraints_mutex. Changing task scheduler class also races with
changing rt runtime: check in __sched_setscheduler() isn't protected.

Function tg_has_rt_tasks() iterates over all threads in the system.
This gives NR_CGROUPS * NR_TASKS operations under single tasklist_lock
locked for read tg_set_rt_bandwidth(). Any concurrent attempt of locking
tasklist_lock for write (for example fork) will stuck with disabled irqs.

This patch makes two optimizations:
1) Remove locking tasklist_lock and iterate only tasks in cgroup
2) Call tg_has_rt_tasks() iff rt runtime changes from non-zero to zero

All changed code is under CONFIG_RT_GROUP_SCHED.

Testcase:

 # mkdir /sys/fs/cgroup/cpu/test{1..10000}
 # echo 0 | tee /sys/fs/cgroup/cpu/test*/cpu.rt_runtime_us

At the same time without patch fork time will be >100ms:

 # perf trace -e clone --duration 100 stress-ng --fork 1

Also remote ping will show timings >100ms caused by irq latency.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/157996383820.4651.11292439232549211693.stgit@buzz
2020-01-28 21:37:09 +01:00
Srikar Dronamraju
bec2860a2b sched/fair: Optimize select_idle_core()
Currently we loop through all threads of a core to evaluate if the core is
idle or not. This is unnecessary. If a thread of a core is not idle, skip
evaluating other threads of a core. Also while clearing the cpumask, bits
of all CPUs of a core can be cleared in one-shot.

Collecting ticks on a Power 9 SMT 8 system around select_idle_core
while running schbench shows us

(units are in ticks, hence lesser is better)
Without patch
    N        Min     Max     Median         Avg      Stddev
x 130        151    1083        284   322.72308   144.41494

With patch
    N        Min     Max     Median         Avg      Stddev   Improvement
x 164         88     610        201   225.79268   106.78943        30.03%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20191206172422.6578-1-srikar@linux.vnet.ibm.com
2020-01-28 21:37:08 +01:00
Giovanni Gherdovich
1567c3e346 x86, sched: Add support for frequency invariance
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and
no regression are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. 4-cores turbo frequency.

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil's formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util despite how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
         active cores turbo w/ invariant schedutil, and intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmark (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
     etc have different MSRs semantic for querying turbo levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

 +-------------------------------------------------------------------------+
 | 1. FREQUENCY INVARIANCE: MOTIVATION
 +-------------------------------------------------------------------------+

For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities its the best approximation
we can make.)

 +-------------------------------------------------------------------------+
 | 2. PECULIARITIES OF X86
 +-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
====================

Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.

Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:

    * 1-core (1C) turbo frequency (the fastest turbo state available)
    * around base frequency (a.k.a. max P-state)
    * something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is), viceversa with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:

    freq_max set to     | effect on DVFS
    --------------------+------------------
    1C turbo            | power efficiency (lower freq choices)
    base freq           | performance (higher util_avg, higher freq requests)
    4C turbo            | a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see below).

 +-------------------------------------------------------------------------+
 | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
 +-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from

    freq_next = 1.25 * freq_curr * util            [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util             [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.

Testing shows improved performances due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

 +-------------------------------------------------------------------------+
 | 4. KNOWN LIMITATIONS
 +-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).

If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those task
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweights the cost of this limitation.

 [Lelli-2018]
 https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/

 +-------------------------------------------------------------------------+
 | 5. PERFORMANCE TESTING
 +-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.

* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC    800 |********
    BASE    3500 |***********************************
    4C      3700 |*************************************
    3C      3800 |**************************************
    2C      3900 |***************************************
    1C      3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2200 |**********************
    8C      2900 |*****************************
    7C      3000 |******************************
    6C      3100 |*******************************
    5C      3200 |********************************
    4C      3300 |*********************************
    3C      3400 |**********************************
    2C      3600 |************************************
    1C      3600 |************************************

* 48x-HASWELL-NUMA
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2300 |***********************
    12C     2600 |**************************
    11C     2600 |**************************
    10C     2600 |**************************
    9C      2600 |**************************
    8C      2600 |**************************
    7C      2600 |**************************
    6C      2600 |**************************
    5C      2700 |***************************
    4C      2800 |****************************
    3C      2900 |*****************************
    2C      3100 |*******************************
    1C      3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
  driver in passive mode.
* The rationale for choosing the various freq_max values to test have been to
  try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
  on all machines), plus one more value closer to base_freq but still in the
  turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but no HWP capable]
  * invariant schedutil, freq_max = 8C turbo
    (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follow are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
  means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
  ... number of processes; for pgbench: varying the number of clients, and so
  on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
  operations/second), the subsequent three show lower-is-better kind of tests
  (i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
  entire unit tests suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
  columns show invariant schedutil for different values of freq_max. 4C turbo
  is circled as it's the value we've chosen for the final implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
pgbench-ro           ~      ~      ~     | ~    |
pgbench-rw           ~      ~      ~     | ~    |
netperf-udp          ~      ~      ~     | ~    |
netperf-tcp          ~      ~      ~     | ~    |
tbench4              1.30   1.14   1.14  | 1.16 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  12C
pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                         +------+

In the table above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in which
it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil to get closer.

If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              1.23   0.95   0.95  | 0.95 |  0.95
kernbench            0.93   0.83   0.83  | 0.83 |  0.82
gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
dbench4              ~      ~      ~     | ~    |
kernbench            ~      ~      ~     | ~    |
gitsource            0.92   0.55   0.55  | 0.55 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              ~      ~      ~     | ~    |  ~
kernbench            0.94   0.90   0.89  | 0.90 |  0.90
gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                         +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. Out
of scope for the patch at hand, but deserving future investigation. Other than
that, we previously ran this campaign with Linux v5.0 and saw the patch doing
better on dbench a the time. We haven't checked closely and can only speculate
at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).

The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.

5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine            : 48x-HASWELL-NUMA
Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.

The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.

Machine            : 80x-BROADWELL-NUMA
Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)

The patch outperform active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine            : 8x-SKYLAKE-UMA
Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                            5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)

                           5.2.0 3C-turbo                   5.2.0 4C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)

In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following table shows average power consumption in watt for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. with don't
have detail on a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |      8C
pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                    +--------+
8x-SKYLAKE-UMA (power consumption, watts)
                                                    +--------+
              BASELINE I_PSTATE/HWP    1C       3C  |     4C |
pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
dbench4           27.28    27.37    27.49    27.41  |  27.38 |
netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
tbench4           41.13    45.61    43.10    43.33  |  43.56 |
kernbench         42.56    42.63    43.01    43.01  |  43.01 |
gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                    +--------+
48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |     12C
pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                    +--------+

A lower power consumption isn't necessarily better, it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).

80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |    8C
pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
dbench4          1.24   0.94   0.95  | 0.94 |  0.92
netperf-udp      ~      1.02   1.02  | ~    |  1.02
netperf-tcp      ~      1.02   ~     | ~    |  1.02
tbench4          1.26   1.10   1.06  | 1.12 |  1.26
kernbench        0.98   0.97   0.97  | 0.97 |  0.98
gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                     +------+

8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                     +------+
         I_PSTATE/HWP     1C     3C  |   4C |
pgbench-ro       ~      ~      ~     | ~    |
pgbench-rw       0.95   0.97   0.96  | 0.96 |
dbench4          ~      ~      ~     | ~    |
netperf-udp      ~      ~      ~     | ~    |
netperf-tcp      ~      ~      ~     | ~    |
tbench4          1.17   1.09   1.08  | 1.10 |
kernbench        ~      ~      ~     | ~    |
gitsource        1.06   1.40   1.40  | 1.40 |
                                     +------+

48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |   12C
pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
pgbench-rw       ~      0.86   ~     | ~    |  0.86
dbench4          ~      1.02   1.02  | 1.02 |  ~
netperf-udp      ~      0.97   1.03  | 1.02 |  ~
netperf-tcp      0.96   ~      ~     | ~    |  ~
tbench4          1.24   ~      1.06  | 1.05 |  1.11
kernbench        0.97   0.97   0.98  | 0.97 |  0.96
gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                     +------+

These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.

+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+

The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.

Subsequent patches will address:

* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont

+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+

Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmark used are:

    db-pgbench-timed-ro-small-xfs
    db-pgbench-timed-rw-small-xfs
    io-dbench4-async-xfs
    network-netperf-unbound
    network-tbench
    scheduler-unbound
    workload-kerndevel-xfs
    workload-shellscripts-xfs
    hpc-nas-c-class-mpi-full-xfs
    hpc-nas-c-class-omp-full

All those benchmarks are generally available on the web:

pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
2020-01-28 21:36:59 +01:00
Vincent Guittot
2a4b03ffc6 sched/fair: Prevent unlimited runtime on throttled group
When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.

We have the following sequence:

sched_move_task
  -dequeue_task: dequeue task and group_entities.
  -put_prev_task: put task and group entities.
  -sched_change_group: move task to new group.
  -enqueue_task: enqueue only task but not group entities because cfs_rq is
    throttled.
  -set_next_task : set task and group_entities as current sched_entity of
    their cfs_rq.

Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let trigger it
immediately instead.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
2020-01-28 21:36:58 +01:00
Wanpeng Li
e938b9c941 sched/nohz: Optimize get_nohz_timer_target()
On a machine, CPU 0 is used for housekeeping, the other 39 CPUs in the
same socket are in nohz_full mode. We can observe huge time burn in the
loop for seaching nearest busy housekeeper cpu by ftrace.

  2)               |                        get_nohz_timer_target() {
  2)   0.240 us    |                          housekeeping_test_cpu();
  2)   0.458 us    |                          housekeeping_test_cpu();

  ...

  2)   0.292 us    |                          housekeeping_test_cpu();
  2)   0.240 us    |                          housekeeping_test_cpu();
  2)   0.227 us    |                          housekeeping_any_cpu();
  2) + 43.460 us   |                        }

This patch optimizes the searching logic by finding a nearest housekeeper
CPU in the housekeeping cpumask, it can minimize the worst searching time
from ~44us to < 10us in my testing. In addition, the last iterated busy
housekeeper can become a random candidate while current CPU is a better
fallback if it is a housekeeper.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/1578876627-11938-1-git-send-email-wanpengli@tencent.com
2020-01-28 21:36:57 +01:00
Qais Yousef
b562d14064 sched/uclamp: Reject negative values in cpu_uclamp_write()
The check to ensure that the new written value into cpu.uclamp.{min,max}
is within range, [0:100], wasn't working because of the signed
comparison

 7301                 if (req.percent > UCLAMP_PERCENT_SCALE) {
 7302                         req.ret = -ERANGE;
 7303                         return req;
 7304                 }

	# echo -1 > cpu.uclamp.min
	# cat cpu.uclamp.min
	42949671.96

Cast req.percent into u64 to force the comparison to be unsigned and
work as intended in capacity_from_percent().

	# echo -1 > cpu.uclamp.min
	sh: write error: Numerical result out of range

Fixes: 2480c09313 ("sched/uclamp: Extend CPU's cgroup controller")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
2020-01-28 21:36:56 +01:00
Mel Gorman
b396f52326 sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.

This patch allows a fixed degree of imbalance of two tasks to exist
between NUMA domains regardless of utilisation levels. In many cases,
this prevents communicating tasks being pulled apart. It was evaluated
whether the imbalance should be scaled to the domain size. However, no
additional benefit was measured across a range of workloads and machines
and scaling adds the risk that lower domains have to be rebalanced. While
this could change again in the future, such a change should specify the
use case and benefit.

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.

 2-socket Haswell machine 48 core, HT enabled
 netperf-tcp -- mmtests config config-network-netperf-unbound
			      baseline              lbnuma-v3
 Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
 Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
 Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
 Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
 Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
 Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
 Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
 Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
 Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
 Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
 Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
 Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
 Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
 Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
 Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
 Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
 Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
 Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)

In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not univeral.
For most machines, the impact was roughly a 3% performance gain.

 Ops NUMA base-page range updates       19796.00         292.00
 Ops NUMA PTE updates                   19796.00         292.00
 Ops NUMA PMD updates                       0.00           0.00
 Ops NUMA hint faults                   16113.00         143.00
 Ops NUMA hint local faults %            8407.00         142.00
 Ops NUMA hint local percent               52.18          99.30
 Ops NUMA pages migrated                 4244.00           1.00

Without the patch, only 52.18% of sampled accesses are local.  In an
earlier changelog, 100% of sampled accesses are local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rates was 99.3% but note the "base-page range updates"
and "PTE updates".  The activity with the patch is negligible as were
the number of faults. The small number of pages migrated were related to
shared libraries.  A 2-socket Broadwell showed better results on average
but are not presented for brevity as the performance was similar except
it showed 100% of the sampled NUMA hints were local. The patch holds up
for a 4-socket Haswell, an AMD EPYC and AMD Epyc 2 machine.

For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.

A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency is then reported.

 Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
 Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
 Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
 Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
 Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
 Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
 Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
 Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
 Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
 Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
 Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
 Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
 Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%

For higher worker threads, the differences become negligible but it's
interesting to note the difference in wakeup latency at low utilisation
and mpstat confirms that activity was almost all on one node until
the number of worker threads increase.

Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the aptch
cares about. Similarly, there were report of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small.  For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
2020-01-28 21:36:55 +01:00
Peter Zijlstra (Intel)
ebc0f83c78 timers/nohz: Update NOHZ load in remote tick
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz.  This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.

Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.

[ swood: Added changelog and removed recheck of stopped tick. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
2020-01-28 21:36:44 +01:00
Scott Wood
488603b815 sched/core: Don't skip remote tick for idle CPUs
This will be used in the next patch to get a loadavg update from
nohz cpus.  The delta check is skipped because idle_sched_class
doesn't update se.exec_start.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
2020-01-28 21:36:16 +01:00
Song Liu
07c5972951 perf/cgroups: Install cgroup events to correct cpuctx
cgroup events are always installed in the cpuctx. However, when it is not
installed via IPI, list_update_cgroup_event() adds it to cpuctx of current
CPU, which triggers list corruption:

  [] list_add double add: new=ffff888ff7cf0db0, prev=ffff888ff7ce82f0, next=ffff888ff7cf0db0.

To reproduce this, we can simply run:

  # perf stat -e cs -a &
  # perf stat -e cs -G anycgroup

Fix this by installing it to cpuctx that contains event->ctx, and the
proper cgrp_cpuctx_list.

Fixes: db0503e4f6 ("perf/core: Optimize perf_install_in_event()")
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200122195027.2112449-1-songliubraving@fb.com
2020-01-28 21:20:19 +01:00
Song Liu
003461559e perf/core: Fix mlock accounting in perf_mmap()
Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
a perf ring buffer may lead to an integer underflow in locked memory
accounting. This may lead to the undesired behaviors, such as failures in
BPF map creation.

Address this by adjusting the accounting logic to take into account the
possibility that the amount of already locked memory may exceed the
current limit.

Fixes: c4b7547974 ("perf/core: Make the mlock accounting simple again")
Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
2020-01-28 21:20:18 +01:00
Linus Torvalds
c677124e63 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "These were the main changes in this cycle:

   - More -rt motivated separation of CONFIG_PREEMPT and
     CONFIG_PREEMPTION.

   - Add more low level scheduling topology sanity checks and warnings
     to filter out nonsensical topologies that break scheduling.

   - Extend uclamp constraints to influence wakeup CPU placement

   - Make the RT scheduler more aware of asymmetric topologies and CPU
     capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y

   - Make idle CPU selection more consistent

   - Various fixes, smaller cleanups, updates and enhancements - please
     see the git log for details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  sched/fair: Define sched_idle_cpu() only for SMP configurations
  sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  idle: fix spelling mistake "iterrupts" -> "interrupts"
  sched/fair: Remove redundant call to cpufreq_update_util()
  sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
  sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
  sched/fair: calculate delta runnable load only when it's needed
  sched/cputime: move rq parameter in irqtime_account_process_tick
  stop_machine: Make stop_cpus() static
  sched/debug: Reset watchdog on all CPUs while processing sysrq-t
  sched/core: Fix size of rq::uclamp initialization
  sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
  sched/fair: Load balance aggressively for SCHED_IDLE CPUs
  sched/fair : Improve update_sd_pick_busiest for spare capacity case
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
  sched/rt: Make RT capacity-aware
  sched/fair: Make EAS wakeup placement consider uclamp restrictions
  sched/fair: Make task_fits_capacity() consider uclamp restrictions
  sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
  sched/uclamp: Make uclamp util helpers use and return UL values
  ...
2020-01-28 10:07:09 -08:00
Linus Torvalds
c0e809e244 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Ftrace is one of the last W^X violators (after this only KLP is
     left). These patches move it over to the generic text_poke()
     interface and thereby get rid of this oddity. This requires a
     surprising amount of surgery, by Peter Zijlstra.

   - x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
     count certain types of events that have a special, quirky hw ABI
     (by Kim Phillips)

   - kprobes fixes by Masami Hiramatsu

  Lots of tooling updates as well, the following subcommands were
  updated: annotate/report/top, c2c, clang, record, report/top TUI,
  sched timehist, tests; plus updates were done to the gtk ui, libperf,
  headers and the parser"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  perf/x86/amd: Add support for Large Increment per Cycle Events
  perf/x86/amd: Constrain Large Increment per Cycle events
  perf/x86/intel/rapl: Add Comet Lake support
  tracing: Initialize ret in syscall_enter_define_fields()
  perf header: Use last modification time for timestamp
  perf c2c: Fix return type for histogram sorting comparision functions
  perf beauty sockaddr: Fix augmented syscall format warning
  perf/ui/gtk: Fix gtk2 build
  perf ui gtk: Add missing zalloc object
  perf tools: Use %define api.pure full instead of %pure-parser
  libperf: Setup initial evlist::all_cpus value
  perf report: Fix no libunwind compiled warning break s390 issue
  perf tools: Support --prefix/--prefix-strip
  perf report: Clarify in help that --children is default
  tools build: Fix test-clang.cpp with Clang 8+
  perf clang: Fix build with Clang 9
  kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
  tools lib: Fix builds when glibc contains strlcpy()
  perf report/top: Make 'e' visible in the help and make it toggle showing callchains
  perf report/top: Do not offer annotation for symbols without samples
  ...
2020-01-28 09:44:15 -08:00
Linus Torvalds
2180f214f4 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "Just a handful of changes in this cycle: an ARM64 performance
  optimization, a comment fix and a debug output fix"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/osq: Use optimized spinning loop for arm64
  locking/qspinlock: Fix inaccessible URL of MCS lock paper
  locking/lockdep: Fix lockdep_stats indentation problem
2020-01-28 09:33:25 -08:00
Linus Torvalds
d99391ec2b Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The RCU changes in this cycle were:
   - Expedited grace-period updates
   - kfree_rcu() updates
   - RCU list updates
   - Preemptible RCU updates
   - Torture-test updates
   - Miscellaneous fixes
   - Documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  rcu: Remove unused stop-machine #include
  powerpc: Remove comment about read_barrier_depends()
  .mailmap: Add entries for old paulmck@kernel.org addresses
  srcu: Apply *_ONCE() to ->srcu_last_gp_end
  rcu: Switch force_qs_rnp() to for_each_leaf_node_cpu_mask()
  rcu: Move rcu_{expedited,normal} definitions into rcupdate.h
  rcu: Move gp_state_names[] and gp_state_getname() to tree_stall.h
  rcu: Remove the declaration of call_rcu() in tree.h
  rcu: Fix tracepoint tracking RCU CPU kthread utilization
  rcu: Fix harmless omission of "CONFIG_" from #if condition
  rcu: Avoid tick_dep_set_cpu() misordering
  rcu: Provide wrappers for uses of ->rcu_read_lock_nesting
  rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()
  rcu: Clear ->rcu_read_unlock_special only once
  rcu: Clear .exp_hint only when deferred quiescent state has been reported
  rcu: Rename some instance of CONFIG_PREEMPTION to CONFIG_PREEMPT_RCU
  rcu: Remove kfree_call_rcu_nobatch()
  rcu: Remove kfree_rcu() special casing and lazy-callback handling
  rcu: Add support for debug_objects debugging for kfree_rcu()
  rcu: Add multiple in-flight batches of kfree_rcu() work
  ...
2020-01-28 08:46:13 -08:00
Sebastian Andrzej Siewior
25a3a15417 smp: Remove superfluous cond_func check in smp_call_function_many_cond()
It was requested to remove the cond_func check but the follow up patch was
overlooked. Remove it now.

Fixes: 67719ef25e ("smp: Add a smp_cond_func_t argument to smp_call_function_many()")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200127083915.434tdkztorkklpdu@linutronix.de
2020-01-28 15:43:00 +01:00
Mike Christie
8d19f1c8e1
prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
There are several storage drivers like dm-multipath, iscsi, tcmu-runner,
amd nbd that have userspace components that can run in the IO path. For
example, iscsi and nbd's userspace deamons may need to recreate a socket
and/or send IO on it, and dm-multipath's daemon multipathd may need to
send SG IO or read/write IO to figure out the state of paths and re-set
them up.

In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
memalloc_*_save/restore functions to control the allocation behavior,
but for userspace we would end up hitting an allocation that ended up
writing data back to the same device we are trying to allocate for.
The device is then in a state of deadlock, because to execute IO the
device needs to allocate memory, but to allocate memory the memory
layers want execute IO to the device.

Here is an example with nbd using a local userspace daemon that performs
network IO to a remote server. We are using XFS on top of the nbd device,
but it can happen with any FS or other modules layered on top of the nbd
device that can write out data to free memory.  Here a nbd daemon helper
thread, msgr-worker-1, is performing a write/sendmsg on a socket to execute
a request. This kicks off a reclaim operation which results in a WRITE to
the nbd device and the nbd thread calling back into the mm layer.

[ 1626.609191] msgr-worker-1   D    0  1026      1 0x00004000
[ 1626.609193] Call Trace:
[ 1626.609195]  ? __schedule+0x29b/0x630
[ 1626.609197]  ? wait_for_completion+0xe0/0x170
[ 1626.609198]  schedule+0x30/0xb0
[ 1626.609200]  schedule_timeout+0x1f6/0x2f0
[ 1626.609202]  ? blk_finish_plug+0x21/0x2e
[ 1626.609204]  ? _xfs_buf_ioapply+0x2e6/0x410
[ 1626.609206]  ? wait_for_completion+0xe0/0x170
[ 1626.609208]  wait_for_completion+0x108/0x170
[ 1626.609210]  ? wake_up_q+0x70/0x70
[ 1626.609212]  ? __xfs_buf_submit+0x12e/0x250
[ 1626.609214]  ? xfs_bwrite+0x25/0x60
[ 1626.609215]  xfs_buf_iowait+0x22/0xf0
[ 1626.609218]  __xfs_buf_submit+0x12e/0x250
[ 1626.609220]  xfs_bwrite+0x25/0x60
[ 1626.609222]  xfs_reclaim_inode+0x2e8/0x310
[ 1626.609224]  xfs_reclaim_inodes_ag+0x1b6/0x300
[ 1626.609227]  xfs_reclaim_inodes_nr+0x31/0x40
[ 1626.609228]  super_cache_scan+0x152/0x1a0
[ 1626.609231]  do_shrink_slab+0x12c/0x2d0
[ 1626.609233]  shrink_slab+0x9c/0x2a0
[ 1626.609235]  shrink_node+0xd7/0x470
[ 1626.609237]  do_try_to_free_pages+0xbf/0x380
[ 1626.609240]  try_to_free_pages+0xd9/0x1f0
[ 1626.609245]  __alloc_pages_slowpath+0x3a4/0xd30
[ 1626.609251]  ? ___slab_alloc+0x238/0x560
[ 1626.609254]  __alloc_pages_nodemask+0x30c/0x350
[ 1626.609259]  skb_page_frag_refill+0x97/0xd0
[ 1626.609274]  sk_page_frag_refill+0x1d/0x80
[ 1626.609279]  tcp_sendmsg_locked+0x2bb/0xdd0
[ 1626.609304]  tcp_sendmsg+0x27/0x40
[ 1626.609307]  sock_sendmsg+0x54/0x60
[ 1626.609308]  ___sys_sendmsg+0x29f/0x320
[ 1626.609313]  ? sock_poll+0x66/0xb0
[ 1626.609318]  ? ep_item_poll.isra.15+0x40/0xc0
[ 1626.609320]  ? ep_send_events_proc+0xe6/0x230
[ 1626.609322]  ? hrtimer_try_to_cancel+0x54/0xf0
[ 1626.609324]  ? ep_read_events_proc+0xc0/0xc0
[ 1626.609326]  ? _raw_write_unlock_irq+0xa/0x20
[ 1626.609327]  ? ep_scan_ready_list.constprop.19+0x218/0x230
[ 1626.609329]  ? __hrtimer_init+0xb0/0xb0
[ 1626.609331]  ? _raw_spin_unlock_irq+0xa/0x20
[ 1626.609334]  ? ep_poll+0x26c/0x4a0
[ 1626.609337]  ? tcp_tsq_write.part.54+0xa0/0xa0
[ 1626.609339]  ? release_sock+0x43/0x90
[ 1626.609341]  ? _raw_spin_unlock_bh+0xa/0x20
[ 1626.609342]  __sys_sendmsg+0x47/0x80
[ 1626.609347]  do_syscall_64+0x5f/0x1c0
[ 1626.609349]  ? prepare_exit_to_usermode+0x75/0xa0
[ 1626.609351]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch adds a new prctl command that daemons can use after they have
done their initial setup, and before they start to do allocations that
are in the IO path. It sets the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE
flags so both userspace block and FS threads can use it to avoid the
allocation recursion and try to prevent from being throttled while
writing out data to free up memory.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Masato Suzuki <masato.suzuki@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Link: https://lore.kernel.org/r/20191112001900.9206-1-mchristi@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-28 10:09:51 +01:00
Ingo Molnar
0cc4bd8f70 Merge branch 'core/kprobes' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-28 07:59:05 +01:00
Linus Torvalds
3d3b44a61a The interrupt departement provides:
- A mechanism to shield isolated tasks from managed interrupts:
 
    The affinity of managed interrupts is completely controlled by the
    kernel and user space has no influence on them. The reason is that
    the automatically assigned affinity correlates to the multi-queue
    CPU handling of block devices.
 
    If the generated affinity mask spaws both housekeeping and isolated CPUs
    the interrupt could be routed to an isolated CPU which would then be
    disturbed by I/O submitted by a housekeeping CPU.
 
    The new mechamism ensures that as long as one housekeeping CPU is online
    in the assigned affinity mask the interrupt is routed to a housekeeping
    CPU.
 
    If there is no online housekeeping CPU in the affinity mask, then the
    interrupt is routed to an isolated CPU to keep the device queue intact,
    but unless the isolated CPU submits I/O by itself these interrupts are
    not raised.
 
  - A small addon to the device tree irqdomain core code to avoid
    duplication in irq chip drivers
 
  - Conversion of the SiFive PLIC to hierarchical domains
 
  - The usual pile of new irq chip drivers: SiFive GPIO, Aspeed SCI, NXP
    INTMUX, Meson A1 GPIO
 
  - The first cut of support for the new ARM GICv4.1
 
  - The usual pile of fixes and improvements in core and driver code
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vcbETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoezyEADBPf0ipu5+KeTtCR+DjRAO8o0wM0J/
 JNkRkSrS/qENSda/d6pZE2AWpqlDOs6apg+SNGkv0knM+1Xy94nLOf4zJBsR+GW0
 w2jw68egnyB2QZtm/BvOJL+qCoixcObg5sLt0165pDdKzyDNWeCMtRU+QAw42T/l
 WC2QrhjKKqYST1m+UgDf1UXz8TDGIW4muRP9UiG0Uwc0LU6cG2H4OmGn0bYissaT
 JTG75pzGqUH3kZ1a1qD28nGyoY85BXz1iV5/IvIPaQbkQARbvfMbh1KvAnGhJj7N
 96rjMpOGv2/kv1FI+4FUy6w5Wn4EyW2OaCtB/oUCFNcZvrNNgvglxCRQkkO8yb3D
 VOOm595ICm3EnIfxBpSzhgvVl5MY39g6qRb6Rpnna+8eRtrYnytMBdvhY0OGlG8/
 cZYZDay0nzhY6vq023iw1YMDKqft7TR1R+6w1iPL7nXHXW99Dhv87d1Fjt0CqphD
 NIoNDgxciIyfMbMBvcg1qPe/g3L8+cAKNzGsIwIU9GneEZFBk3/piGcBlFpoEEOK
 2QKvks3QRXMx+qVWkIqy3LZKV9EAQlb9Lpjaa1ec5d4m/EdACm19OpZpqoCljPtw
 9vdaMz4ZxvUbwjih3VnVPklZCiVGiKj1j0iw5v3FCHh4MUljzCrxNMqK/U9CR8H0
 uid3EX8YMi+DXA==
 =E2VR
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "The interrupt departement provides:

   - A mechanism to shield isolated tasks from managed interrupts:

     The affinity of managed interrupts is completely controlled by the
     kernel and user space has no influence on them. The reason is that
     the automatically assigned affinity correlates to the multi-queue
     CPU handling of block devices.

     If the generated affinity mask spaws both housekeeping and isolated
     CPUs the interrupt could be routed to an isolated CPU which would
     then be disturbed by I/O submitted by a housekeeping CPU.

     The new mechamism ensures that as long as one housekeeping CPU is
     online in the assigned affinity mask the interrupt is routed to a
     housekeeping CPU.

     If there is no online housekeeping CPU in the affinity mask, then
     the interrupt is routed to an isolated CPU to keep the device queue
     intact, but unless the isolated CPU submits I/O by itself these
     interrupts are not raised.

   - A small addon to the device tree irqdomain core code to avoid
     duplication in irq chip drivers

   - Conversion of the SiFive PLIC to hierarchical domains

   - The usual pile of new irq chip drivers: SiFive GPIO, Aspeed SCI,
     NXP INTMUX, Meson A1 GPIO

   - The first cut of support for the new ARM GICv4.1

   - The usual pile of fixes and improvements in core and driver code"

* tag 'irq-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  genirq, sched/isolation: Isolate from handling managed interrupts
  irqchip/gic-v4.1: Allow direct invalidation of VLPIs
  irqchip/gic-v4.1: Suppress per-VLPI doorbell
  irqchip/gic-v4.1: Add VPE INVALL callback
  irqchip/gic-v4.1: Add VPE eviction callback
  irqchip/gic-v4.1: Add VPE residency callback
  irqchip/gic-v4.1: Add mask/unmask doorbell callbacks
  irqchip/gic-v4.1: Plumb skeletal VPE irqchip
  irqchip/gic-v4.1: Implement the v4.1 flavour of VMOVP
  irqchip/gic-v4.1: Don't use the VPE proxy if RVPEID is set
  irqchip/gic-v4.1: Implement the v4.1 flavour of VMAPP
  irqchip/gic-v4.1: VPE table (aka GICR_VPROPBASER) allocation
  irqchip/gic-v3: Add GICv4.1 VPEID size discovery
  irqchip/gic-v3: Detect GICv4.1 supporting RVPEID
  irqchip/gic-v3-its: Fix get_vlpi_map() breakage with doorbells
  irqdomain: Fix a memory leak in irq_domain_push_irq()
  irqchip: Add NXP INTMUX interrupt multiplexer support
  dt-bindings: interrupt-controller: Add binding for NXP INTMUX interrupt multiplexer
  irqchip: Define EXYNOS_IRQ_COMBINER
  irqchip/meson-gpio: Add support for meson a1 SoCs
  ...
2020-01-27 17:22:21 -08:00
Linus Torvalds
ab67f60025 A small set of SMP core code changes:
- Rework the smp function call core code to avoid the allocation of an
    additional cpumask.
 
  - Remove the not longer required GFP argument from on_each_cpu_cond() and
    on_each_cpu_cond_mask() and fixup the callers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vcrATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYocr1D/4ptWrZKsgBxGKBP34lvJAjd0KRqVoz
 J9dLAN+AAs6YZSnOmRBX1b9d9IL2PrccOEF+J/Ja3ZkB+PAoAQ9W3uCHkZ77WUph
 xx5eJahZCo+3nZ6amGgS2cPdG8WjxSK3enxPcU4pJhV/QaaP7R9BZt5YQgreYAQO
 kRi0qyt10AExLqLd+077GX5DKcEOXwwVG/qckUQK2h8Kkd68vTbjDxggvsHwmpSE
 MHaszv85UpE+YQbT6DyG5Hi4kK3AJeODBy/fKr2VODIBLZpKiuQ5kK4lbNHYPpVB
 wXw0umXHLQggrKoPKo58ayoCXD0bAG9JT0rvapjUJIz1/9YejQ6lB/t5f0dPbSrU
 al4CJq/pfNky4H6uLWFVbAXJabJuBcB/eG1csaM88Yw0pEXkbnHCOkJAdosoDhhl
 qNQYg4yaE9tTuy1chXDMntH0R0Qztqry6+DMsczJxT21TgERsHCRJV+mGLV46/ZN
 GXJEoJ/cnjNJlqj8GirjbksPRbxuvmQNHRVrTh8qOSxbPKUQZfZocp9HHNmFsBaN
 Q07VgWMHXzYj1L4r3cbJ/ONpOCo66lw7F//MNGk0eIWdeL6H7XZvJQPX+YUrLsZc
 tVlZh8mZOGbRiM8g1dN0BSJO7QrVYmJWGb0oQQtv5tVSRN/V8Y9VZ8YX8lpYlF1e
 ETkrZLGhTJWp4A==
 =M4aK
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core SMP updates from Thomas Gleixner:
 "A small set of SMP core code changes:

   - Rework the smp function call core code to avoid the allocation of
     an additional cpumask

   - Remove the not longer required GFP argument from on_each_cpu_cond()
     and on_each_cpu_cond_mask() and fixup the callers"

* tag 'smp-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Remove allocation mask from on_each_cpu_cond.*()
  smp: Add a smp_cond_func_t argument to smp_call_function_many()
  smp: Use smp_cond_func_t as type for the conditional function
2020-01-27 17:04:51 -08:00
Linus Torvalds
e279160f49 The timekeeping and timers departement provides:
- Time namespace support:
 
     If a container migrates from one host to another then it expects that
     clocks based on MONOTONIC and BOOTTIME are not subject to
     disruption. Due to different boot time and non-suspended runtime these
     clocks can differ significantly on two hosts, in the worst case time
     goes backwards which is a violation of the POSIX requirements.
 
     The time namespace addresses this problem. It allows to set offsets for
     clock MONOTONIC and BOOTTIME once after creation and before tasks are
     associated with the namespace. These offsets are taken into account by
     timers and timekeeping including the VDSO.
 
     Offsets for wall clock based clocks (REALTIME/TAI) are not provided by
     this mechanism. While in theory possible, the overhead and code
     complexity would be immense and not justified by the esoteric potential
     use cases which were discussed at Plumbers '18.
 
     The overhead for tasks in the root namespace (host time offsets = 0) is
     in the noise and great effort was made to ensure that especially in the
     VDSO. If time namespace is disabled in the kernel configuration the
     code is compiled out.
 
     Kudos to Andrei Vagin and Dmitry Sofanov who implemented this feature
     and kept on for more than a year addressing review comments, finding
     better solutions. A pleasant experience.
 
   - Overhaul of the alarmtimer device dependency handling to ensure that
     the init/suspend/resume ordering is correct.
 
   - A new clocksource/event driver for Microchip PIT64
 
   - Suspend/resume support for the Hyper-V clocksource
 
   - The usual pile of fixes, updates and improvements mostly in the
     driver code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vbTcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXT2D/96iJ3G9Snn2khEQP3XS2rYmtDGw7NO
 m1n96falwWeGe6zreU80R2Jge5nLxQtNhRoMPLLee1GpHwRC6lvqEqgdZ4LMBrD2
 JqV7Gzg8Urmdh+hpDsyTCpeEWEzoMKxiFOX8PxwctqUhM4szEe5iQg2YQsg85Jw2
 vG6M93N2xwDILh4rhEMbKjo+5ZmYn7c1RQvpGOSmpKOj940W/N7H2HBsFhdaJ1Kw
 FW5pFv1211PaU5RV2YNb2dMeeMTT1N3e2VN4Dkadoxp47pb+725gNHEBEjmV9poG
 Lp4IhzGAPnj8zVD88icQZSTaK3gUHMClxprJ0Pf84WEtiH7SeGu8BPYyu77+oNDe
 yzcctDJNyCWXkzmaP/fe/HLc0TStbvNAJ5Tagp4BC75gzebeb4/n8RtRT0fKeDYL
 pxpDPKDAPU7p1JSjxiWAtshqjBycWNY3Z49bA7/VhKBhnv8BDyBPGlYd7/4xrbGr
 RK7DQNXJwaJaiNJ7p5PiaFxGzNyB0B9sThD/slSlEInIKb4h9YzWr0TV+NB62VnB
 sDcN+tpLbRPz5/5cHGGfxR0+zKWpfyai8pzbmmaXEaKssjRYwyvcac5EZdgbWpbK
 k7CqAjoWLA2P+tGeePNJOf5JYK6Vmdyh4clmuwM0zOiRJ9NlWUyMf3z7QYILs4RO
 UAI+6opYlZEPAw==
 =x3qT
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "The timekeeping and timers departement provides:

   - Time namespace support:

     If a container migrates from one host to another then it expects
     that clocks based on MONOTONIC and BOOTTIME are not subject to
     disruption. Due to different boot time and non-suspended runtime
     these clocks can differ significantly on two hosts, in the worst
     case time goes backwards which is a violation of the POSIX
     requirements.

     The time namespace addresses this problem. It allows to set offsets
     for clock MONOTONIC and BOOTTIME once after creation and before
     tasks are associated with the namespace. These offsets are taken
     into account by timers and timekeeping including the VDSO.

     Offsets for wall clock based clocks (REALTIME/TAI) are not provided
     by this mechanism. While in theory possible, the overhead and code
     complexity would be immense and not justified by the esoteric
     potential use cases which were discussed at Plumbers '18.

     The overhead for tasks in the root namespace (ie where host time
     offsets = 0) is in the noise and great effort was made to ensure
     that especially in the VDSO. If time namespace is disabled in the
     kernel configuration the code is compiled out.

     Kudos to Andrei Vagin and Dmitry Sofanov who implemented this
     feature and kept on for more than a year addressing review
     comments, finding better solutions. A pleasant experience.

   - Overhaul of the alarmtimer device dependency handling to ensure
     that the init/suspend/resume ordering is correct.

   - A new clocksource/event driver for Microchip PIT64

   - Suspend/resume support for the Hyper-V clocksource

   - The usual pile of fixes, updates and improvements mostly in the
     driver code"

* tag 'timers-core-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
  alarmtimer: Use wakeup source from alarmtimer platform device
  alarmtimer: Make alarmtimer platform device child of RTC device
  alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
  hrtimer: Add missing sparse annotation for __run_timer()
  lib/vdso: Only read hrtimer_res when needed in __cvdso_clock_getres()
  MIPS: vdso: Define BUILD_VDSO32 when building a 32bit kernel
  clocksource/drivers/hyper-v: Set TSC clocksource as default w/ InvariantTSC
  clocksource/drivers/hyper-v: Untangle stimers and timesync from clocksources
  clocksource/drivers/timer-microchip-pit64b: Fix sparse warning
  clocksource/drivers/exynos_mct: Rename Exynos to lowercase
  clocksource/drivers/timer-ti-dm: Fix uninitialized pointer access
  clocksource/drivers/timer-ti-dm: Switch to platform_get_irq
  clocksource/drivers/timer-ti-dm: Convert to devm_platform_ioremap_resource
  clocksource/drivers/em_sti: Fix variable declaration in em_sti_probe
  clocksource/drivers/em_sti: Convert to devm_platform_ioremap_resource
  clocksource/drivers/bcm2835_timer: Fix memory leak of timer
  clocksource/drivers/cadence-ttc: Use ttc driver as platform driver
  clocksource/drivers/timer-microchip-pit64b: Add Microchip PIT64B support
  clocksource/drivers/hyper-v: Reserve PAGE_SIZE space for tsc page
  ...
2020-01-27 16:47:05 -08:00
Linus Torvalds
b11c89a158 A set of watchdog/softlockup related improvements:
- Enforce that the watchdog timestamp is always valid on boot. The
    original implementation caused a watchdog disabled gap of one second in
    the boot process due to truncation of the underlying sched clock. The
    sched clock is divided by 1e9 to convert nanoseconds to seconds. So for
    the first second of the boot process the result is 0 which is at the
    same time the indicator to disable the watchdog. The trivial fix is to
    change the disabled indicator to ULONG_MAX.
 
  - Two cleanup patches removing unused and redundant code which got
    forgotten to be cleaned up in previous changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vbrQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTQHD/9ONyg9VQLjk6aH94H1Sjik/K7zvxoC
 aMGY2onZ6PddVrcTgJoMmWteQlQ2YScCSVnfVedmxTRU8laEHU/LQnMntTAbuHWj
 VUkK8X/AI5l+VY6p0Sr1iCyxcFezoC2VMqOKntuQl3080mK7R7/fQ+ZVmimiPihr
 46qMikIfBN7w2od7Ger3dZRttbnRj5YsmLBenX/HtBY/HPdhoDx6lfW/5AbAgUH5
 qnAmM0yPZ/VUSfo45z+exESUezxByIkGsrROBtPSRwql3Oqbyrza2UC48dRjsuIQ
 vO0coorlhqJGF72WW45DiLvg4Hew/vVyzcYrIiOSQPZpeTtPzL23zk/cqcqpKy6N
 pCuiSgimzbPgzqTHs6WQR/D0Dn76rruUqXqteuD5zirC9Kjf2TWeIMPTgPfy8irt
 2RwT1+5Ao/SNkdm/Pxk0S/+Y99uRJSqeNTV3lroYGC7IFMAnG4P0S9uyFJ6ZFIMz
 nOvEOhUlFXWw/w7WPZv+ytx40sRkqFVIePSRtzq+cjlDEYCgLhuveE2A4/6IGPMP
 Ej6vsGh3lMyHieRhmymESG8uLU2P/L7hhPexUPJJu4QSxKbKQNfWx+0z7bm86Ic7
 0uDSNZZl7UDYq6tioS1DBTq9ybly9vn1WDe5tHMJDllPe9TIEnqynvVLIg6MMGdm
 GjbTNysDPx85yw==
 =WMiM
 -----END PGP SIGNATURE-----

Merge tag 'core-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull watchdog updates from Thomas Gleixner:
 "A set of watchdog/softlockup related improvements:

   - Enforce that the watchdog timestamp is always valid on boot. The
     original implementation caused a watchdog disabled gap of one
     second in the boot process due to truncation of the underlying
     sched clock.

     The sched clock is divided by 1e9 to convert nanoseconds to
     seconds. So for the first second of the boot process the result is
     0 which is at the same time the indicator to disable the watchdog.

     The trivial fix is to change the disabled indicator to ULONG_MAX.

   - Two cleanup patches removing unused and redundant code which got
     forgotten to be cleaned up in previous changes"

* tag 'core-core-2020-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  watchdog/softlockup: Enforce that timestamp is valid on boot
  watchdog/softlockup: Remove obsolete check of last reported task
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
2020-01-27 16:42:11 -08:00
Linus Torvalds
a56c41e5d7 Two fixes for the generic VDSO code which missed 5.5:
- Make the update to the coarse timekeeper unconditional. This is required
    because the coarse timekeeper interfaces in the VDSO do not depend on a
    VDSO capable clocksource. If the system does not have a VDSO capable
    clocksource and the update is depending on the VDSO capable clocksource,
    the coarse VDSO interfaces would operate on stale data forever.
 
  - Invert the logic of __arch_update_vdso_data() to avoid further head
    scratching. Tripped over this several times while analyzing the update
    problem above.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl4vXzUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodbPD/4km+XOhsbefcn1Xo6SAQV9akPhKSHY
 h1gfjpe4UD+Uj4WfmpERHcCJA3sYtZSjNyEWkwagH1XjB+rcLc3JE8XvhPCZTXCx
 g/OQlww1ef6mBZ5nslpPUZs8i0HppoV7Sa955QxR/jWuOIEssg5c+XGqP8xX8AhX
 TqBOUcJd0LhqCGt76Gb6LHnOEshE8e6ptZ0xayzMZsab3LJTEaJCrsoDpADQ1q8A
 hMjiL3CG9/e12qKYhODFTbyc/wgyGQYK8g6sb9E1Twd2Tw2+ikRbtZuQd3HQv4jV
 SiVtmMqLu6IH+G608zeNIn/67/WX9zYqUZ3fZgSjBwXWoB84Gyj11KLnjmCgS6SH
 0ddOQKPn8VyQc2anG4obRtMNB+TjJvGnB4QSL2ROJB7Zx6EYMsduhXwIbaNZDDro
 nIh6Xvl6iyb0lkhd9zCR7ak7UHJg4ECJsVKK3kAMIHJM4f53d/DwT+ZaHbJZa/2a
 OLoBGpBkJoE1X40dXou+0FUyUFRla42+ho99nCU580EyK/ZAuZEqKjjez9QIh4vN
 L/I6uEHGBw9myB40nb0DFhRIFR97BUkRTRA3VhyX0CYIE3gUL43zNFsdvcugsxRy
 4/Cf7tqhQcSjYjJxpLTRRWt2t6QvDoWfTnrwiPqSepcO17uV8WHLrxK4mT2i8Vjc
 PIq7OgZlp09gQA==
 =ONO4
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two fixes for the generic VDSO code which missed 5.5:

   - Make the update to the coarse timekeeper unconditional.

     This is required because the coarse timekeeper interfaces in the
     VDSO do not depend on a VDSO capable clocksource. If the system
     does not have a VDSO capable clocksource and the update is
     depending on the VDSO capable clocksource, the coarse VDSO
     interfaces would operate on stale data forever.

   - Invert the logic of __arch_update_vdso_data() to avoid further head
     scratching.

     Tripped over this several times while analyzing the update problem
     above"

* tag 'timers-urgent-2020-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lib/vdso: Update coarse timekeeper unconditionally
  lib/vdso: Make __arch_update_vdso_data() logic understandable
2020-01-27 16:37:40 -08:00
Linus Torvalds
07e309a972 audit/stable-5.6 PR 20200127
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl4vRtMUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXM6rw//RXPHJ+U1gjtC5kWQX66/HxEwSY3c
 M236UiJD+xbEHKWpViFd6S7YzHQCkqEO2UvMSwMFP0aL2D56nhkEIKblQJ5sLSK9
 3kNq/7wmxZgCj+/YrGeCiFFWpgSj/PiNB+VDouUkEkT5ZtKamA63qzhqEAUY995L
 vlZVgE8Cpu92JKJKZXKOnlJ+gYh3icFXKbWp0Lk9mmte4RiJ/zsFo+rRou5TzrMm
 30D3A9p9A7sC3jMeRQCowE5UwTkdOeknRi1b4obAGAajuaA+/HtL7bUj8rVwjJXl
 bpX/wShrZDb+dc0NGLQikhzDV/i3qn1DzMbSMuJL/1tf9Jv5lzoJ0/14RkBzd5sm
 pPFA/tUs/3NlPKEyZluA7W21LOUdWk4UxeOJkysJLjfYvsVDg02yFS3qYaZRPaSa
 B3Ex36drCfQfMpMH4Nglh1iDl5oOIoAwn4mSCtirAw6YYG/sW6YnBEnloNYFfahs
 b4/xPhzKfzLtKdc+4yUSbTlIUU+GAdCLxPlp2IvRgqfa9oTATIRP9DY70//V3myN
 PGnCLCu10ag47fJWV4mNetYUv6BR22dvLLX8igcfYmIS3zYM0lEWEz7SOaRuPBdf
 QqAHMNaDCY6z8aEFr+aXW6kr2SP3ycqdvv+b+CbfX1Z7R7wZ8iG3uRyaQHEGPvN2
 zje4VYJQcJs+EXE=
 =tPy4
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit update from Paul Moore:
 "One small audit patch for the Linux v5.6 merge window, and
  unsurprisingly it passes our test suite with flying colors"

* tag 'audit-pr-20200127' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: Add __rcu annotation to RCU pointer
2020-01-27 15:35:50 -08:00
Linus Torvalds
03aa8c8cfa Merge branch 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - cgroup2 interface for hugetlb controller. I think this was the last
   remaining bit which was missing from cgroup2

 - fixes for race and a spurious warning in threaded cgroup handling

 - other minor changes

* 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  iocost: Fix iocost_monitor.py due to helper type mismatch
  cgroup: Prevent double killing of css when enabling threaded cgroup
  cgroup: fix function name in comment
  mm: hugetlb controller for cgroups v2
2020-01-27 15:18:25 -08:00
Linus Torvalds
16d06120d7 Merge branch 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Just a couple tracepoint patches"

* 'for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: remove workqueue_work event class
  workqueue: add worker function to workqueue_execute_end tracepoint
2020-01-27 15:16:52 -08:00
Linus Torvalds
6d277aca48 Power management updates for 5.6-rc1
- Update the ACPI processor driver in order to export
    acpi_processor_evaluate_cst() to the code outside of it, add
    ACPI support to the intel_idle driver based on that and clean
    up that driver somewhat (Rafael Wysocki).
 
  - Add an admin guide document for the intel_idle driver (Rafael
    Wysocki).
 
  - Clean up cpuidle core and drivers, enable compilation testing
    for some of them (Benjamin Gaignard, Krzysztof Kozlowski, Rafael
    Wysocki, Yangtao Li).
 
  - Fix reference counting of OPP (operating performance points) table
    structures (Viresh Kumar).
 
  - Add support for CPR (Core Power Reduction) to the AVS (Adaptive
    Voltage Scaling) subsystem (Niklas Cassel, Colin Ian King,
    YueHaibing).
 
  - Add support for TigerLake Mobile and JasperLake to the Intel RAPL
    power capping driver (Zhang Rui).
 
  - Update cpufreq drivers:
 
    * Add i.MX8MP support to imx-cpufreq-dt (Anson Huang).
 
    * Fix usage of a macro in loongson2_cpufreq (Alexandre Oliva).
 
    * Fix cpufreq policy reference counting issues in s3c and
      brcmstb-avs (chenqiwu).
 
    * Fix ACPI table reference counting issue and HiSilicon quirk
      handling in the CPPC driver (Hanjun Guo).
 
    * Clean up spelling mistake in intel_pstate (Harry Pan).
 
    * Convert the kirkwood and tegra186 drivers to using
      devm_platform_ioremap_resource() (Yangtao Li).
 
  - Update devfreq core:
 
    * Add 'name' sysfs attribute for devfreq devices (Chanwoo Choi).
 
    * Clean up the handing of transition statistics and allow them
      to be reset by writing 0 to the 'trans_stat' devfreq device
      attribute in sysfs (Kamil Konieczny).
 
    * Add 'devfreq_summary' to debugfs (Chanwoo Choi).
 
    * Clean up kerneldoc comments and Kconfig indentation (Krzysztof
      Kozlowski, Randy Dunlap).
 
  - Update devfreq drivers:
 
    * Add dynamic scaling for the imx8m DDR controller and clean up
      imx8m-ddrc (Leonard Crestez, YueHaibing).
 
    * Fix DT node reference counting and nitialization error code path
      in rk3399_dmc and add COMPILE_TEST and HAVE_ARM_SMCCC dependency
      for it (Chanwoo Choi, Yangtao Li).
 
    * Fix DT node reference counting in rockchip-dfi and make it use
      devm_platform_ioremap_resource() (Yangtao Li).
 
    * Fix excessive stack usage in exynos-ppmu (Arnd Bergmann).
 
    * Fix initialization error code paths in exynos-bus (Yangtao Li).
 
    * Clean up exynos-bus and exynos somewhat (Artur Świgoń, Krzysztof
      Kozlowski).
 
  - Add tracepoints for tracking usage_count updates unrelated to
    status changes in PM-runtime (Michał Mirosław).
 
  - Add sysfs attribute to control the "sync on suspend" behavior
    during system-wide suspend (Jonas Meurer).
 
  - Switch system-wide suspend tests over to 64-bit time (Alexandre
    Belloni).
 
  - Make wakeup sources statistics in debugfs cover deleted ones which
    used to be the case some time ago (zhuguangqing).
 
  - Clean up computations carried out during hibernation, update
    messages related to hibernation and fix a spelling mistake in one
    of them (Wen Yang, Luigi Semenzato, Colin Ian King).
 
  - Add mailmap entry for maintainer e-mail address that has not been
    functional for several years (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl4u2fESHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxvlkP/j5vDzyNUNJjnD6+897c8W+z5dwdiQfU
 QNtoopFXgw/fpOhGXRdj2mA4e6RtpU9aCCiHR6/qdh3/1qSnR5Y9R/51/gmdkwhY
 YakSxmgpgGrOJru94ApI1o/35eWwN/GxjajbfNY5ScrPQl/L0DF3iJWRsAOR5534
 p9e2gQqKecoE+MEn5JcGAXApA5xBLXuUmtWPUn5UGyhaz+jdmsf1zkDEOEvxREay
 hLGH1y6BY8HS/jytyNzISs9iDeBvg2fHmG8SskDiXVMke5sHBTU9MilgpnCFfQ0l
 OF/eNnTXTU7mAJhlnjBUt2rIe5peGSuhgg+Ur7s86xYqbj2SfsVM4UHjU0A6t9Jm
 sauWQh/Nbzw6XaCNzYKxP+dREAg0g/aq7xFqQi3bWx7YvzLk/hvNWi2+bv3adzx7
 Z3fvOki4xMXzLLrh0f1ipC8BKTsdioDZPAy06B80a0luv6ROdr6bPL7did14mWt2
 eCuPuZyXKhdV+PkjZHF+c4XT7N9NfGtE0WUQf54Q4VT00hDagGDliwXpm4ht1pjJ
 iO7uUJevXKSxMaV2xPZ+nWZaOeCVrMMTA1Ec1ELgC1n8WROZJ+SfhehgMQGp7BHS
 Hz4QO1HjTsCDnT+OU7JFeCRrkyXIlh75MOndWOOH6eTEXCAI9PihstB+UGXeNsK0
 BesNQz1sYY1O
 =g48u
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add ACPI support to the intel_idle driver along with an admin
  guide document for it, add support for CPR (Core Power Reduction) to
  the AVS (Adaptive Voltage Scaling) subsystem, add new hardware support
  in a few places, add some new sysfs attributes, debugfs files and
  tracepoints, fix bugs and clean up a bunch of things all over.

  Specifics:

   - Update the ACPI processor driver in order to export
     acpi_processor_evaluate_cst() to the code outside of it, add ACPI
     support to the intel_idle driver based on that and clean up that
     driver somewhat (Rafael Wysocki).

   - Add an admin guide document for the intel_idle driver (Rafael
     Wysocki).

   - Clean up cpuidle core and drivers, enable compilation testing for
     some of them (Benjamin Gaignard, Krzysztof Kozlowski, Rafael
     Wysocki, Yangtao Li).

   - Fix reference counting of OPP (operating performance points) table
     structures (Viresh Kumar).

   - Add support for CPR (Core Power Reduction) to the AVS (Adaptive
     Voltage Scaling) subsystem (Niklas Cassel, Colin Ian King,
     YueHaibing).

   - Add support for TigerLake Mobile and JasperLake to the Intel RAPL
     power capping driver (Zhang Rui).

   - Update cpufreq drivers:
      - Add i.MX8MP support to imx-cpufreq-dt (Anson Huang).
      - Fix usage of a macro in loongson2_cpufreq (Alexandre Oliva).
      - Fix cpufreq policy reference counting issues in s3c and
        brcmstb-avs (chenqiwu).
      - Fix ACPI table reference counting issue and HiSilicon quirk
        handling in the CPPC driver (Hanjun Guo).
      - Clean up spelling mistake in intel_pstate (Harry Pan).
      - Convert the kirkwood and tegra186 drivers to using
        devm_platform_ioremap_resource() (Yangtao Li).

   - Update devfreq core:
      - Add 'name' sysfs attribute for devfreq devices (Chanwoo Choi).
      - Clean up the handing of transition statistics and allow them to
        be reset by writing 0 to the 'trans_stat' devfreq device
        attribute in sysfs (Kamil Konieczny).
      - Add 'devfreq_summary' to debugfs (Chanwoo Choi).
      - Clean up kerneldoc comments and Kconfig indentation (Krzysztof
        Kozlowski, Randy Dunlap).

   - Update devfreq drivers:
      - Add dynamic scaling for the imx8m DDR controller and clean up
        imx8m-ddrc (Leonard Crestez, YueHaibing).
      - Fix DT node reference counting and nitialization error code path
        in rk3399_dmc and add COMPILE_TEST and HAVE_ARM_SMCCC dependency
        for it (Chanwoo Choi, Yangtao Li).
      - Fix DT node reference counting in rockchip-dfi and make it use
        devm_platform_ioremap_resource() (Yangtao Li).
      - Fix excessive stack usage in exynos-ppmu (Arnd Bergmann).
      - Fix initialization error code paths in exynos-bus (Yangtao Li).
      - Clean up exynos-bus and exynos somewhat (Artur Świgoń, Krzysztof
        Kozlowski).

   - Add tracepoints for tracking usage_count updates unrelated to
     status changes in PM-runtime (Michał Mirosław).

   - Add sysfs attribute to control the "sync on suspend" behavior
     during system-wide suspend (Jonas Meurer).

   - Switch system-wide suspend tests over to 64-bit time (Alexandre
     Belloni).

   - Make wakeup sources statistics in debugfs cover deleted ones which
     used to be the case some time ago (zhuguangqing).

   - Clean up computations carried out during hibernation, update
     messages related to hibernation and fix a spelling mistake in one
     of them (Wen Yang, Luigi Semenzato, Colin Ian King).

   - Add mailmap entry for maintainer e-mail address that has not been
     functional for several years (Rafael Wysocki)"

* tag 'pm-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (83 commits)
  cpufreq: loongson2_cpufreq: adjust cpufreq uses of LOONGSON_CHIPCFG
  intel_idle: Clean up irtl_2_usec()
  intel_idle: Move 3 functions closer to their callers
  intel_idle: Annotate initialization code and data structures
  intel_idle: Move and clean up intel_idle_cpuidle_devices_uninit()
  intel_idle: Rearrange intel_idle_cpuidle_driver_init()
  intel_idle: Clean up NULL pointer check in intel_idle_init()
  intel_idle: Fold intel_idle_probe() into intel_idle_init()
  intel_idle: Eliminate __setup_broadcast_timer()
  cpuidle: fix cpuidle_find_deepest_state() kerneldoc warnings
  cpuidle: sysfs: fix warnings when compiling with W=1
  cpuidle: coupled: fix warnings when compiling with W=1
  cpufreq: brcmstb-avs: fix imbalance of cpufreq policy refcount
  PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
  PM / devfreq: Add debugfs support with devfreq_summary file
  Documentation: admin-guide: PM: Add intel_idle document
  cpuidle: arm: Enable compile testing for some of drivers
  PM-runtime: add tracepoints for usage_count changes
  cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"
  PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
  ...
2020-01-27 11:23:54 -08:00
Linus Torvalds
0238d3c753 arm64 updates for 5.6
- New architecture features
 	* Support for Armv8.5 E0PD, which benefits KASLR in the same way as
 	  KPTI but without the overhead. This allows KPTI to be disabled on
 	  CPUs that are not affected by Meltdown, even is KASLR is enabled.
 
 	* Initial support for the Armv8.5 RNG instructions, which claim to
 	  provide access to a high bandwidth, cryptographically secure hardware
 	  random number generator. As well as exposing these to userspace, we
 	  also use them as part of the KASLR seed and to seed the crng once
 	  all CPUs have come online.
 
 	* Advertise a bunch of new instructions to userspace, including support
 	  for Data Gathering Hint, Matrix Multiply and 16-bit floating point.
 
 - Kexec
 	* Cleanups in preparation for relocating with the MMU enabled
 	* Support for loading crash dump kernels with kexec_file_load()
 
 - Perf and PMU drivers
 	* Cleanups and non-critical fixes for a couple of system PMU drivers
 
 - FPU-less (aka broken) CPU support
 	* Considerable fixes to support CPUs without the FP/SIMD extensions,
 	  including their presence in heterogeneous systems. Good luck finding
 	  a 64-bit userspace that handles this.
 
 - Modern assembly function annotations
 	* Start migrating our use of ENTRY() and ENDPROC() over to the
 	  new-fangled SYM_{CODE,FUNC}_{START,END} macros, which are intended to
 	  aid debuggers
 
 - Kbuild
 	* Cleanup detection of LSE support in the assembler by introducing
 	  'as-instr'
 
 	* Remove compressed Image files when building clean targets
 
 - IP checksumming
 	* Implement optimised IPv4 checksumming routine when hardware offload
 	  is not in use. An IPv6 version is in the works, pending testing.
 
 - Hardware errata
 	* Work around Cortex-A55 erratum #1530923
 
 - Shadow call stack
 	* Work around some issues with Clang's integrated assembler not liking
 	  our perfectly reasonable assembly code
 
 	* Avoid allocating the X18 register, so that it can be used to hold the
 	  shadow call stack pointer in future
 
 - ACPI
 	* Fix ID count checking in IORT code. This may regress broken firmware
 	  that happened to work with the old implementation, in which case we'll
 	  have to revert it and try something else
 
 	* Fix DAIF corruption on return from GHES handler with pseudo-NMIs
 
 - Miscellaneous
 	* Whitelist some CPUs that are unaffected by Spectre-v2
 
 	* Reduce frequency of ASID rollover when KPTI is compiled in but
 	  inactive
 
 	* Reserve a couple of arch-specific PROT flags that are already used by
 	  Sparc and PowerPC and are planned for later use with BTI on arm64
 
 	* Preparatory cleanup of our entry assembly code in preparation for
 	  moving more of it into C later on
 
 	* Refactoring and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl4oY+IQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNNfRB/4p3vax0hqaOnLRvmJPRXF31B8oPlivnr2u
 6HCA9LkdU5IlrgaTNOJ/sQEqJAPOPCU7v49Ol0iYw0iKL1suUE7Ikui5VB6Uybqt
 YbfF5UNzfXAMs2A86TF/hzqhxw+W+lpnZX8NVTuQeAODfHEGUB1HhTLfRi9INsER
 wKEAuoZyuSUibxTFvji+DAq7nVRniXX7CM7tE385pxDisCMuu/7E5wOl+3EZYXWz
 DTGzTbHXuVFL+UFCANFEUlAtmr3dQvPFIqAwVl/CxjRJjJ7a+/G3cYLsHFPrQCjj
 qYX4kfhAeeBtqmHL7YFNWFwFs5WaT5UcQquFO665/+uCTWSJpORY
 =AIh/
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The changes are a real mixed bag this time around.

  The only scary looking one from the diffstat is the uapi change to
  asm-generic/mman-common.h, but this has been acked by Arnd and is
  actually just adding a pair of comments in an attempt to prevent
  allocation of some PROT values which tend to get used for
  arch-specific purposes. We'll be using them for Branch Target
  Identification (a CFI-like hardening feature), which is currently
  under review on the mailing list.

  New architecture features:

   - Support for Armv8.5 E0PD, which benefits KASLR in the same way as
     KPTI but without the overhead. This allows KPTI to be disabled on
     CPUs that are not affected by Meltdown, even is KASLR is enabled.

   - Initial support for the Armv8.5 RNG instructions, which claim to
     provide access to a high bandwidth, cryptographically secure
     hardware random number generator. As well as exposing these to
     userspace, we also use them as part of the KASLR seed and to seed
     the crng once all CPUs have come online.

   - Advertise a bunch of new instructions to userspace, including
     support for Data Gathering Hint, Matrix Multiply and 16-bit
     floating point.

  Kexec:

   - Cleanups in preparation for relocating with the MMU enabled

   - Support for loading crash dump kernels with kexec_file_load()

  Perf and PMU drivers:

   - Cleanups and non-critical fixes for a couple of system PMU drivers

  FPU-less (aka broken) CPU support:

   - Considerable fixes to support CPUs without the FP/SIMD extensions,
     including their presence in heterogeneous systems. Good luck
     finding a 64-bit userspace that handles this.

  Modern assembly function annotations:

   - Start migrating our use of ENTRY() and ENDPROC() over to the
     new-fangled SYM_{CODE,FUNC}_{START,END} macros, which are intended
     to aid debuggers

  Kbuild:

   - Cleanup detection of LSE support in the assembler by introducing
     'as-instr'

   - Remove compressed Image files when building clean targets

  IP checksumming:

   - Implement optimised IPv4 checksumming routine when hardware offload
     is not in use. An IPv6 version is in the works, pending testing.

  Hardware errata:

   - Work around Cortex-A55 erratum #1530923

  Shadow call stack:

   - Work around some issues with Clang's integrated assembler not
     liking our perfectly reasonable assembly code

   - Avoid allocating the X18 register, so that it can be used to hold
     the shadow call stack pointer in future

  ACPI:

   - Fix ID count checking in IORT code. This may regress broken
     firmware that happened to work with the old implementation, in
     which case we'll have to revert it and try something else

   - Fix DAIF corruption on return from GHES handler with pseudo-NMIs

  Miscellaneous:

   - Whitelist some CPUs that are unaffected by Spectre-v2

   - Reduce frequency of ASID rollover when KPTI is compiled in but
     inactive

   - Reserve a couple of arch-specific PROT flags that are already used
     by Sparc and PowerPC and are planned for later use with BTI on
     arm64

   - Preparatory cleanup of our entry assembly code in preparation for
     moving more of it into C later on

   - Refactoring and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (73 commits)
  arm64: acpi: fix DAIF manipulation with pNMI
  arm64: kconfig: Fix alignment of E0PD help text
  arm64: Use v8.5-RNG entropy for KASLR seed
  arm64: Implement archrandom.h for ARMv8.5-RNG
  arm64: kbuild: remove compressed images on 'make ARCH=arm64 (dist)clean'
  arm64: entry: Avoid empty alternatives entries
  arm64: Kconfig: select HAVE_FUTEX_CMPXCHG
  arm64: csum: Fix pathological zero-length calls
  arm64: entry: cleanup sp_el0 manipulation
  arm64: entry: cleanup el0 svc handler naming
  arm64: entry: mark all entry code as notrace
  arm64: assembler: remove smp_dmb macro
  arm64: assembler: remove inherit_daif macro
  ACPI/IORT: Fix 'Number of IDs' handling in iort_id_map()
  mm: Reserve asm-generic prot flags 0x10 and 0x20 for arch use
  arm64: Use macros instead of hard-coded constants for MAIR_EL1
  arm64: Add KRYO{3,4}XX CPU cores to spectre-v2 safe list
  arm64: kernel: avoid x18 in __cpu_soft_restart
  arm64: kvm: stop treating register x18 as caller save
  arm64/lib: copy_page: avoid x18 register in assembler code
  ...
2020-01-27 08:58:19 -08:00
Steven Rostedt (VMware)
20279420ae tracing/kprobes: Have uname use __get_str() in print_fmt
Thomas Richter reported:

> Test case 66 'Use vfs_getname probe to get syscall args filenames'
> is broken on s390, but works on x86. The test case fails with:
>
>  [root@m35lp76 perf]# perf test -F 66
>  66: Use vfs_getname probe to get syscall args filenames
>            :Recording open file:
>  [ perf record: Woken up 1 times to write data ]
>  [ perf record: Captured and wrote 0.004 MB /tmp/__perf_test.perf.data.TCdYj\
> 	 (20 samples) ]
>  Looking at perf.data file for vfs_getname records for the file we touched:
>   FAILED!
>   [root@m35lp76 perf]#

The root cause was the print_fmt of the kprobe event that referenced the
"ustring"

> Setting up the kprobe event using perf command:
>
>  # ./perf probe "vfs_getname=getname_flags:72 pathname=filename:ustring"
>
> generates this format file:
>   [root@m35lp76 perf]# cat /sys/kernel/debug/tracing/events/probe/\
> 	  vfs_getname/format
>   name: vfs_getname
>   ID: 1172
>   format:
>     field:unsigned short common_type; offset:0; size:2; signed:0;
>     field:unsigned char common_flags; offset:2; size:1; signed:0;
>     field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
>     field:int common_pid; offset:4; size:4; signed:1;
>
>     field:unsigned long __probe_ip; offset:8; size:8; signed:0;
>     field:__data_loc char[] pathname; offset:16; size:4; signed:1;
>
>     print fmt: "(%lx) pathname=\"%s\"", REC->__probe_ip, REC->pathname

Instead of using "__get_str(pathname)" it referenced it directly.

Link: http://lkml.kernel.org/r/20200124100742.4050c15e@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 88903c4643 ("tracing/probe: Add ustring type for user-space string")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-27 10:56:02 -05:00
David S. Miller
9e0703a265 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-01-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 20 non-merge commits during the last 5 day(s) which contain
a total of 24 files changed, 433 insertions(+), 104 deletions(-).

The main changes are:

1) Make BPF trampolines and dispatcher aware for the stack unwinder, from Jiri Olsa.

2) Improve handling of failed CO-RE relocations in libbpf, from Andrii Nakryiko.

3) Several fixes to BPF sockmap and reuseport selftests, from Lorenz Bauer.

4) Various cleanups in BPF devmap's XDP flush code, from John Fastabend.

5) Fix BPF flow dissector when used with port ranges, from Yoshiki Komachi.

6) Fix bpffs' map_seq_next callback to always inc position index, from Vasily Averin.

7) Allow overriding LLVM tooling for runqslower utility, from Andrey Ignatov.

8) Silence false-positive lockdep splats in devmap hash lookup, from Amol Grover.

9) Fix fentry/fexit selftests to initialize a variable before use, from John Sperbeck.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 14:31:40 +01:00
Rafael J. Wysocki
245224d1cb Merge branches 'pm-cpufreq' and 'pm-sleep'
* pm-cpufreq:
  cpufreq: loongson2_cpufreq: adjust cpufreq uses of LOONGSON_CHIPCFG
  cpufreq: brcmstb-avs: fix imbalance of cpufreq policy refcount
  cpufreq: intel_pstate: fix spelling mistake: "Whethet" -> "Whether"
  cpufreq: s3c: fix unbalances of cpufreq policy refcount
  cpufreq: imx-cpufreq-dt: Add i.MX8MP support
  cpufreq: Use imx-cpufreq-dt for i.MX8MP's speed grading
  cpufreq: tegra186: convert to devm_platform_ioremap_resource
  cpufreq: kirkwood: convert to devm_platform_ioremap_resource
  cpufreq: CPPC: put ACPI table after using it
  cpufreq : CPPC: Break out if HiSilicon CPPC workaround is matched

* pm-sleep:
  PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
  PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
  PM: hibernate: Add more logging on hibernation failure
  PM: hibernate: improve arithmetic division in preallocate_highmem_fraction()
  PM: wakeup: Show statistics for deleted wakeup sources again
  PM: sleep: Switch to rtc_time64_to_tm()/rtc_tm_to_time64()
2020-01-27 11:29:09 +01:00
John Fastabend
b23bfa5633 bpf, xdp: Remove no longer required rcu_read_{un}lock()
Now that we depend on rcu_call() and synchronize_rcu() to also wait
for preempt_disabled region to complete the rcu read critical section
in __dev_map_flush() is no longer required. Except in a few special
cases in drivers that need it for other reasons.

These originally ensured the map reference was safe while a map was
also being free'd. And additionally that bpf program updates via
ndo_bpf did not happen while flush updates were in flight. But flush
by new rules can only be called from preempt-disabled NAPI context.
The synchronize_rcu from the map free path and the rcu_call from the
delete path will ensure the reference there is safe. So lets remove
the rcu_read_lock and rcu_read_unlock pair to avoid any confusion
around how this is being protected.

If the rcu_read_lock was required it would mean errors in the above
logic and the original patch would also be wrong.

Now that we have done above we put the rcu_read_lock in the driver
code where it is needed in a driver dependent way. I think this
helps readability of the code so we know where and why we are
taking read locks. Most drivers will not need rcu_read_locks here
and further XDP drivers already have rcu_read_locks in their code
paths for reading xdp programs on RX side so this makes it symmetric
where we don't have half of rcu critical sections define in driver
and the other half in devmap.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com
2020-01-27 11:16:25 +01:00
John Fastabend
42a84a8cd0 bpf, xdp: Update devmap comments to reflect napi/rcu usage
Now that we rely on synchronize_rcu and call_rcu waiting to
exit perempt-disable regions (NAPI) lets update the comments
to reflect this.

Fixes: 0536b85239 ("xdp: Simplify devmap cleanup")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-2-git-send-email-john.fastabend@gmail.com
2020-01-27 11:16:20 +01:00
Vasily Averin
90435a7891 bpf: map_seq_next should always increase position index
If seq_file .next fuction does not change position index,
read after some lseek can generate an unexpected output.

See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283

v1 -> v2: removed missed increment in end of function

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
2020-01-27 10:54:32 +01:00
Madhuparna Bhowmik
913292c97d sched.h: Annotate sighand_struct with __rcu
This patch fixes the following sparse errors by annotating the
sighand_struct with __rcu

kernel/fork.c:1511:9: error: incompatible types in comparison expression
kernel/exit.c💯19: error: incompatible types in comparison expression
kernel/signal.c:1370:27: error: incompatible types in comparison expression

This fix introduces the following sparse error in signal.c due to
checking the sighand pointer without rcu primitives:

kernel/signal.c:1386:21: error: incompatible types in comparison expression

This new sparse error is also fixed in this patch.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20200124045908.26389-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-26 10:54:47 +01:00
David S. Miller
4d8773b68e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in mlx5 because changes happened to code that has
moved meanwhile.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 10:40:21 +01:00
Paul E. McKenney
59d8cc6b2e rcu: Forgive slow expedited grace periods at boot time
Boot-time processing often loops in the kernel longer than one might
prefer, which can prevent expedited grace periods from completing in
a timely manner.  This in turn triggers a splat In nohz_full CPUs  One
could argue that long-looping code should be fixed, but on the other hand,
boot time is a bit special.

This commit therefore removes the splat.  Later commits will add the
splat back in, but in a way that removes false positives.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-25 12:00:40 -08:00
Steven Rostedt (VMware)
24589e3a20 tracing: Use pr_err() instead of WARN() for memory failures
As warnings can trigger panics, especially when "panic_on_warn" is set,
memory failure warnings can cause panics and fail fuzz testers that are
stressing memory.

Create a MEM_FAIL() macro to use instead of WARN() in the tracing code
(perhaps this should be a kernel wide macro?), and use that for memory
failure issues. This should stop failing fuzz tests due to warnings.

Link: https://lore.kernel.org/r/CACT4Y+ZP-7np20GVRu3p+eZys9GPtbu+JpfV+HtsufAzvTgJrg@mail.gmail.com

Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-25 10:52:30 -05:00
Jiri Olsa
e9b4e606c2 bpf: Allow to resolve bpf trampoline and dispatcher in unwind
When unwinding the stack we need to identify each address
to successfully continue. Adding latch tree to keep trampolines
for quick lookup during the unwind.

The patch uses first 48 bytes for latch tree node, leaving 4048
bytes from the rest of the page for trampoline or dispatcher
generated code.

It's still enough not to affect trampoline and dispatcher progs
maximum counts.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200123161508.915203-3-jolsa@kernel.org
2020-01-25 07:12:40 -08:00
Jiri Olsa
84ad7a7ab6 bpf: Allow BTF ctx access for string pointers
When accessing the context we allow access to arguments with
scalar type and pointer to struct. But we deny access for
pointer to scalar type, which is the case for many functions.

Alexei suggested to take conservative approach and allow
currently only string pointer access, which is the case
for most functions now:

Adding check if the pointer is to string type and allow access to it.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200123161508.915203-2-jolsa@kernel.org
2020-01-25 07:12:40 -08:00
Ingo Molnar
f8a4bb6bfa Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Expedited grace-period updates
 - kfree_rcu() updates
 - RCU list updates
 - Preemptible RCU updates
 - Torture-test updates
 - Miscellaneous fixes
 - Documentation updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-25 10:05:23 +01:00
Steven Rostedt (VMware)
28394da258 tracing: Decrement trace_array when bootconfig creates an instance
The trace_array_get_by_name() creates a ftrace instance and
trace_array_put() is used to remove the reference. Even though the
trace_array_get_by_name() creates the instance, it also adds a reference
count to it, that prevents user space from removing it.

As the bootconfig just creates the instance on boot up, it should still be
used where it can be deleted by user space after boot. A trace_array_put()
is required to let that happen.

Also, change the documentation on trace_array_get_by_name() to make this not
be so confusing.

Link: https://lore.kernel.org/r/20200124205927.76128804@rorschach.local.home

Fixes: 4f712a4d04 ("tracing/boot: Add instance node support")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-24 21:29:13 -05:00
Dan Carpenter
b3f7a6cd49 tracing: Remove unneeded NULL check
We checked "iter->trace" earlier so there is no need to check here.

Link: http://lkml.kernel.org/r/20141122183012.GB6994@mwanda

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-24 18:22:33 -05:00
Josef Bacik
cbc3b92ce0 tracing: Set kernel_stack's caller size properly
I noticed when trying to use the trace-cmd python interface that reading the raw
buffer wasn't working for kernel_stack events.  This is because it uses a
stubbed version of __dynamic_array that doesn't do the __data_loc trick and
encode the length of the array into the field.  Instead it just shows up as a
size of 0.  So change this to __array and set the len to FTRACE_STACK_ENTRIES
since this is what we actually do in practice and matches how user_stack_trace
works.

Link: http://lkml.kernel.org/r/1411589652-1318-1-git-send-email-jbacik@fb.com

Signed-off-by: Josef Bacik <jbacik@fb.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-24 18:09:40 -05:00
Luis Henriques
afccc00f75 tracing: Fix tracing_stat return values in error handling paths
tracing_stat_init() was always returning '0', even on the error paths.  It
now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
to created the 'trace_stat' debugfs directory.

Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com

Fixes: ed6f1c996b ("tracing: Check return value of tracing_init_dentry()")
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
[ Pulled from the archeological digging of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-24 18:06:48 -05:00
Steven Rostedt (VMware)
dfb6cd1e65 tracing: Fix very unlikely race of registering two stat tracers
Looking through old emails in my INBOX, I came across a patch from Luis
Henriques that attempted to fix a race of two stat tracers registering the
same stat trace (extremely unlikely, as this is done in the kernel, and
probably doesn't even exist). The submitted patch wasn't quite right as it
needed to deal with clean up a bit better (if two stat tracers were the
same, it would have the same files).

But to make the code cleaner, all we needed to do is to keep the
all_stat_sessions_mutex held for most of the registering function.

Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com

Fixes: 002bb86d8d ("tracing/ftrace: separate events tracing and stats tracing engine")
Reported-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-24 17:54:06 -05:00
Stephen Boyd
fd928f3e32 alarmtimer: Make alarmtimer_get_rtcdev() a stub when CONFIG_RTC_CLASS=n
The stubbed version of alarmtimer_get_rtcdev() is not exported.
so this won't work if this function is used in a module when
CONFIG_RTC_CLASS=n.

Move the stub function to the header file and make it inline so that
callers don't have to worry about linking against this symbol.

rtcdev isn't used outside of this ifdef so it's not required to be
redefined to NULL. Drop that while touching this area.

Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200124055849.154411-4-swboyd@chromium.org
2020-01-24 21:03:53 +01:00
Stephen Boyd
7c94caca87 alarmtimer: Use wakeup source from alarmtimer platform device
Use the wakeup source that can be associated with the 'alarmtimer'
platform device instead of registering another one by hand.

Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200124055849.154411-3-swboyd@chromium.org
2020-01-24 21:00:21 +01:00
Stephen Boyd
c79108bd19 alarmtimer: Make alarmtimer platform device child of RTC device
The alarmtimer_suspend() function will fail if an RTC device is on a bus
such as SPI or i2c and that RTC device registers and probes after
alarmtimer_init() registers and probes the 'alarmtimer' platform device.

This is because system wide suspend suspends devices in the reverse order
of their probe. When alarmtimer_suspend() attempts to program the RTC for a
wakeup it will try to program an RTC device on a bus that has already been
suspended.

Move the alarmtimer device registration to happen when the RTC which is
used for wakeup is registered. Register the 'alarmtimer' platform device as
a child of the RTC device too, so that it can be guaranteed that the RTC
device won't be suspended when alarmtimer_suspend() is called.

Reported-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200124055849.154411-2-swboyd@chromium.org
2020-01-24 21:00:20 +01:00
Stephen Boyd
6b088cefbe alarmtimer: Update alarmtimer_get_rtcdev() docs to reflect reality
This function doesn't do anything like this comment says when an RTC device
hasn't been chosen. It looks like we used to do something like that before
commit 8bc0dafb5c ("alarmtimers: Rework RTC device selection using class
interface") but that's long gone now. Remove this sentence to avoid
confusing the reader.

Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200124055849.154411-5-swboyd@chromium.org
2020-01-24 21:00:20 +01:00
Sebastian Andrzej Siewior
cb923159bb smp: Remove allocation mask from on_each_cpu_cond.*()
The allocation mask is no longer used by on_each_cpu_cond() and
on_each_cpu_cond_mask() and can be removed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-4-bigeasy@linutronix.de
2020-01-24 20:40:09 +01:00
Sebastian Andrzej Siewior
67719ef25e smp: Add a smp_cond_func_t argument to smp_call_function_many()
on_each_cpu_cond_mask() allocates a new CPU mask. The newly allocated
mask is a subset of the provided mask based on the conditional function.

This memory allocation can be avoided by extending smp_call_function_many()
with the conditional function and performing the remote function call based
on the mask and the conditional function.

Rename smp_call_function_many() to smp_call_function_many_cond() and add
the smp_cond_func_t argument. If smp_cond_func_t is provided then it is
used before invoking the function.  Provide smp_call_function_many() with
cond_func set to NULL.  Let on_each_cpu_cond_mask() use
smp_call_function_many_cond().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-3-bigeasy@linutronix.de
2020-01-24 20:40:09 +01:00
Sebastian Andrzej Siewior
5671d814db smp: Use smp_cond_func_t as type for the conditional function
Use a typdef for the conditional function instead defining it each time in
the function prototype.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200117090137.1205765-2-bigeasy@linutronix.de
2020-01-24 20:40:08 +01:00
Thomas Gleixner
43ee74487b irqchip updates for Linux 5.6:
- Conversion of the SiFive PLIC to hierarchical domains
 - New SiFive GPIO irqchip driver
 - New Aspeed SCI irqchip driver
 - New NXP INTMUX irqchip driver
 - Additional support for the Meson A1 GPIO irqchip
 - First part of the GICv4.1 support
 - Assorted fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl4rKn0PHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDHVoQALTTYQol+5Gz5pLxnROYEAdFjzrVrCarsK/b
 Cl4uVa5efOTCItSO3L9cEo1zoB++aJxPSOaKqX9hryPwPLTZzDiHYtVQ870tZB+k
 233cTvtT8+iw7/JPKnA8706TYDk1FUkJQ87V0gMLrnVH00dmJ8LvjW1bCdXV8iIa
 Ln78XIF+Ass+qJjSpCDRaOukDm6Qs+sZKAY0+nLXM8Ge564fdX7bPkDGN4tq9DLz
 74ZxY6s3rI5FoPceS270dtDf4Ib8gH+T8Bqd5AYSj/tcRE23s4muGb/O3Kez5Oko
 eEiuSadpep/kPQhgZlpX0tJgtEqHNfi6K8AIMscQQDFmJyuCqgR9/5as+UKX1V0M
 kPlOQtYCAVZmTnlOP6rA2V3RUFurVkFPkwUGzVYlCYxxrARvsH+vPxYqAPH/EEFq
 lGUo+2Z7Z+1ubPsnR8WKs8heC6qJidegGUtKoKYWroJl+tiuT6EtCP3J0QZPhdXT
 lVOBVnR6DHNIURuAEmag/eNYsBIj7PdmlByoMkBFn9LPE7Fn+OExJgbyVsu1IaTe
 AcUHmXR9QpcAKnDLmNSqFvhWsLo8CJ607rH3tL8vqnfijOHyt4AvKeE1R4QSavPx
 0F3FFNdo7Y1FAlJ9Ibw0gLvoIa6uP6FpdI3rht0iRaOZJlnDTbn+B8UayY0Ajvyp
 aGIjx7tY
 =8iz1
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

- Conversion of the SiFive PLIC to hierarchical domains
- New SiFive GPIO irqchip driver
- New Aspeed SCI irqchip driver
- New NXP INTMUX irqchip driver
- Additional support for the Meson A1 GPIO irqchip
- First part of the GICv4.1 support
- Assorted fixes
2020-01-24 20:08:51 +01:00
Paul E. McKenney
0e247386d9 Merge branches 'doc.2019.12.10a', 'exp.2019.12.09a', 'fixes.2020.01.24a', 'kfree_rcu.2020.01.24a', 'list.2020.01.10a', 'preempt.2020.01.24a' and 'torture.2019.12.09a' into HEAD
doc.2019.12.10a: Documentations updates
exp.2019.12.09a: Expedited grace-period updates
fixes.2020.01.24a: Miscellaneous fixes
kfree_rcu.2020.01.24a: Batch kfree_rcu() work
list.2020.01.10a: RCU-protected-list updates
preempt.2020.01.24a: Preemptible RCU updates
torture.2019.12.09a: Torture-test updates
2020-01-24 10:37:27 -08:00
Paul E. McKenney
f6105fc2a9 rcu: Remove unused stop-machine #include
Long ago, RCU used the stop-machine mechanism to implement expedited
grace periods, but no longer does so.  This commit therefore removes
the no-longer-needed #includes of linux/stop_machine.h.

Link: https://lwn.net/Articles/805317/
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:52 -08:00
Paul E. McKenney
844a378de3 srcu: Apply *_ONCE() to ->srcu_last_gp_end
The ->srcu_last_gp_end field is accessed from any CPU at any time
by synchronize_srcu(), so non-initialization references need to use
READ_ONCE() and WRITE_ONCE().  This commit therefore makes that change.

Reported-by: syzbot+08f3e9d26e5541e1ecf2@syzkaller.appspotmail.com
Acked-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:51 -08:00
Paul E. McKenney
7441e7661d rcu: Switch force_qs_rnp() to for_each_leaf_node_cpu_mask()
Currently, force_qs_rnp() uses a for_each_leaf_node_possible_cpu()
loop containing a check of the current CPU's bit in ->qsmask.
This works, but this commit saves three lines by instead using
for_each_leaf_node_cpu_mask(), which combines the functionality of
for_each_leaf_node_possible_cpu() and leaf_node_cpu_bit().  This commit
also replaces the use of the local variable "bit" with rdp->grpmask.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:51 -08:00
Ben Dooks
e1350e8e0e rcu: Move rcu_{expedited,normal} definitions into rcupdate.h
This commit moves the rcu_{expedited,normal} definitions from
kernel/rcu/update.c to include/linux/rcupdate.h to make sure they are
in sync, and also to avoid the following warning from sparse:

kernel/ksysfs.c:150:5: warning: symbol 'rcu_expedited' was not declared. Should it be static?
kernel/ksysfs.c:167:5: warning: symbol 'rcu_normal' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:50 -08:00
Lai Jiangshan
e2167b38c8 rcu: Move gp_state_names[] and gp_state_getname() to tree_stall.h
Only tree_stall.h needs to get name from GP state, so this commit
moves the gp_state_names[] array and the gp_state_getname()
from kernel/rcu/tree.h and kernel/rcu/tree.c, respectively, to
kernel/rcu/tree_stall.h.  While moving gp_state_names[], this commit
uses the GCC syntax to ensure that the right string is associated with
the right CPP macro.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:45 -08:00
Lai Jiangshan
4778339df0 rcu: Remove the declaration of call_rcu() in tree.h
The call_rcu() function is an external RCU API that is declared in
include/linux/rcupdate.h.  There is thus no point in redeclaring it
in kernel/rcu/tree.h, so this commit removes that redundant declaration.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:38 -08:00
Lai Jiangshan
2488a5e695 rcu: Fix tracepoint tracking RCU CPU kthread utilization
In the call to trace_rcu_utilization() at the start of the loop in
rcu_cpu_kthread(), "rcu_wait" is incorrect, plus this trace event needs
to be hoisted above the loop to balance with either the "rcu_wait" or
"rcu_yield", depending on how the loop exits.  This commit therefore
makes these changes.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:31 -08:00
Lai Jiangshan
822175e729 rcu: Fix harmless omission of "CONFIG_" from #if condition
The C preprocessor macros SRCU and TINY_RCU should instead be CONFIG_SRCU
and CONFIG_TINY_RCU, respectively in the #f in kernel/rcu/rcu.h. But
there is no harm when "TINY_RCU" is wrongly used, which are always
non-defined, which makes "!defined(TINY_RCU)" always true, which means
the code block is always included, and the included code block doesn't
cause any compilation error so far in CONFIG_TINY_RCU builds.  It is
also the reason this change should not be taken in -stable.

This commit adds the needed "CONFIG_" prefix to both macros.

Not for -stable.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:33:13 -08:00
Paul E. McKenney
5b14557b07 rcu: Avoid tick_dep_set_cpu() misordering
In the current code, rcu_nmi_enter_common() might decide to turn on
the tick using tick_dep_set_cpu(), but be delayed just before doing so.
Then the grace-period kthread might notice that the CPU in question had
in fact gone through a quiescent state, thus turning off the tick using
tick_dep_clear_cpu().  The later invocation of tick_dep_set_cpu() would
then incorrectly leave the tick on.

This commit therefore enlists the aid of the leaf rcu_node structure's
->lock to ensure that decisions to enable or disable the tick are
carried out before they can be reversed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:27:33 -08:00
Lai Jiangshan
77339e61aa rcu: Provide wrappers for uses of ->rcu_read_lock_nesting
This commit provides wrapper functions for uses of ->rcu_read_lock_nesting
to improve readability and to ease future changes to support inlining
of __rcu_read_lock() and __rcu_read_unlock().

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:27:33 -08:00
Paul E. McKenney
c51f83c315 rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()
The rcu_node structure's ->expmask field is updated only when holding the
->lock, but is also accessed locklessly.  This means that all ->expmask
updates must use WRITE_ONCE() and all reads carried out without holding
->lock must use READ_ONCE().  This commit therefore changes the lockless
->expmask read in rcu_read_unlock_special() to use READ_ONCE().

Reported-by: syzbot+99f4ddade3c22ab0cf23@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
2020-01-24 10:27:33 -08:00
Lai Jiangshan
3717e1e9f2 rcu: Clear ->rcu_read_unlock_special only once
In rcu_preempt_deferred_qs_irqrestore(), ->rcu_read_unlock_special is
cleared one piece at a time.  Given that the "if" statements in this
function use the copy in "special", this commit removes the clearing
of the individual pieces in favor of clearing ->rcu_read_unlock_special
in one go just after it has been determined to be non-zero.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:27:33 -08:00
Lai Jiangshan
2eeba5838f rcu: Clear .exp_hint only when deferred quiescent state has been reported
Currently, the .exp_hint flag is cleared in rcu_read_unlock_special(),
which works, but which can also prevent subsequent rcu_read_unlock() calls
from helping expedite the quiescent state needed by an ongoing expedited
RCU grace period.  This commit therefore defers clearing of .exp_hint
from rcu_read_unlock_special() to rcu_preempt_deferred_qs_irqrestore(),
thus ensuring that intervening calls to rcu_read_unlock() have a chance
to help end the expedited grace period.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:27:33 -08:00
Lai Jiangshan
c130d2dc93 rcu: Rename some instance of CONFIG_PREEMPTION to CONFIG_PREEMPT_RCU
CONFIG_PREEMPTION and CONFIG_PREEMPT_RCU are always identical,
but some code depends on CONFIG_PREEMPTION to access to
rcu_preempt functionality. This patch changes CONFIG_PREEMPTION
to CONFIG_PREEMPT_RCU in these cases.

Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:26:28 -08:00
Joel Fernandes (Google)
189a6883dc rcu: Remove kfree_call_rcu_nobatch()
Now that the kfree_rcu() special-casing has been removed from tree RCU,
this commit removes kfree_call_rcu_nobatch() since it is no longer needed.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:24:31 -08:00
Joel Fernandes (Google)
77a40f9703 rcu: Remove kfree_rcu() special casing and lazy-callback handling
This commit removes kfree_rcu() special-casing and the lazy-callback
handling from Tree RCU.  It moves some of this special casing to Tiny RCU,
the removal of which will be the subject of later commits.

This results in a nice negative delta.

Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Add slab.h #include, thanks to kbuild test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:24:31 -08:00
Joel Fernandes (Google)
e99637becb rcu: Add support for debug_objects debugging for kfree_rcu()
This commit applies RCU's debug_objects debugging to the new batched
kfree_rcu() implementations.  The object is queued at the kfree_rcu()
call and dequeued during reclaim.

Tested that enabling CONFIG_DEBUG_OBJECTS_RCU_HEAD successfully detects
double kfree_rcu() calls.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Fix IRQ per kbuild test robot <lkp@intel.com> feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:24:31 -08:00
Joel Fernandes (Google)
0392bebebf rcu: Add multiple in-flight batches of kfree_rcu() work
During testing, it was observed that amount of memory consumed due
kfree_rcu() batching is 300-400MB. Previously we had only a single
head_free pointer pointing to the list of rcu_head(s) that are to be
freed after a grace period. Until this list is drained, we cannot queue
any more objects on it since such objects may not be ready to be
reclaimed when the worker thread eventually gets to drainin g the
head_free list.

We can do better by maintaining multiple lists as done by this patch.
Testing shows that memory consumption came down by around 100-150MB with
just adding another list. Adding more than 1 additional list did not
show any improvement.

Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Code style and initialization handling. ]
[ paulmck: Fix field name, reported by kbuild test robot <lkp@intel.com>. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:24:31 -08:00
Joel Fernandes
569d767087 rcu: Make kfree_rcu() use a non-atomic ->monitor_todo
Because the ->monitor_todo field is always protected by krcp->lock,
this commit downgrades from xchg() to non-atomic unmarked assignment
statements.

Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
[ paulmck: Update to include early-boot kick code. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:24:31 -08:00
Joel Fernandes (Google)
e6e78b004f rcuperf: Add kfree_rcu() performance Tests
This test runs kfree_rcu() in a loop to measure performance of the new
kfree_rcu() batching functionality.

The following table shows results when booting with arguments:
rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000
rcuperf.kfree_rcu_test=1 rcuperf.kfree_no_batch=X

rcuperf.kfree_no_batch=X    # Grace Periods	Test Duration (s)
  X=1 (old behavior)              9133                 11.5
  X=0 (new behavior)              1732                 12.5

On a 16 CPU system with the above boot parameters, we see that the total
number of grace periods that elapse during the test drops from 9133 when
not batching to 1732 when batching (a 5X improvement). The kfree_rcu()
flood itself slows down a bit when batching, though, as shown.

Note that the active memory consumption during the kfree_rcu() flood
does increase to around 200-250MB due to the batching (from around 50MB
without batching). However, this memory consumption is relatively
constant. In other words, the system is able to keep up with the
kfree_rcu() load. The memory consumption comes down considerably if
KFREE_DRAIN_JIFFIES is increased from HZ/50 to HZ/80. A later patch will
reduce memory consumption further by using multiple lists.

Also, when running the test, please disable CONFIG_DEBUG_PREEMPT and
CONFIG_PROVE_RCU for realistic comparisons with/without batching.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:24:31 -08:00
Byungchul Park
a35d16905e rcu: Add basic support for kfree_rcu() batching
Recently a discussion about stability and performance of a system
involving a high rate of kfree_rcu() calls surfaced on the list [1]
which led to another discussion how to prepare for this situation.

This patch adds basic batching support for kfree_rcu(). It is "basic"
because we do none of the slab management, dynamic allocation, code
moving or any of the other things, some of which previous attempts did
[2]. These fancier improvements can be follow-up patches and there are
different ideas being discussed in those regards. This is an effort to
start simple, and build up from there. In the future, an extension to
use kfree_bulk and possibly per-slab batching could be done to further
improve performance due to cache-locality and slab-specific bulk free
optimizations. By using an array of pointers, the worker thread
processing the work would need to read lesser data since it does not
need to deal with large rcu_head(s) any longer.

Torture tests follow in the next patch and show improvements of around
5x reduction in number of  grace periods on a 16 CPU system. More
details and test data are in that patch.

There is an implication with rcu_barrier() with this patch. Since the
kfree_rcu() calls can be batched, and may not be handed yet to the RCU
machinery in fact, the monitor may not have even run yet to do the
queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
to wait for those kfree_rcu()s that are already made. So this means a
kfree_rcu() followed by an rcu_barrier() does not imply that memory will
be freed once rcu_barrier() returns.

Another implication is higher active memory usage (although not
run-away..) until the kfree_rcu() flooding ends, in comparison to
without batching. More details about this are in the second patch which
adds an rcuperf test.

Finally, in the near future we will get rid of kfree_rcu() special casing
within RCU such as in rcu_do_batch and switch everything to just
batching. Currently we don't do that since timer subsystem is not yet up
and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
lock are not initialized. That would also mean getting rid of
kfree_call_rcu_nobatch() entirely.

[1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@kernel.org
[2] https://lkml.org/lkml/2017/12/19/824

Cc: kernel-team@android.com
Cc: kernel-team@lge.com
Co-developed-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Applied 0day and Paul Walmsley feedback on ->monitor_todo. ]
[ paulmck: Make it work during early boot. ]
[ paulmck: Add a crude early boot self-test. ]
[ paulmck: Style adjustments and experimental docbook structure header. ]
Link: https://lore.kernel.org/lkml/alpine.DEB.2.21.9999.1908161931110.32497@viisi.sifive.com/T/#me9956f66cb611b95d26ae92700e1d901f46e8c59
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:17:03 -08:00
Ingo Molnar
7add7875a8 Merge branch 'kcsan.2020.01.07a' into locking/kcsan
Pull KCSAN updates from Paul E. McKenney:

 - UBSAN fixes
 - inlining updates
 - documentation updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-24 09:35:51 +01:00
Amol Grover
485ec2ea9c bpf, devmap: Pass lockdep expression to RCU lists
head is traversed using hlist_for_each_entry_rcu outside an RCU
read-side critical section but under the protection of dtab->index_lock.

Hence, add corresponding lockdep expression to silence false-positive
lockdep warnings, and harden RCU lists.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200123120437.26506-1-frextrite@gmail.com
2020-01-23 23:01:16 +01:00
Linus Torvalds
34597c85be Various tracing fixes:
- Fix a function comparison warning for a xen trace event macro
  - Fix a double perf_event linking to a trace_uprobe_filter for multiple events
  - Fix suspicious RCU warnings in trace event code for using
     list_for_each_entry_rcu() when the "_rcu" portion wasn't needed.
  - Fix a bug in the histogram code when using the same variable
  - Fix a NULL pointer dereference when tracefs lockdown enabled and calling
     trace_set_default_clock()
 
 This v2 version contains:
 
  - A fix to a bug found with the double perf_event linking patch
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXinakBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhNZAQCi86p9eW3f3w7hM2hZcirC+mQKVZgp
 2rO4zIAK5V6G7gEAh6I7VZa50a6AE647ZjryE7ufTRUhmSFMWoG0kcJ7OAk=
 =/J9n
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various tracing fixes:

   - Fix a function comparison warning for a xen trace event macro

   - Fix a double perf_event linking to a trace_uprobe_filter for
     multiple events

   - Fix suspicious RCU warnings in trace event code for using
     list_for_each_entry_rcu() when the "_rcu" portion wasn't needed.

   - Fix a bug in the histogram code when using the same variable

   - Fix a NULL pointer dereference when tracefs lockdown enabled and
     calling trace_set_default_clock()

   - A fix to a bug found with the double perf_event linking patch"

* tag 'trace-v5.5-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/uprobe: Fix to make trace_uprobe_filter alignment safe
  tracing: Do not set trace clock if tracefs lockdown is in effect
  tracing: Fix histogram code when expression has same var as value
  tracing: trigger: Replace unneeded RCU-list traversals
  tracing/uprobe: Fix double perf_event linking on multiprobe uprobe
  tracing: xen: Ordered comparison of function pointers
2020-01-23 11:23:37 -08:00
Linus Torvalds
3a83c8c81c Power management fix for 5.5-rc8
Prevent the kernel from crashing during resume from hibernation
 if free pages contain leftover data from the restore kernel and
 init_on_free is set (Alexander Potapenko).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl4psjQSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxpOcP/1UTGUr+VdGfBBjG5WlgCY0Jrd50Y78b
 RoVNDR/NvSVSuIs44AgrnfyQiz2Y8jG6qY1iSAbIXmFl37/4+kGhkNmd6pV/xFUc
 TdZZotFwFlRQjmeQxxH0kNXuAY6nJ2RwELWrjXqM8PuNjNKIEpfS+0fSaWexHqIm
 MDArxcDHkvZU5SnnRQM+LkT/EmbEheB7tgm7vGGqMLsSKc0gUsBmVCURe/lLAH5o
 EUKX4FI2jCy+LlmSdZ3EDjf1cstm3YXLiegTLSq1Jh3mFHXkFTwJMmidiz21qXJh
 Hc4r3iG0NZ37J8HXpwuq++KlhvNbhHJz+ZgC1IYls16RNYh5mUzxtMHWdSyyqlrW
 +z8gBVUyeJUYos5Kjb/NKSt43gnz7Uhy0UVbQXD66hgajXe71CQZQq/D5CMeTdJL
 jWNaeGYnhskz3IW2vnrs9Ucf6RHHWezXk51kVsyJXadiLhTdOv7DKahDKVwC/Hvf
 kyN1W0F5PZpF50yYmnhJgqDfxkGBNKpwXxTAGk6X0WQFaWeh/2FkX045UdJzBbHu
 fa1taTM/5RfPlbWq0wLPeHHSP4M2I0ndeWXZk88vUwwMfm9Wo+FNBhgs/EZfeuKD
 16sVMsX0r7R1bG+hEj+mvNeLWqfgz7MpGludCkV1dHIDkn2esxx6JqaWxR/UnLHb
 D3fZM/cKGdpY
 =kDFG
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Prevent the kernel from crashing during resume from hibernation if
  free pages contain leftover data from the restore kernel and
  init_on_free is set (Alexander Potapenko)"

* tag 'pm-5.5-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: hibernate: fix crashes with init_on_free=1
2020-01-23 11:10:21 -08:00
Rafael J. Wysocki
322e929d19 Merge back new material related to system-wide PM for v5.6. 2020-01-23 16:00:56 +01:00
David S. Miller
954b3c4397 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-01-22

The following pull-request contains BPF updates for your *net-next* tree.

We've added 92 non-merge commits during the last 16 day(s) which contain
a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

The main changes are:

1) function by function verification and program extensions from Alexei.

2) massive cleanup of selftests/bpf from Toke and Andrii.

3) batched bpf map operations from Brian and Yonghong.

4) tcp congestion control in bpf from Martin.

5) bulking for non-map xdp_redirect form Toke.

6) bpf_send_signal_thread helper from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 08:10:16 +01:00
Martin KaFai Lau
5576b991e9 bpf: Add BPF_FUNC_jiffies64
This patch adds a helper to read the 64bit jiffies.  It will be used
in a later patch to implement the bpf_cubic.c.

The helper is inlined for jit_requested and 64 BITS_PER_LONG
as the map_gen_lookup().  Other cases could be considered together
with map_gen_lookup() if needed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com
2020-01-22 16:30:10 -08:00
Alexei Starovoitov
be8704ff07 bpf: Introduce dynamic program extensions
Introduce dynamic program extensions. The users can load additional BPF
functions and replace global functions in previously loaded BPF programs while
these programs are executing.

Global functions are verified individually by the verifier based on their types only.
Hence the global function in the new program which types match older function can
safely replace that corresponding function.

This new function/program is called 'an extension' of old program. At load time
the verifier uses (attach_prog_fd, attach_btf_id) pair to identify the function
to be replaced. The BPF program type is derived from the target program into
extension program. Technically bpf_verifier_ops is copied from target program.
The BPF_PROG_TYPE_EXT program type is a placeholder. It has empty verifier_ops.
The extension program can call the same bpf helper functions as target program.
Single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program
types. The verifier allows only one level of replacement. Meaning that the
extension program cannot recursively extend an extension. That also means that
the maximum stack size is increasing from 512 to 1024 bytes and maximum
function nesting level from 8 to 16. The programs don't always consume that
much. The stack usage is determined by the number of on-stack variables used by
the program. The verifier could have enforced 512 limit for combined original
plus extension program, but it makes for difficult user experience. The main
use case for extensions is to provide generic mechanism to plug external
programs into policy program or function call chaining.

BPF trampoline is used to track both fentry/fexit and program extensions
because both are using the same nop slot at the beginning of every BPF
function. Attaching fentry/fexit to a function that was replaced is not
allowed. The opposite is true as well. Replacing a function that currently
being analyzed with fentry/fexit is not allowed. The executable page allocated
by BPF trampoline is not used by program extensions. This inefficiency will be
optimized in future patches.

Function by function verification of global function supports scalars and
pointer to context only. Hence program extensions are supported for such class
of global functions only. In the future the verifier will be extended with
support to pointers to structures, arrays with sizes, etc.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200121005348.2769920-2-ast@kernel.org
2020-01-22 23:04:52 +01:00
Ming Lei
11ea68f553 genirq, sched/isolation: Isolate from handling managed interrupts
The affinity of managed interrupts is completely handled in the kernel and
cannot be changed via the /proc/irq/* interfaces from user space. As the
kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent
vector exhaustion, it can happen that a managed interrupt whose affinity
mask contains both isolated and housekeeping CPUs is routed to an isolated
CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
on the isolated CPU.

Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
logic in the interrupt affinity selection code.

The subparameter indicates to the interrupt affinity selection logic that
it should try to avoid the above scenario.

This isolation is best effort and only effective if the automatically
assigned interrupt mask of a device queue contains isolated and
housekeeping CPUs. If housekeeping CPUs are online then such interrupts are
directed to the housekeeping CPU so that IO submitted on the housekeeping
CPU cannot disturb the isolated CPU.

If a queue's affinity mask contains only isolated CPUs then this parameter
has no effect on the interrupt routing decision, though interrupts are only
happening when tasks running on those isolated CPUs submit IO. IO submitted
on housekeeping CPUs has no influence on those queues.

If the affinity mask contains both housekeeping and isolated CPUs, but none
of the contained housekeeping CPUs is online, then the interrupt is also
routed to an isolated CPU. Interrupts are only delivered when one of the
isolated CPUs in the affinity mask submits IO. If one of the contained
housekeeping CPUs comes online, the CPU hotplug logic migrates the
interrupt automatically back to the upcoming housekeeping CPU. Depending on
the type of interrupt controller, this can require that at least one
interrupt is delivered to the isolated CPU in order to complete the
migration.

[ tglx: Removed unused parameter, added and edited comments/documentation
  	and rephrased the changelog so it contains more details. ]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
2020-01-22 16:29:49 +01:00
Jules Irenge
eb5a4d0a9e hrtimer: Add missing sparse annotation for __run_timer()
Sparse reports a warning at __run_hrtimer()
|warning: context imbalance in __run_hrtimer() - unexpected unlock

Add the missing must_hold() annotation.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200120224347.51843-1-jbi.octave@gmail.com
2020-01-22 15:50:11 +01:00
Masami Hiramatsu
b61387cb73 tracing/uprobe: Fix to make trace_uprobe_filter alignment safe
Commit 99c9a923e9 ("tracing/uprobe: Fix double perf_event
linking on multiprobe uprobe") moved trace_uprobe_filter on
trace_probe_event. However, since it introduced a flexible
data structure with char array and type casting, the
alignment of trace_uprobe_filter can be broken.

This changes the type of the array to trace_uprobe_filter
data strucure to fix it.

Link: http://lore.kernel.org/r/20200120124022.GA14897@hirez.programming.kicks-ass.net
Link: http://lkml.kernel.org/r/157966340499.5107.10978352478952144902.stgit@devnote2

Fixes: 99c9a923e9 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-22 07:09:20 -05:00
Alex Shi
659ded3027 trace/kprobe: Remove unused MAX_KPROBE_CMDLINE_SIZE
This limitation are never lunched from introduce commit 970988e19e
("tracing/kprobe: Add kprobe_event= boot parameter")

Could we remove it if no intention to implement it?

Link: http://lkml.kernel.org/r/1579586075-45132-1-git-send-email-alex.shi@linux.alibaba.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-22 07:07:38 -05:00
Steven Rostedt (VMware)
34423f250a tracing: Fix uninitialized buffer var on early exit to trace_vbprintk()
If we exit due to a bad input to trace_printk() (highly unlikely), then the
buffer variable will not be initialized when we unnest the ring buffer.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-22 06:44:50 -05:00
Alexei Starovoitov
f59bbfc2f6 bpf: Fix error path under memory pressure
Restore the 'if (env->cur_state)' check that was incorrectly removed during
code move. Under memory pressure env->cur_state can be freed and zeroed inside
do_check(). Hence the check is necessary.

Fixes: 51c39bb1d5 ("bpf: Introduce function-by-function verification")
Reported-by: syzbot+b296579ba5015704d9fa@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200122024138.3385590-1-ast@kernel.org
2020-01-22 12:09:02 +01:00
Alexei Starovoitov
05d57f1793 bpf: Fix trampoline usage in preempt
Though the second half of trampoline page is unused a task could be
preempted in the middle of the first half of trampoline and two
updates to trampoline would change the code from underneath the
preempted task. Hence wait for tasks to voluntarily schedule or go
to userspace. Add similar wait before freeing the trampoline.

Fixes: fec56f5890 ("bpf: Introduce BPF trampoline")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/bpf/20200121032231.3292185-1-ast@kernel.org
2020-01-22 11:31:21 +01:00
Dan Carpenter
532f49a6f1 tracing/boot: Fix an IS_ERR() vs NULL bug
The trace_array_get_by_name() function doesn't return error pointers,
it returns NULL on error.

Link: http://lkml.kernel.org/r/20200117053007.5h2juv272pokqhtq@kili.mountain

Fixes: 4f712a4d04 ("tracing/boot: Add instance node support")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-21 18:41:39 -05:00
Alex Shi
141597204e tracing: Remove unused TRACE_SEQ_BUF_USED
This macro isn't used from commit 3a161d99c4 ("tracing: Create
seq_buf layer in trace_seq"). so no needs to keep it.

Link: http://lkml.kernel.org/r/1579586086-45543-1-git-send-email-alex.shi@linux.alibaba.com

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-21 18:39:54 -05:00
Alex Shi
b83479482f ring-buffer: Remove abandoned macro RB_MISSED_FLAGS
This macro isn't used since commit d325c40296 ("ring-buffer: Remove
unused function ring_buffer_page_len()"), so better to remove it.

Link: http://lkml.kernel.org/r/1579586080-45300-1-git-send-email-alex.shi@linux.alibaba.com

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-21 18:38:02 -05:00
Al Viro
b87121dd3f bpf: don't bother with getname/kern_path - use user_path_at
kernel/bpf/inode.c misuses kern_path...() - it's much simpler (and
more efficient, on top of that) to use user_path...() counterparts
rather than bothering with doing getname() manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200120232858.GF8904@ZenIV.linux.org.uk
2020-01-21 23:46:21 +01:00
Alex Shi
aff4866db5 ftrace: Remove NR_TO_INIT macro
This macro isn't used from commit cb7be3b2fc ("ftrace: remove
daemon"). So no needs to keep it.

Link: http://lkml.kernel.org/r/1579586063-44984-1-git-send-email-alex.shi@linux.alibaba.com

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-21 17:30:39 -05:00
Alex Shi
9a09cd74e7 ftrace: Remove abandoned macros
These 2 macros aren't used from commit eee8ded131 ("ftrace: Have the
function probes call their own function"), so remove them.

Link: http://lkml.kernel.org/r/1579585807-43316-1-git-send-email-alex.shi@linux.alibaba.com

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-21 17:28:35 -05:00
Brian Vazquez
2e3a94aa2b bpf: Fix memory leaks in generic update/delete batch ops
Generic update/delete batch ops functions were using __bpf_copy_key
without properly freeing the memory. Handle the memory allocation and
copy_from_user separately.

Fixes: aa2e93b8e5 ("bpf: Add generic support for update and delete batch ops")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200119194040.128369-1-brianvv@google.com
2020-01-20 22:27:51 +01:00
Masami Ichikawa
bf24daac8f tracing: Do not set trace clock if tracefs lockdown is in effect
When trace_clock option is not set and unstable clcok detected,
tracing_set_default_clock() sets trace_clock(ThinkPad A285 is one of
case). In that case, if lockdown is in effect, null pointer
dereference error happens in ring_buffer_set_clock().

Link: http://lkml.kernel.org/r/20200116131236.3866925-1-masami256@gmail.com

Cc: stable@vger.kernel.org
Fixes: 17911ff38a ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1788488
Signed-off-by: Masami Ichikawa <masami256@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-20 16:18:14 -05:00
Steven Rostedt (VMware)
8bcebc77e8 tracing: Fix histogram code when expression has same var as value
While working on a tool to convert SQL syntex into the histogram language of
the kernel, I discovered the following bug:

 # echo 'first u64 start_time u64 end_time pid_t pid u64 delta' >> synthetic_events
 # echo 'hist:keys=pid:start=common_timestamp' > events/sched/sched_waking/trigger
 # echo 'hist:keys=next_pid:delta=common_timestamp-$start,start2=$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger

Would not display any histograms in the sched_switch histogram side.

But if I were to swap the location of

  "delta=common_timestamp-$start" with "start2=$start"

Such that the last line had:

 # echo 'hist:keys=next_pid:start2=$start,delta=common_timestamp-$start:onmatch(sched.sched_waking).trace(first,$start2,common_timestamp,next_pid,$delta)' > events/sched/sched_switch/trigger

The histogram works as expected.

What I found out is that the expressions clear out the value once it is
resolved. As the variables are resolved in the order listed, when
processing:

  delta=common_timestamp-$start

The $start is cleared. When it gets to "start2=$start", it errors out with
"unresolved symbol" (which is silent as this happens at the location of the
trace), and the histogram is dropped.

When processing the histogram for variable references, instead of adding a
new reference for a variable used twice, use the same reference. That way,
not only is it more efficient, but the order will no longer matter in
processing of the variables.

From Tom Zanussi:

 "Just to clarify some more about what the problem was is that without
  your patch, we would have two separate references to the same variable,
  and during resolve_var_refs(), they'd both want to be resolved
  separately, so in this case, since the first reference to start wasn't
  part of an expression, it wouldn't get the read-once flag set, so would
  be read normally, and then the second reference would do the read-once
  read and also be read but using read-once.  So everything worked and
  you didn't see a problem:

   from: start2=$start,delta=common_timestamp-$start

  In the second case, when you switched them around, the first reference
  would be resolved by doing the read-once, and following that the second
  reference would try to resolve and see that the variable had already
  been read, so failed as unset, which caused it to short-circuit out and
  not do the trigger action to generate the synthetic event:

   to: delta=common_timestamp-$start,start2=$start

  With your patch, we only have the single resolution which happens
  correctly the one time it's resolved, so this can't happen."

Link: https://lore.kernel.org/r/20200116154216.58ca08eb@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Tom Zanuss <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-20 16:11:47 -05:00
Kevin Hao
0f394daef8 irqdomain: Fix a memory leak in irq_domain_push_irq()
Fix a memory leak reported by kmemleak:
unreferenced object 0xffff000bc6f50e80 (size 128):
  comm "kworker/23:2", pid 201, jiffies 4294894947 (age 942.132s)
  hex dump (first 32 bytes):
    00 00 00 00 41 00 00 00 86 c0 03 00 00 00 00 00  ....A...........
    00 a0 b2 c6 0b 00 ff ff 40 51 fd 10 00 80 ff ff  ........@Q......
  backtrace:
    [<00000000e62d2240>] kmem_cache_alloc_trace+0x1a4/0x320
    [<00000000279143c9>] irq_domain_push_irq+0x7c/0x188
    [<00000000d9f4c154>] thunderx_gpio_probe+0x3ac/0x438
    [<00000000fd09ec22>] pci_device_probe+0xe4/0x198
    [<00000000d43eca75>] really_probe+0xdc/0x320
    [<00000000d3ebab09>] driver_probe_device+0x5c/0xf0
    [<000000005b3ecaa0>] __device_attach_driver+0x88/0xc0
    [<000000004e5915f5>] bus_for_each_drv+0x7c/0xc8
    [<0000000079d4db41>] __device_attach+0xe4/0x140
    [<00000000883bbda9>] device_initial_probe+0x18/0x20
    [<000000003be59ef6>] bus_probe_device+0x98/0xa0
    [<0000000039b03d3f>] deferred_probe_work_func+0x74/0xa8
    [<00000000870934ce>] process_one_work+0x1c8/0x470
    [<00000000e3cce570>] worker_thread+0x1f8/0x428
    [<000000005d64975e>] kthread+0xfc/0x128
    [<00000000f0eaa764>] ret_from_fork+0x10/0x18

Fixes: 495c38d300 ("irqdomain: Add irq_domain_{push,pop}_irq() functions")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200120043547.22271-1-haokexin@gmail.com
2020-01-20 19:10:05 +00:00
Jessica Yu
708e0ada19 module: avoid setting info->name early in case we can fall back to info->mod->name
In setup_load_info(), info->name (which contains the name of the module,
mostly used for early logging purposes before the module gets set up)
gets unconditionally assigned if .modinfo is missing despite the fact
that there is an if (!info->name) check near the end of the function.
Avoid assigning a placeholder string to info->name if .modinfo doesn't
exist, so that we can fall back to info->mod->name later on.

Fixes: 5fdc7db644 ("module: setup load info before module_sig_check()")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-01-20 16:59:39 +01:00
Yash Shah
b01ecceaf2 genirq: Introduce irq_domain_translate_onecell
Add a new function irq_domain_translate_onecell() that is to be used as
the translate function in struct irq_domain_ops.

Signed-off-by: Yash Shah <yash.shah@sifive.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1575976274-13487-2-git-send-email-yash.shah@sifive.com
2020-01-20 09:19:33 +00:00
Ingo Molnar
cb6c82df68 Linux 5.5-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4k7i8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGvk0IAKRenVOdiudY77SQ
 VZjsteyrYTTQtPPv494ToIRjR0XQ+gYp8vyWzXTUC5Nm9Y9U3VzDqUPUjWszrSXE
 6mU+tzcMc9qwuUxnIFn8zfg64ygw+37sn/w3xqeH4QmF9Z5Wl3EX3SdXTs7jp3RS
 VxiztkUNI5ZBV2GDtla5K/9qLPqCQnUYXIiyi5lAtBtiitZDVXFp7dy7hMgEiaEO
 +78K5Kh3xlt5ndDsBFOlwIb2Oof3KL7bBXntdbSBc/bjol6IRvAgln48HWCv59G2
 jzAp2tj2KobX9GRAEPj+v4TQZEW0SXDNDi8MgQsM+3DYVCTmANsv57CBKRuf01+F
 nB1kAys=
 =zSnJ
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-20 08:43:44 +01:00
Ingo Molnar
837171fe77 Linux 5.5-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4k7i8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGvk0IAKRenVOdiudY77SQ
 VZjsteyrYTTQtPPv494ToIRjR0XQ+gYp8vyWzXTUC5Nm9Y9U3VzDqUPUjWszrSXE
 6mU+tzcMc9qwuUxnIFn8zfg64ygw+37sn/w3xqeH4QmF9Z5Wl3EX3SdXTs7jp3RS
 VxiztkUNI5ZBV2GDtla5K/9qLPqCQnUYXIiyi5lAtBtiitZDVXFp7dy7hMgEiaEO
 +78K5Kh3xlt5ndDsBFOlwIb2Oof3KL7bBXntdbSBc/bjol6IRvAgln48HWCv59G2
 jzAp2tj2KobX9GRAEPj+v4TQZEW0SXDNDi8MgQsM+3DYVCTmANsv57CBKRuf01+F
 nB1kAys=
 =zSnJ
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc7' into locking/kcsan, to refresh the tree

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-20 08:42:47 +01:00
Viresh Kumar
afa70d941f sched/fair: Define sched_idle_cpu() only for SMP configurations
sched_idle_cpu() isn't used for non SMP configuration and with a recent
change, we have started getting following warning:

  kernel/sched/fair.c:5221:12: warning: ‘sched_idle_cpu’ defined but not used [-Wunused-function]

Fix that by defining sched_idle_cpu() only for SMP configurations.

Fixes: 323af6deaf ("sched/fair: Load balance aggressively for SCHED_IDLE CPUs")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/f0554f590687478b33914a4aff9f0e6a62886d44.1579499907.git.viresh.kumar@linaro.org
2020-01-20 08:03:39 +01:00
Johannes Berg
87c9366e17 Revert "um: Enable CONFIG_CONSTRUCTORS"
This reverts commit 786b2384bf ("um: Enable CONFIG_CONSTRUCTORS").

There are two issues with this commit, uncovered by Anton in tests
on some (Debian) systems:

1) I completely forgot to call any constructors if CONFIG_CONSTRUCTORS
   isn't set. Don't recall now if it just wasn't needed on my system, or
   if I never tested this case.

2) With that fixed, it works - with CONFIG_CONSTRUCTORS *unset*. If I
   set CONFIG_CONSTRUCTORS, it fails again, which isn't totally
   unexpected since whatever wanted to run is likely to have to run
   before the kernel init etc. that calls the constructors in this case.

Basically, some constructors that gcc emits (libc has?) need to run
very early during init; the failure mode otherwise was that the ptrace
fork test already failed:

----------------------
$ ./linux mem=512M
Core dump limits :
	soft - 0
	hard - NONE
Checking that ptrace can change system call numbers...check_ptrace : child exited with exitcode 6, while expecting 0; status 0x67f
Aborted
----------------------

Thinking more about this, it's clear that we simply cannot support
CONFIG_CONSTRUCTORS in UML. All the cases we need now (gcov, kasan)
involve not use of the __attribute__((constructor)), but instead
some constructor code/entry generated by gcc. Therefore, we cannot
distinguish between kernel constructors and system constructors.

Thus, revert this commit.

Cc: stable@vger.kernel.org [5.4+]
Fixes: 786b2384bf ("um: Enable CONFIG_CONSTRUCTORS")
Reported-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Anton Ivanov <anton.ivanov@cambridgegreys.co.uk>

Signed-off-by: Richard Weinberger <richard@nod.at>
2020-01-19 22:42:06 +01:00
David S. Miller
b3f7e3f23a Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
Linus Torvalds
11a8272947 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix non-blocking connect() in x25, from Martin Schiller.

 2) Fix spurious decryption errors in kTLS, from Jakub Kicinski.

 3) Netfilter use-after-free in mtype_destroy(), from Cong Wang.

 4) Limit size of TSO packets properly in lan78xx driver, from Eric
    Dumazet.

 5) r8152 probe needs an endpoint sanity check, from Johan Hovold.

 6) Prevent looping in tcp_bpf_unhash() during sockmap/tls free, from
    John Fastabend.

 7) hns3 needs short frames padded on transmit, from Yunsheng Lin.

 8) Fix netfilter ICMP header corruption, from Eyal Birger.

 9) Fix soft lockup when low on memory in hns3, from Yonglong Liu.

10) Fix NTUPLE firmware command failures in bnxt_en, from Michael Chan.

11) Fix memory leak in act_ctinfo, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
  cxgb4: reject overlapped queues in TC-MQPRIO offload
  cxgb4: fix Tx multi channel port rate limit
  net: sched: act_ctinfo: fix memory leak
  bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal.
  bnxt_en: Fix ipv6 RFS filter matching logic.
  bnxt_en: Fix NTUPLE firmware command failures.
  net: systemport: Fixed queue mapping in internal ring map
  net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
  net: dsa: sja1105: Don't error out on disabled ports with no phy-mode
  net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset
  net: hns: fix soft lockup when there is not enough memory
  net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()
  net/sched: act_ife: initalize ife->metalist earlier
  netfilter: nat: fix ICMP header corruption on ICMP errors
  net: wan: lapbether.c: Use built-in RCU list checking
  netfilter: nf_tables: fix flowtable list del corruption
  netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks()
  netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
  netfilter: nft_tunnel: ERSPAN_VERSION must not be null
  netfilter: nft_tunnel: fix null-attribute check
  ...
2020-01-19 12:03:53 -08:00
Linus Torvalds
7ff15cd045 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "Three fixes: fix link failure on Alpha, fix a Sparse warning and
  annotate/robustify a lockless access in the NOHZ code"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/sched: Annotate lockless access to last_jiffies_update
  lib/vdso: Make __cvdso_clock_getres() static
  time/posix-stubs: Provide compat itimer supoprt for alpha
2020-01-18 13:00:59 -08:00
Linus Torvalds
9e79c52332 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu/SMT fix from Ingo Molnar:
 "Fix a build bug on CONFIG_HOTPLUG_SMT=y && !CONFIG_SYSFS kernels"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/SMT: Fix x86 link error without CONFIG_SYSFS
2020-01-18 12:57:41 -08:00
Linus Torvalds
b07b9e8d63 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Tooling fixes, three Intel uncore driver fixes, plus an AUX events fix
  uncovered by the perf fuzzer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Remove PCIe3 unit for SNR
  perf/x86/intel/uncore: Fix missing marker for snr_uncore_imc_freerunning_events
  perf/x86/intel/uncore: Add PCI ID of IMC for Xeon E3 V5 Family
  perf: Correctly handle failed perf_get_aux_event()
  perf hists: Fix variable name's inconsistency in hists__for_each() macro
  perf map: Set kmap->kmaps backpointer for main kernel map chunks
  perf report: Fix incorrectly added dimensions as switch perf data file
  tools lib traceevent: Fix memory leakage in filter_event
2020-01-18 12:55:19 -08:00
Linus Torvalds
124b5547ec Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Three fixes:

    - Fix an rwsem spin-on-owner crash, introduced in v5.4

    - Fix a lockdep bug when running out of stack_trace entries,
      introduced in v5.4

    - Docbook fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWN
  futex: Fix kernel-doc notation warning
  locking/lockdep: Fix buffer overrun problem in stack_trace[]
2020-01-18 12:53:28 -08:00
Linus Torvalds
ba0f472203 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq fixes from Ingo Molnar:
 "Two rseq bugfixes:

   - CLONE_VM !CLONE_THREAD didn't work properly, the kernel would end
     up corrupting the TLS of the parent. Technically a change in the
     ABI but the previous behavior couldn't resonably have been relied
     on by applications so this looks like a valid exception to the ABI
     rule.

   - Make the RSEQ_FLAG_UNREGISTER ABI behavior consistent with the
     handling of other flags. This is not thought to impact any
     applications either"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq: Unregister rseq for clone CLONE_VM
  rseq: Reject unknown flags on rseq unregister
2020-01-18 12:29:13 -08:00
Linus Torvalds
8cac89909a for-linus-2020-01-18
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXiL/qwAKCRCRxhvAZXjc
 oln5AP9ITypHs2iNWl1Cbte++y2iflWevDyPUrmagegqpKwbJAD9EypY0RVDor8T
 LXWK4WaNgB0K0MK/gSPRAlgx9ejNwA4=
 =6xXo
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fixes from Christian Brauner:
 "Here is an urgent fix for ptrace_may_access() permission checking.

  Commit 69f594a389 ("ptrace: do not audit capability check when
  outputing /proc/pid/stat") introduced the ability to opt out of audit
  messages for accesses to various proc files since they are not
  violations of policy.

  While doing so it switched the check from ns_capable() to
  has_ns_capability{_noaudit}(). That means it switched from checking
  the subjective credentials (ktask->cred) of the task to using the
  objective credentials (ktask->real_cred). This is appears to be wrong.
  ptrace_has_cap() is currently only used in ptrace_may_access() And is
  used to check whether the calling task (subject) has the
  CAP_SYS_PTRACE capability in the provided user namespace to operate on
  the target task (object). According to the cred.h comments this means
  the subjective credentials of the calling task need to be used.

  With this fix we switch ptrace_has_cap() to use security_capable() and
  thus back to using the subjective credentials.

  As one example where this might be particularly problematic, Jann
  pointed out that in combination with the upcoming IORING_OP_OPENAT{2}
  feature, this bug might allow unprivileged users to bypass the
  capability checks while asynchronously opening files like /proc/*/mem,
  because the capability checks for this would be performed against
  kernel credentials.

  To illustrate on the former point about this being exploitable: When
  io_uring creates a new context it records the subjective credentials
  of the caller. Later on, when it starts to do work it creates a kernel
  thread and registers a callback. The callback runs with kernel creds
  for ktask->real_cred and ktask->cred.

  To prevent this from becoming a full-blown 0-day io_uring will call
  override_cred() and override ktask->cred with the subjective
  credentials of the creator of the io_uring instance. With
  ptrace_has_cap() currently looking at ktask->real_cred this override
  will be ineffective and the caller will be able to open arbitray proc
  files as mentioned above.

  Luckily, this is currently not exploitable but would be so once
  IORING_OP_OPENAT{2} land in v5.6. Let's fix it now.

  To minimize potential regressions I successfully ran the criu
  testsuite. criu makes heavy use of ptrace() and extensively hits
  ptrace_may_access() codepaths and has a good change of detecting any
  regressions.

  Additionally, I succesfully ran the ptrace and seccomp kernel tests"

* tag 'for-linus-2020-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
2020-01-18 12:23:31 -08:00
Christian Brauner
6b3ad6649a
ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()
Commit 69f594a389 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
introduced the ability to opt out of audit messages for accesses to various
proc files since they are not violations of policy.  While doing so it
somehow switched the check from ns_capable() to
has_ns_capability{_noaudit}(). That means it switched from checking the
subjective credentials of the task to using the objective credentials. This
is wrong since. ptrace_has_cap() is currently only used in
ptrace_may_access() And is used to check whether the calling task (subject)
has the CAP_SYS_PTRACE capability in the provided user namespace to operate
on the target task (object). According to the cred.h comments this would
mean the subjective credentials of the calling task need to be used.
This switches ptrace_has_cap() to use security_capable(). Because we only
call ptrace_has_cap() in ptrace_may_access() and in there we already have a
stable reference to the calling task's creds under rcu_read_lock() there's
no need to go through another series of dereferences and rcu locking done
in ns_capable{_noaudit}().

As one example where this might be particularly problematic, Jann pointed
out that in combination with the upcoming IORING_OP_OPENAT feature, this
bug might allow unprivileged users to bypass the capability checks while
asynchronously opening files like /proc/*/mem, because the capability
checks for this would be performed against kernel credentials.

To illustrate on the former point about this being exploitable: When
io_uring creates a new context it records the subjective credentials of the
caller. Later on, when it starts to do work it creates a kernel thread and
registers a callback. The callback runs with kernel creds for
ktask->real_cred and ktask->cred. To prevent this from becoming a
full-blown 0-day io_uring will call override_cred() and override
ktask->cred with the subjective credentials of the creator of the io_uring
instance. With ptrace_has_cap() currently looking at ktask->real_cred this
override will be ineffective and the caller will be able to open arbitray
proc files as mentioned above.
Luckily, this is currently not exploitable but will turn into a 0-day once
IORING_OP_OPENAT{2} land in v5.6. Fix it now!

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Jann Horn <jannh@google.com>
Fixes: 69f594a389 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-18 13:51:39 +01:00
Thomas Gleixner
9f24c540f7 lib/vdso: Update coarse timekeeper unconditionally
The low resolution parts of the VDSO, i.e.:

  clock_gettime(CLOCK_*_COARSE), clock_getres(), time()

can be used even if there is no VDSO capable clocksource.

But if an architecture opts out of the VDSO data update then this
information becomes stale. This affects ARM when there is no architected
timer available. The lack of update causes userspace to use stale data
forever.

Make the update of the low resolution parts unconditional and only skip
the update of the high resolution parts if the architecture requests it.

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200114185946.765577901@linutronix.de
2020-01-17 15:53:50 +01:00
Thomas Gleixner
9a6b55ac4a lib/vdso: Make __arch_update_vdso_data() logic understandable
The function name suggests that this is a boolean checking whether the
architecture asks for an update of the VDSO data, but it works the other
way round. To spare further confusion invert the logic.

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200114185946.656652824@linutronix.de
2020-01-17 15:53:50 +01:00
Mark Rutland
da9ec3d3dd perf: Correctly handle failed perf_get_aux_event()
Vince reports a worrying issue:

| so I was tracking down some odd behavior in the perf_fuzzer which turns
| out to be because perf_even_open() sometimes returns 0 (indicating a file
| descriptor of 0) even though as far as I can tell stdin is still open.

... and further the cause:

| error is triggered if aux_sample_size has non-zero value.
|
| seems to be this line in kernel/events/core.c:
|
| if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader))
|                goto err_locked;
|
| (note, err is never set)

This seems to be a thinko in commit:

  ab43762ef0 ("perf: Allow normal events to output AUX data")

... and we should probably return -EINVAL here, as this should only
happen when the new event is mis-configured or does not have a
compatible aux_event group leader.

Fixes: ab43762ef0 ("perf: Allow normal events to output AUX data")
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
2020-01-17 11:32:44 +01:00
Thomas Gleixner
11e31f608b watchdog/softlockup: Enforce that timestamp is valid on boot
Robert reported that during boot the watchdog timestamp is set to 0 for one
second which is the indicator for a watchdog reset.

The reason for this is that the timestamp is in seconds and the time is
taken from sched clock and divided by ~1e9. sched clock starts at 0 which
means that for the first second during boot the watchdog timestamp is 0,
i.e. reset.

Use ULONG_MAX as the reset indicator value so the watchdog works correctly
right from the start. ULONG_MAX would only conflict with a real timestamp
if the system reaches an uptime of 136 years on 32bit and almost eternity
on 64bit.

Reported-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de
2020-01-17 11:19:22 +01:00
Waiman Long
f5bfdc8e39 locking/osq: Use optimized spinning loop for arm64
Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
using wfe for spinlock that can boost performance of sibling threads
by putting the current cpu to a wait state that is broken only when
the monitored variable changes or an external event happens.

OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after iret.

The vcpu_is_preempted() check, however, is a problem as changes to the
preempt state of of previous node will not affect the wait state. For
ARM64, vcpu_is_preempted is not currently defined and so is a no-op.
Will has indicated that he is planning to para-virtualize wfe instead
of defining vcpu_is_preempted for PV support. So just add a comment in
arch/arm64/include/asm/spinlock.h to indicate that vcpu_is_preempted()
should not be defined as suggested.

On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before patch were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After patch, the numbers were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So there was about 20% performance improvement.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200113150735.21956-1-longman@redhat.com
2020-01-17 10:19:30 +01:00
Waiman Long
57097124cb locking/qspinlock: Fix inaccessible URL of MCS lock paper
It turns out that the URL of the MCS lock paper listed in the source
code is no longer accessible. I did got question about where the paper
was. This patch updates the URL to BZ 206115 which contains a copy of
the paper from

  https://www.cs.rochester.edu/u/scott/papers/1991_TOCS_synch.pdf

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200107174914.4187-1-longman@redhat.com
2020-01-17 10:19:30 +01:00
Waiman Long
a030f9767d locking/lockdep: Fix lockdep_stats indentation problem
It was found that two lines in the output of /proc/lockdep_stats have
indentation problem:

  # cat /proc/lockdep_stats
     :
   in-process chains:                   25057
   stack-trace entries:                137827 [max: 524288]
   number of stack traces:        7973
   number of stack hash chains:   6355
   combined max dependencies:      1356414598
   hardirq-safe locks:                     57
   hardirq-unsafe locks:                 1286
     :

All the numbers displayed in /proc/lockdep_stats except the two stack
trace numbers are formatted with a field with of 11. To properly align
all the numbers, a field width of 11 is now added to the two stack
trace numbers.

Fixes: 8c779229d0 ("locking/lockdep: Report more stack trace statistics")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lkml.kernel.org/r/20191211213139.29934-1-longman@redhat.com
2020-01-17 10:19:30 +01:00
Waiman Long
39e7234f00 locking/rwsem: Fix kernel crash when spinning on RWSEM_OWNER_UNKNOWN
The commit 91d2a812df ("locking/rwsem: Make handoff writer
optimistically spin on owner") will allow a recently woken up waiting
writer to spin on the owner. Unfortunately, if the owner happens to be
RWSEM_OWNER_UNKNOWN, the code will incorrectly spin on it leading to a
kernel crash. This is fixed by passing the proper non-spinnable bits
to rwsem_spin_on_owner() so that RWSEM_OWNER_UNKNOWN will be treated
as a non-spinnable target.

Fixes: 91d2a812df ("locking/rwsem: Make handoff writer optimistically spin on owner")

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200115154336.8679-1-longman@redhat.com
2020-01-17 10:19:27 +01:00
Valentin Schneider
ccf74128d6 sched/topology: Assert non-NUMA topology masks don't (partially) overlap
topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.

In his example (8 CPUs, 2-node system), we end up with:
  MC span for CPU3 == 3-7
  MC span for CPU4 == 4-7

The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:

  3 -> 4 -> 5 -> 6 -> 7
  ^                  /
   `----------------'

And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

  3 -> 4 -> 5 -> 6 -> 7
       ^             /
	`-----------'

which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.

There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.

Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.

Testing
=======

Dietmar managed to reproduce this using the following qemu incantation:

  $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
  -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
  cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
  node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):

8<---
@@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;

+		if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+			continue;
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

8<---

[1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com

Reported-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
2020-01-17 10:19:23 +01:00
Hewenliang
3e0de271ff idle: fix spelling mistake "iterrupts" -> "interrupts"
There is a spelling misake in comments of cpuidle_idle_call. Fix it.

Signed-off-by: Hewenliang <hewenliang4@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20200110025604.34373-1-hewenliang4@huawei.com
2020-01-17 10:19:22 +01:00
Vincent Guittot
a4f9a0e51b sched/fair: Remove redundant call to cpufreq_update_util()
With commit

  bef69dd878 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()")

update_load_avg() has become the central point for calling cpufreq
(not including the update of blocked load). This change helps to
simplify further the number of calls to cpufreq_update_util() and to
remove last redundant ones. With update_load_avg(), we are now sure
that cpufreq_update_util() will be called after every task attachment
to a cfs_rq and especially after propagating this event down to the
util_avg of the root cfs_rq, which is the level that is used by
cpufreq governors like schedutil to set the frequency of a CPU.

The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when
the migration happens in a cgroup whereas util_avg of root cfs_rq is
not yet updated and this call is duplicated with the one that happens
immediately after when the migration event reaches the root cfs_rq.
The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be
removed. The interface of attach_entity_load_avg() can also be
simplified accordingly.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/1579083620-24943-1-git-send-email-vincent.guittot@linaro.org
2020-01-17 10:19:22 +01:00
Wang Long
3d817689a6 sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
when CONFIG_PSI_DEFAULT_DISABLED set to N or the command line set psi=0,
I think we should not create /proc/pressure and
/proc/pressure/{io|memory|cpu}.

In the future, user maybe determine whether the psi feature is enabled by
checking the existence of the /proc/pressure dir or
/proc/pressure/{io|memory|cpu} files.

Signed-off-by: Wang Long <w@laoqinren.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net
2020-01-17 10:19:22 +01:00
Peng Liu
4c58f57fa6 sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
commit bf475ce0a3 ("sched/fair: Add per-CPU min capacity to
sched_group_capacity") introduced per-cpu min_capacity.

commit e3d6d0cb66 ("sched/fair: Add sched_group per-CPU max capacity")
introduced per-cpu max_capacity.

In the SD_OVERLAP case, the local variable 'capacity' represents the sum
of CPU capacity of all CPUs in the first sched group (sg) of the sched
domain (sd).

It is erroneously used to calculate sg's min and max CPU capacity.
To fix this use capacity_of(cpu) instead of 'capacity'.

The code which achieves this via cpu_rq(cpu)->sd->groups->sgc->capacity
(for rq->sd != NULL) can be removed since it delivers the same value as
capacity_of(cpu) which is currently only used for the (!rq->sd) case
(see update_cpu_capacity()).
An sg of the lowest sd (rq->sd or sd->child == NULL) represents a single
CPU (and hence sg->sgc->capacity == capacity_of(cpu)).

Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200104130828.GA7718@iZj6chx1xj0e0buvshuecpZ
2020-01-17 10:19:21 +01:00
Peng Wang
fe71bbb21e sched/fair: calculate delta runnable load only when it's needed
Move the code of calculation for delta_sum/delta_avg to where
it is really needed to be done.

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200103114400.17668-1-rocking@linux.alibaba.com
2020-01-17 10:19:21 +01:00
Alex Shi
9dec1b6949 sched/cputime: move rq parameter in irqtime_account_process_tick
Every time we call irqtime_account_process_tick() is in a interrupt,
Every caller will get and assign a parameter rq = this_rq(), This is
unnecessary and increase the code size a little bit. Move the rq getting
action to irqtime_account_process_tick internally is better.

             base               with this patch
cputime.o    578792 bytes        577888 bytes

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1577959674-255537-1-git-send-email-alex.shi@linux.alibaba.com
2020-01-17 10:19:21 +01:00
Yangtao Li
35f4cd96f5 stop_machine: Make stop_cpus() static
The function stop_cpus() is only used internally by the
stop_machine for stop multiple cpus.

Make it static.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191228161912.24082-1-tiny.windzz@gmail.com
2020-01-17 10:19:21 +01:00
Wei Li
02d4ac5885 sched/debug: Reset watchdog on all CPUs while processing sysrq-t
Lengthy output of sysrq-t may take a lot of time on slow serial console
with lots of processes and CPUs.

So we need to reset NMI-watchdog to avoid spurious lockup messages, and
we also reset softlockup watchdogs on all other CPUs since another CPU
might be blocked waiting for us to process an IPI or stop_machine.

Add to sysrq_sched_debug_show() as what we did in show_state_filter().

Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com
2020-01-17 10:19:20 +01:00
Li Guanglei
dcd6dffb0a sched/core: Fix size of rq::uclamp initialization
rq::uclamp is an array of struct uclamp_rq, make sure we clear the
whole thing.

Fixes: 69842cba9a ("sched/uclamp: Add CPU's clamp buckets refcountinga")
Signed-off-by: Li Guanglei <guanglei.li@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
2020-01-17 10:19:20 +01:00
Qais Yousef
7226017ad3 sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
update to the most restrictive values.

Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.

Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to to be clamped to max, hence cause the
system to run at max frequency.

The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.

By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.

Fixes: 0b60ba2dd3 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
2020-01-17 10:19:20 +01:00
Viresh Kumar
323af6deaf sched/fair: Load balance aggressively for SCHED_IDLE CPUs
The fair scheduler performs periodic load balance on every CPU to check
if it can pull some tasks from other busy CPUs. The duration of this
periodic load balance is set to sd->balance_interval for the idle CPUs
and is calculated by multiplying the sd->balance_interval with the
sd->busy_factor (set to 32 by default) for the busy CPUs. The
multiplication is done for busy CPUs to avoid doing load balance too
often and rather spend more time executing actual task. While that is
the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
tasks.

With the recent enhancements in the fair scheduler around SCHED_IDLE
CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE
CPU instead of other busy or idle CPUs. The same reasoning should be
applied to the load balancer as well to make it migrate tasks more
aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling
latency of the migrated (SCHED_OTHER) tasks.

This patch makes minimal changes to the fair scheduler to do the next
load balance soon after the last non SCHED_IDLE task is dequeued from a
runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is
ignored while calculating the balance_interval for such CPUs. This is
done to avoid delaying the periodic load balance by few hundred
milliseconds for SCHED_IDLE CPUs.

This is tested on ARM64 Hikey620 platform (octa-core) with the help of
rt-app and it is verified, using kernel traces, that the newly
SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE
and pulls tasks from other busy CPUs.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org
2020-01-17 10:19:20 +01:00
Vincent Guittot
5f68eb19b5 sched/fair : Improve update_sd_pick_busiest for spare capacity case
Similarly to calculate_imbalance() and find_busiest_group(), using the
number of idle CPUs when there is only 1 CPU in the group is not efficient
because we can't make a difference between a CPU running 1 task and a CPU
running dozens of small tasks competing for the same CPU but not enough
to overload it. More generally speaking, we should use the number of
running tasks when there is the same number of idle CPUs in a group instead
of blindly select the 1st one.

When the groups have spare capacity and the same number of idle CPUs, we
compare the number of running tasks to select the busiest group.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1576839893-26930-1-git-send-email-vincent.guittot@linaro.org
2020-01-17 10:19:19 +01:00
Jisheng Zhang
db5793c599 watchdog: Remove soft_lockup_hrtimer_cnt and related code
After commit 9cf57731b6 ("watchdog/softlockup: Replace "watchdog/%u"
threads with cpu_stop_work"), the percpu soft_lockup_hrtimer_cnt is
not used any more, so remove it and related code.

Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191218131720.4146aea2@xhacker.debian
2020-01-17 10:19:19 +01:00
Steven Rostedt (VMware)
31537cf8f3 tracing: Initialize ret in syscall_enter_define_fields()
If syscall_enter_define_fields() is called on a system call with no
arguments, the return code variable "ret" will never get initialized.
Initialize it to zero.

Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/0FA8C6E3-D9F5-416D-A1B0-5E4CD583A101@lca.pw
2020-01-17 10:19:18 +01:00
YueHaibing
81f2b572cf bpf: Remove set but not used variable 'first_key'
kernel/bpf/syscall.c: In function generic_map_lookup_batch:
kernel/bpf/syscall.c:1339:7: warning: variable first_key set but not used [-Wunused-but-set-variable]

It is never used, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Brian Vazquez <brianvv@google.com>
Link: https://lore.kernel.org/bpf/20200116145300.59056-1-yuehaibing@huawei.com
2020-01-16 20:15:24 -08:00
Jesper Dangaard Brouer
58aa94f922 devmap: Adjust tracepoint for map-less queue flush
Now that we don't have a reference to a devmap when flushing the device
bulk queue, let's change the the devmap_xmit tracepoint to remote the
map_id and map_index fields entirely. Rearrange the fields so 'drops' and
'sent' stay in the same position in the tracepoint struct, to make it
possible for the xdp_monitor utility to read both the old and the new
format.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/157918768613.1458396.9165902403373826572.stgit@toke.dk
2020-01-16 20:03:34 -08:00
Toke Høiland-Jørgensen
1d233886dd xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
we can re-use the bulking for the non-map version of the bpf_redirect()
helper. This is a simple matter of having xdp_do_redirect_slow() queue the
frame on the bulk queue instead of sending it out with __bpf_tx_xdp().

Unfortunately we can't make the bpf_redirect() helper return an error if
the ifindex doesn't exit (as bpf_redirect_map() does), because we don't
have a reference to the network namespace of the ingress device at the time
the helper is called. So we have to leave it as-is and keep the device
lookup in xdp_do_redirect_slow().

Since this leaves less reason to have the non-map redirect code in a
separate function, so we get rid of the xdp_do_redirect_slow() function
entirely. This does lose us the tracepoint disambiguation, but fortunately
the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
entry structures. This means both can contain a map index, so we can just
amend the tracepoint definitions so we always emit the xdp_redirect(_err)
tracepoints, but with the map ID only populated if a map is present. This
means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
the definitions around in case someone is still listening for them.

With this change, the performance of the xdp_redirect sample program goes
from 5Mpps to 8.4Mpps (a 68% increase).

Since the flush functions are no longer map-specific, rename the flush()
functions to drop _map from their names. One of the renamed functions is
the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
keep from having to update all drivers, use a #define to keep the old name
working, and only update the virtual drivers in this patch.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16 20:03:34 -08:00
Toke Høiland-Jørgensen
75ccae62cb xdp: Move devmap bulk queue into struct net_device
Commit 96360004b8 ("xdp: Make devmap flush_list common for all map
instances"), changed devmap flushing to be a global operation instead of a
per-map operation. However, the queue structure used for bulking was still
allocated as part of the containing map.

This patch moves the devmap bulk queue into struct net_device. The
motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
which will be changed in a subsequent commit.  To avoid other fields of
struct net_device moving to different cache lines, we also move a couple of
other members around.

We defer the actual allocation of the bulk queue structure until the
NETDEV_REGISTER notification devmap.c. This makes it possible to check for
ndo_xdp_xmit support before allocating the structure, which is not possible
at the time struct net_device is allocated. However, we keep the freeing in
free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.

Because of this change, we lose the reference back to the map that
originated the redirect, so change the tracepoint to always return 0 as the
map ID and index. Otherwise no functional change is intended with this
patch.

After this patch, the relevant part of struct net_device looks like this,
according to pahole:

	/* --- cacheline 14 boundary (896 bytes) --- */
	struct netdev_queue *      _tx __attribute__((__aligned__(64))); /*   896     8 */
	unsigned int               num_tx_queues;        /*   904     4 */
	unsigned int               real_num_tx_queues;   /*   908     4 */
	struct Qdisc *             qdisc;                /*   912     8 */
	unsigned int               tx_queue_len;         /*   920     4 */
	spinlock_t                 tx_global_lock;       /*   924     4 */
	struct xdp_dev_bulk_queue * xdp_bulkq;           /*   928     8 */
	struct xps_dev_maps *      xps_cpus_map;         /*   936     8 */
	struct xps_dev_maps *      xps_rxqs_map;         /*   944     8 */
	struct mini_Qdisc *        miniq_egress;         /*   952     8 */
	/* --- cacheline 15 boundary (960 bytes) --- */
	struct hlist_head  qdisc_hash[16];               /*   960   128 */
	/* --- cacheline 17 boundary (1088 bytes) --- */
	struct timer_list  watchdog_timer;               /*  1088    40 */

	/* XXX last struct has 4 bytes of padding */

	int                        watchdog_timeo;       /*  1128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head   todo_list;                    /*  1136    16 */
	/* --- cacheline 18 boundary (1152 bytes) --- */

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk
2020-01-16 20:03:34 -08:00
Alexander Potapenko
18451f9f9e PM: hibernate: fix crashes with init_on_free=1
Upon resuming from hibernation, free pages may contain stale data from
the kernel that initiated the resume. This breaks the invariant
inflicted by init_on_free=1 that freed pages must be zeroed.

To deal with this problem, make clear_free_pages() also clear the free
pages when init_on_free is enabled.

Fixes: 6471384af2 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Reported-by: Johannes Stezenbach <js@sig21.net>
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: 5.3+ <stable@vger.kernel.org> # 5.3+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-01-16 23:51:45 +01:00
Jonas Meurer
c052bf82c6 PM: suspend: Add sysfs attribute to control the "sync on suspend" behavior
The sysfs attribute `/sys/power/sync_on_suspend` controls, whether or not
filesystems are synced by the kernel before system suspend.

Congruously, the behaviour of build-time switch CONFIG_SUSPEND_SKIP_SYNC
is slightly changed: It now defines the run-tim default for the new sysfs
attribute `/sys/power/sync_on_suspend`.

The run-time attribute is added because the existing corresponding
build-time Kconfig flag for (`CONFIG_SUSPEND_SKIP_SYNC`) is not flexible
enough. E.g. Linux distributions that provide pre-compiled kernels
usually want to stick with the default (sync filesystems before suspend)
but under special conditions this needs to be changed.

One example for such a special condition is user-space handling of
suspending block devices (e.g. using `cryptsetup luksSuspend` or `dmsetup
suspend`) before system suspend. The Kernel trying to sync filesystems
after the underlying block device already got suspended obviously leads
to dead-locks. Be aware that you have to take care of the filesystem sync
yourself before suspending the system in those scenarios.

Signed-off-by: Jonas Meurer <jonas@freesources.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-01-16 21:47:03 +01:00
Petr Mladek
3a51449b79 watchdog/softlockup: Remove obsolete check of last reported task
commit 9cf57731b6 ("watchdog/softlockup: Replace "watchdog/%u" threads
 with cpu_stop_work") ensures that the watchdog is reliably touched during
a task switch.

As a result the check for an unnoticed task switch is not longer needed.

Remove the relevant code, which effectively reverts commit b1a8de1f53
("softlockup: make detector be aware of task switch of processes hogging
cpu")

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20191024114928.15377-2-pmladek@suse.com
2020-01-16 14:52:48 +01:00
Steven Rostedt (VMware)
82d1b8158c tracing: Allow trace_printk() to nest in other tracing code
trace_printk() is used to debug the kernel which includes the tracing
infrastructure. But because it writes to the ring buffer, and so does much
of the tracing infrastructure, the ring buffer's recursive detection will
drop writes to the ring buffer that is in the same context as the current
write is happening (it allows interrupts to write when normal context is
writing, but wont let normal context write while normal context is writing).

This can cause confusion and think that the code is where the trace_printk()
exists is not hit. To solve this, up the recursive nesting of the ring
buffer when trace_printk() is called before it writes to the buffer itself.

Note, this does make it dangerous to use trace_printk() in the ring buffer
code itself, because this basically disables the recursion protection of
trace_printk() buffer writes. But as trace_printk() is only used for
debugging, and if this does occur, the developer will see the cause real
quick (recursive blowing up of the stack). Thus the developer can deal with
that. But having trace_printk() silently ignored is a much bigger problem,
and disabling recursive protection is a small price to pay to fix it.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-16 08:20:18 -05:00
Jisheng Zhang
d129479f1f watchdog: Remove soft_lockup_hrtimer_cnt and related code
After commit 9cf57731b6 ("watchdog/softlockup: Replace "watchdog/%u"
threads with cpu_stop_work"), the percpu soft_lockup_hrtimer_cnt is
not used any more, so remove it and related code.

Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191218131720.4146aea2@xhacker.debian
2020-01-16 12:25:51 +01:00
David S. Miller
3981f955eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-01-15

The following pull-request contains BPF updates for your *net* tree.

We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 13 files changed, 95 insertions(+), 43 deletions(-).

The main changes are:

1) Fix refcount leak for TCP time wait and request sockets for socket lookup
   related BPF helpers, from Lorenz Bauer.

2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.

3) Batch of several sockmap and related TLS fixes found while operating
   more complex BPF programs with Cilium and OpenSSL, from John Fastabend.

4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
   to avoid purging data upon teardown, from Lingpeng Chen.

5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
   dump a BPF map's value with BTF, from Martin KaFai Lau.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-16 10:04:40 +01:00
Yonghong Song
057996380a bpf: Add batch ops to all htab bpf map
htab can't use generic batch support due some problematic behaviours
inherent to the data structre, i.e. while iterating the bpf map  a
concurrent program might delete the next entry that batch was about to
use, in that case there's no easy solution to retrieve the next entry,
the issue has been discussed multiple times (see [1] and [2]).

The only way hmap can be traversed without the problem previously
exposed is by making sure that the map is traversing entire buckets.
This commit implements those strict requirements for hmap, the
implementation follows the same interaction that generic support with
some exceptions:

 - If keys/values buffer are not big enough to traverse a bucket,
   ENOSPC will be returned.
 - out_batch contains the value of the next bucket in the iteration, not
   the next key, but this is transparent for the user since the user
   should never use out_batch for other than bpf batch syscalls.

This commits implements BPF_MAP_LOOKUP_BATCH and adds support for new
command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
batch ops it is possible to use the generic implementations.

[1] https://lore.kernel.org/bpf/20190724165803.87470-1-brianvv@google.com/
[2] https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115184308.162644-6-brianvv@google.com
2020-01-15 14:00:35 -08:00
Brian Vazquez
c60f2d2861 bpf: Add lookup and update batch ops to arraymap
This adds the generic batch ops functionality to bpf arraymap, note that
since deletion is not a valid operation for arraymap, only batch and
lookup are added.

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200115184308.162644-5-brianvv@google.com
2020-01-15 14:00:35 -08:00
Brian Vazquez
aa2e93b8e5 bpf: Add generic support for update and delete batch ops
This commit adds generic support for update and delete batch ops that
can be used for almost all the bpf maps. These commands share the same
UAPI attr that lookup and lookup_and_delete batch ops use and the
syscall commands are:

  BPF_MAP_UPDATE_BATCH
  BPF_MAP_DELETE_BATCH

The main difference between update/delete and lookup batch ops is that
for update/delete keys/values must be specified for userspace and
because of that, neither in_batch nor out_batch are used.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115184308.162644-4-brianvv@google.com
2020-01-15 14:00:35 -08:00
Brian Vazquez
cb4d03ab49 bpf: Add generic support for lookup batch op
This commit introduces generic support for the bpf_map_lookup_batch.
This implementation can be used by almost all the bpf maps since its core
implementation is relying on the existing map_get_next_key and
map_lookup_elem. The bpf syscall subcommand introduced is:

  BPF_MAP_LOOKUP_BATCH

The UAPI attribute is:

  struct { /* struct used by BPF_MAP_*_BATCH commands */
         __aligned_u64   in_batch;       /* start batch,
                                          * NULL to start from beginning
                                          */
         __aligned_u64   out_batch;      /* output: next start batch */
         __aligned_u64   keys;
         __aligned_u64   values;
         __u32           count;          /* input/output:
                                          * input: # of key/value
                                          * elements
                                          * output: # of filled elements
                                          */
         __u32           map_fd;
         __u64           elem_flags;
         __u64           flags;
  } batch;

in_batch/out_batch are opaque values use to communicate between
user/kernel space, in_batch/out_batch must be of key_size length.

To start iterating from the beginning in_batch must be null,
count is the # of key/value elements to retrieve. Note that the 'keys'
buffer must be a buffer of key_size * count size and the 'values' buffer
must be value_size * count, where value_size must be aligned to 8 bytes
by userspace if it's dealing with percpu maps. 'count' will contain the
number of keys/values successfully retrieved. Note that 'count' is an
input/output variable and it can contain a lower value after a call.

If there's no more entries to retrieve, ENOENT will be returned. If error
is ENOENT, count might be > 0 in case it copied some values but there were
no more entries to retrieve.

Note that if the return code is an error and not -EFAULT,
count indicates the number of elements successfully processed.

Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115184308.162644-3-brianvv@google.com
2020-01-15 14:00:35 -08:00
Brian Vazquez
15c14a3dca bpf: Add bpf_map_{value_size, update_value, map_copy_value} functions
This commit moves reusable code from map_lookup_elem and map_update_elem
to avoid code duplication in kernel/bpf/syscall.c.

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200115184308.162644-2-brianvv@google.com
2020-01-15 14:00:34 -08:00
Daniel Borkmann
0af2ffc93a bpf: Fix incorrect verifier simulation of ARSH under ALU32
Anatoly has been fuzzing with kBdysch harness and reported a hang in one
of the outcomes:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (85) call bpf_get_socket_cookie#46
  1: R0_w=invP(id=0) R10=fp0
  1: (57) r0 &= 808464432
  2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
  2: (14) w0 -= 810299440
  3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
  3: (c4) w0 s>>= 1
  4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
  4: (76) if w0 s>= 0x30303030 goto pc+216
  221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
  221: (95) exit
  processed 6 insns (limit 1000000) [...]

Taking a closer look, the program was xlated as follows:

  # ./bpftool p d x i 12
  0: (85) call bpf_get_socket_cookie#7800896
  1: (bf) r6 = r0
  2: (57) r6 &= 808464432
  3: (14) w6 -= 810299440
  4: (c4) w6 s>>= 1
  5: (76) if w6 s>= 0x30303030 goto pc+216
  6: (05) goto pc-1
  7: (05) goto pc-1
  8: (05) goto pc-1
  [...]
  220: (05) goto pc-1
  221: (05) goto pc-1
  222: (95) exit

Meaning, the visible effect is very similar to f54c7898ed ("bpf: Fix
precision tracking for unbounded scalars"), that is, the fall-through
branch in the instruction 5 is considered to be never taken given the
conclusion from the min/max bounds tracking in w6, and therefore the
dead-code sanitation rewrites it as goto pc-1. However, real-life input
disagrees with verification analysis since a soft-lockup was observed.

The bug sits in the analysis of the ARSH. The definition is that we shift
the target register value right by K bits through shifting in copies of
its sign bit. In adjust_scalar_min_max_vals(), we do first coerce the
register into 32 bit mode, same happens after simulating the operation.
However, for the case of simulating the actual ARSH, we don't take the
mode into account and act as if it's always 64 bit, but location of sign
bit is different:

  dst_reg->smin_value >>= umin_val;
  dst_reg->smax_value >>= umin_val;
  dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);

Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
for example return 0xffff. With the above ARSH simulation, we'd see the
following results:

  [...]
  1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
  1: (85) call bpf_get_socket_cookie#46
  2: R0_w=invP(id=0) R10=fp0
  2: (57) r0 &= 808464432
    -> R0_runtime = 0x3030
  3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
  3: (14) w0 -= 810299440
    -> R0_runtime = 0xcfb40000
  4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
                              (0xffffffff)
  4: (c4) w0 s>>= 1
    -> R0_runtime = 0xe7da0000
  5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
                              (0x67c00000)           (0x7ffbfff8)
  [...]

In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
0100 0000 0000 0000 0000', the result after the shift has 0xe7da0000 that
is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
retained in 32 bit mode. In insn4, the umax was 0xffffffff, and changed into
0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000'
and means here that the simulation didn't retain the sign bit. With above
logic, the updates happen on the 64 bit min/max bounds and given we coerced
the register, the sign bits of the bounds are cleared as well, meaning, we
need to force the simulation into s32 space for 32 bit alu mode.

Verification after the fix below. We're first analyzing the fall-through branch
on 32 bit signed >= test eventually leading to rejection of the program in this
specific case:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r2 = 808464432
  1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
  1: (85) call bpf_get_socket_cookie#46
  2: R0_w=invP(id=0) R10=fp0
  2: (bf) r6 = r0
  3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
  3: (57) r6 &= 808464432
  4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
  4: (14) w6 -= 810299440
  5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
  5: (c4) w6 s>>= 1
  6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
                                              (0x67c00000)          (0xfffbfff8)
  6: (76) if w6 s>= 0x30303030 goto pc+216
  7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
  7: (30) r0 = *(u8 *)skb[808464432]
  BPF_LD_[ABS|IND] uses reserved fields
  processed 8 insns (limit 1000000) [...]

Fixes: 9cbe1f5a32 ("bpf/verifier: improve register value range tracking with ARSH")
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115204733.16648-1-daniel@iogearbox.net
2020-01-15 13:39:59 -08:00
Chunyan Zhang
5167c506d6 tick/common: Touch watchdog in tick_unfreeze() on all CPUs
Suspend to IDLE invokes tick_unfreeze() on resume. tick_unfreeze() on the
first resuming CPU resumes timekeeping, which also has the side effect of
resetting the softlockup watchdog on this CPU.

But on the secondary CPUs the watchdog is not reset in the resume /
unfreeze() path, which can result in false softlockup warnings on those
CPUs depending on the time spent in suspend.

Prevent this by clearing the softlock watchdog in the unfreeze path also
on the secondary resuming CPUs.

[ tglx: Massaged changelog ]

Signed-off-by: Chunyan Zhang <chunyan.zhang@unisoc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200110083902.27276-1-chunyan.zhang@unisoc.com
2020-01-15 21:29:45 +01:00
Yonghong Song
8482941f09 bpf: Add bpf_send_signal_thread() helper
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
added helper bpf_send_signal() which permits bpf program to
send a signal to the current process. The signal may be
delivered to any threads in the process.

We found a use case where sending the signal to the current
thread is more preferable.
  - A bpf program will collect the stack trace and then
    send signal to the user application.
  - The user application will add some thread specific
    information to the just collected stack trace for
    later analysis.

If bpf_send_signal() is used, user application will need
to check whether the thread receiving the signal matches
the thread collecting the stack by checking thread id.
If not, it will need to send signal to another thread
through pthread_kill().

This patch proposed a new helper bpf_send_signal_thread(),
which sends the signal to the thread corresponding to
the current kernel task. This way, user space is guaranteed that
bpf_program execution context and user space signal handling
context are the same thread.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200115035002.602336-1-yhs@fb.com
2020-01-15 11:44:51 -08:00
Michal Koutný
3bc0bb36fa cgroup: Prevent double killing of css when enabling threaded cgroup
The test_cgcore_no_internal_process_constraint_on_threads selftest when
running with subsystem controlling noise triggers two warnings:

> [  597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0
> [  597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160

Both stem from a call to cgroup_type_write. The first warning was also
triggered by syzkaller.

When we're switching cgroup to threaded mode shortly after a subsystem
was disabled on it, we can see the respective subsystem css dying there.

The warning in cgroup_apply_control_enable is harmless in this case
since we're not adding new subsys anyway.
The warning in cgroup_apply_control_disable indicates an attempt to kill
css of recently disabled subsystem repeatedly.

The commit prevents these situations by making cgroup_type_write wait
for all dying csses to go away before re-applying subtree controls.
When at it, the locations of WARN_ON_ONCE calls are moved so that
warning is triggered only when we are about to misuse the dying css.

Reported-by: syzbot+5493b2a54d31d6aea629@syzkaller.appspotmail.com
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-01-15 08:04:29 -08:00
Daniel Jordan
1c5da0ec7f workqueue: add worker function to workqueue_execute_end tracepoint
It's surprising that workqueue_execute_end includes only the work when
its counterpart workqueue_execute_start has both the work and the worker
function.

You can't set a tracing filter or trigger based on the function, and
postprocessing scripts interested in specific functions are harder to
write since they have to remember the work from _start and match it up
with the same field in _end.

Add the function name, taking care to use the copy stashed in the
worker since the work is no longer safe to touch.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-01-15 08:02:47 -08:00
Chen Zhou
75ea91cd3e cgroup: fix function name in comment
Function name cgroup_rstat_cpu_pop_upated() in comment should be
cgroup_rstat_cpu_pop_updated().

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-01-15 07:58:13 -08:00
Jessica Yu
e9f35f634e modsign: print module name along with error message
It is useful to know which module failed signature verification, so
print the module name along with the error message.

Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-01-15 15:49:31 +01:00
Stephen Boyd
6b6d188aae alarmtimer: Unregister wakeup source when module get fails
The alarmtimer_rtc_add_device() function creates a wakeup source and then
tries to grab a module reference. If that fails the function returns early
with an error code, but fails to remove the wakeup source.

Cleanup this exit path so there is no dangling wakeup source, which is
named 'alarmtime' left allocated which will conflict with another RTC
device that may be registered later.

Fixes: 51218298a2 ("alarmtimer: Ensure RTC module is not unloaded")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200109155910.907-2-swboyd@chromium.org
2020-01-15 11:16:54 +01:00
Eric Dumazet
de95a991bb tick/sched: Annotate lockless access to last_jiffies_update
syzbot (KCSAN) reported a data-race in tick_do_update_jiffies64():

BUG: KCSAN: data-race in tick_do_update_jiffies64 / tick_do_update_jiffies64

write to 0xffffffff8603d008 of 8 bytes by interrupt on cpu 1:
 tick_do_update_jiffies64+0x100/0x250 kernel/time/tick-sched.c:73
 tick_sched_do_timer+0xd4/0xe0 kernel/time/tick-sched.c:138
 tick_sched_timer+0x43/0xe0 kernel/time/tick-sched.c:1292
 __run_hrtimer kernel/time/hrtimer.c:1514 [inline]
 __hrtimer_run_queues+0x274/0x5f0 kernel/time/hrtimer.c:1576
 hrtimer_interrupt+0x22a/0x480 kernel/time/hrtimer.c:1638
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1110 [inline]
 smp_apic_timer_interrupt+0xdc/0x280 arch/x86/kernel/apic/apic.c:1135
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
 arch_local_irq_restore arch/x86/include/asm/paravirt.h:756 [inline]
 kcsan_setup_watchpoint+0x1d4/0x460 kernel/kcsan/core.c:436
 check_access kernel/kcsan/core.c:466 [inline]
 __tsan_read1 kernel/kcsan/core.c:593 [inline]
 __tsan_read1+0xc2/0x100 kernel/kcsan/core.c:593
 kallsyms_expand_symbol.constprop.0+0x70/0x160 kernel/kallsyms.c:79
 kallsyms_lookup_name+0x7f/0x120 kernel/kallsyms.c:170
 insert_report_filterlist kernel/kcsan/debugfs.c:155 [inline]
 debugfs_write+0x14b/0x2d0 kernel/kcsan/debugfs.c:256
 full_proxy_write+0xbd/0x100 fs/debugfs/file.c:225
 __vfs_write+0x67/0xc0 fs/read_write.c:494
 vfs_write fs/read_write.c:558 [inline]
 vfs_write+0x18a/0x390 fs/read_write.c:542
 ksys_write+0xd5/0x1b0 fs/read_write.c:611
 __do_sys_write fs/read_write.c:623 [inline]
 __se_sys_write fs/read_write.c:620 [inline]
 __x64_sys_write+0x4c/0x60 fs/read_write.c:620
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

read to 0xffffffff8603d008 of 8 bytes by task 0 on cpu 0:
 tick_do_update_jiffies64+0x2b/0x250 kernel/time/tick-sched.c:62
 tick_nohz_update_jiffies kernel/time/tick-sched.c:505 [inline]
 tick_nohz_irq_enter kernel/time/tick-sched.c:1257 [inline]
 tick_irq_enter+0x139/0x1c0 kernel/time/tick-sched.c:1274
 irq_enter+0x4f/0x60 kernel/softirq.c:354
 entering_irq arch/x86/include/asm/apic.h:517 [inline]
 entering_ack_irq arch/x86/include/asm/apic.h:523 [inline]
 smp_apic_timer_interrupt+0x55/0x280 arch/x86/kernel/apic/apic.c:1133
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
 native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:60
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x1af/0x280 kernel/sched/idle.c:263
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
 rest_init+0xec/0xf6 init/main.c:452
 arch_call_rest_init+0x17/0x37
 start_kernel+0x838/0x85e init/main.c:786
 x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
 x86_64_start_kernel+0x72/0x76 arch/x86/kernel/head64.c:471
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Use READ_ONCE() and WRITE_ONCE() to annotate this expected race.

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191205045619.204946-1-edumazet@google.com
2020-01-15 10:54:12 +01:00
Masami Hiramatsu
aeed8aa387 tracing: trigger: Replace unneeded RCU-list traversals
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.

-----
  # dmesg -c > /dev/null
  # ./ftracetest test.d/trigger
  ...
  # dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
  kernel/trace/trace_events_hist.c:6070
  kernel/trace/trace_events_hist.c:1760
  kernel/trace/trace_events_hist.c:5911
  kernel/trace/trace_events_trigger.c:504
  kernel/trace/trace_events_hist.c:1810
  kernel/trace/trace_events_hist.c:3158
  kernel/trace/trace_events_hist.c:3105
  kernel/trace/trace_events_hist.c:5518
  kernel/trace/trace_events_hist.c:5998
  kernel/trace/trace_events_hist.c:6019
  kernel/trace/trace_events_hist.c:6044
  kernel/trace/trace_events_trigger.c:1500
  kernel/trace/trace_events_trigger.c:1540
  kernel/trace/trace_events_trigger.c:539
  kernel/trace/trace_events_trigger.c:584
-----

I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.

I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.

Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.

Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: 30350d65ac ("tracing: Add variable support to hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14 17:12:04 -05:00
Steven Rostedt (VMware)
cfc585a401 ring-buffer: Fix kernel doc for rb_update_event()
rb_update_event has changed without the kernel-doc update.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14 16:27:51 -05:00
Fabian Frederick
59e7cffe5c ring-bufer: kernel-doc warning fixes
Also fixes a couple of typos

Link: http://lkml.kernel.org/r/1401992525-10417-1-git-send-email-fabf@skynet.be

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
[ Found this deep in the abyss of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14 16:23:34 -05:00
Masami Hiramatsu
99c9a923e9 tracing/uprobe: Fix double perf_event linking on multiprobe uprobe
Fix double perf_event linking to trace_uprobe_filter on
multiple uprobe event by moving trace_uprobe_filter under
trace_probe_event.

In uprobe perf event, trace_uprobe_filter data structure is
managing target mm filters (in perf_event) related to each
uprobe event.

Since commit 60d53e2c3b ("tracing/probe: Split trace_event
related data from trace_probe") left the trace_uprobe_filter
data structure in trace_uprobe, if a trace_probe_event has
multiple trace_uprobe (multi-probe event), a perf_event is
added to different trace_uprobe_filter on each trace_uprobe.
This leads a linked list corruption.

To fix this issue, move trace_uprobe_filter to trace_probe_event
and link it once on each event instead of each probe.

Link: http://lkml.kernel.org/r/157862073931.1800.3800576241181489174.stgit@devnote2

Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: =?utf-8?q?Toke_H=C3=B8iland-J?= =?utf-8?b?w7hyZ2Vuc2Vu?= <thoiland@redhat.com>
Cc: Jean-Tsung Hsiao <jhsiao@redhat.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Link: https://lkml.kernel.org/r/20200108171611.GA8472@kernel.org
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-14 15:57:59 -05:00
Linus Torvalds
e033e7d4a8 Merge branch 'dhowells' (patches from DavidH)
Merge misc fixes from David Howells.

Two afs fixes and a key refcounting fix.

* dhowells:
  afs: Fix afs_lookup() to not clobber the version on a new dentry
  afs: Fix use-after-loss-of-ref
  keys: Fix request_key() cache
2020-01-14 09:56:31 -08:00
Martin KaFai Lau
3b4130418f bpf: Fix seq_show for BPF_MAP_TYPE_STRUCT_OPS
Instead of using bpf_struct_ops_map_lookup_elem() which is
not implemented, bpf_struct_ops_map_seq_show_elem() should
also use bpf_struct_ops_map_sys_lookup_elem() which does
an inplace update to the value.  The change allocates
a value to pass to bpf_struct_ops_map_sys_lookup_elem().

[root@arch-fb-vm1 bpf]# cat /sys/fs/bpf/dctcp
{{{1}},BPF_STRUCT_OPS_STATE_INUSE,{{00000000df93eebc,00000000df93eebc},0,2, ...

Fixes: 85d33df357 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200114072647.3188298-1-kafai@fb.com
2020-01-14 09:54:31 -08:00
David Howells
8379bb84be keys: Fix request_key() cache
When the key cached by request_key() and co.  is cleaned up on exit(),
the code looks in the wrong task_struct, and so clears the wrong cache.
This leads to anomalies in key refcounting when doing, say, a kernel
build on an afs volume, that then trigger kasan to report a
use-after-free when the key is viewed in /proc/keys.

Fix this by making exit_creds() look in the passed-in task_struct rather
than in current (the task_struct cleanup code is deferred by RCU and
potentially run in another task).

Fixes: 7743c48e54 ("keys: Cache result of request_key*() temporarily in task_struct")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-14 09:40:06 -08:00
Jason Gunthorpe
984cfe4e25 mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions
The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
and is difficult to abbreviate. The struct is actually holding the
interval tree and hlist containing the notifiers subscribed to a mm.

Use 'subscriptions' as the variable name for this struct instead of the
really terrible and misleading 'mmn_mm'.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-01-14 11:54:47 -04:00
Andrei Vagin
04a8682a71 fs/proc: Introduce /proc/pid/timens_offsets
API to set time namespace offsets for children processes, i.e.:
echo "$clockid $offset_sec $offset_nsec" > /proc/self/timens_offsets

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-28-dima@arista.com
2020-01-14 12:20:59 +01:00
Dmitry Safonov
70ddf65184 x86/vdso: Zap vvar pages when switching to a time namespace
The VVAR page layout depends on whether a task belongs to the root or
non-root time namespace. Whenever a task changes its namespace, the VVAR
page tables are cleared and then they will be re-faulted with a
corresponding layout.

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-27-dima@arista.com
2020-01-14 12:20:59 +01:00
Dmitry Safonov
afaa7b5ac7 time: Allocate per-timens vvar page
VDSO support for Time namespace needs to set up a page with the same
layout as VVAR. That timens page will be placed on position of VVAR page
inside namespace. That page contains time namespace clock offsets and it
has vdso_data->seq set to 1 to enforce the slow path and
vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time namespace
handling path.

Allocate the timens page during namespace creation. Setup the offsets
when the first task enters the ns and freeze them to guarantee the pace
of monotonic/boottime clocks and to avoid breakage of applications.

The design decision is to have a global offset_lock which is used during
namespace offsets setup and to freeze offsets when the first task joins the
new time namespace. That is better in terms of memory usage compared to
having a per namespace mutex that's used only during the setup period.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Based-on-work-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-24-dima@arista.com
2020-01-14 12:20:58 +01:00
Andrei Vagin
1f9b37bfbb posix-timers: Make clock_nanosleep() time namespace aware
clock_nanosleep() accepts absolute values of expiration time, if the
TIMER_ABSTIME flag is set. This value is in the tasks time namespace,
which has to be converted to the host time namespace.

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-18-dima@arista.com
2020-01-14 12:20:55 +01:00
Andrei Vagin
ea2d1f7fce hrtimers: Prepare hrtimer_nanosleep() for time namespaces
clock_nanosleep() accepts absolute values of expiration time when
TIMER_ABSTIME flag is set. This absolute value is inside the task's
time namespace, and has to be converted to the host's time.

There is timens_ktime_to_host() helper for converting time, but
it accepts ktime argument.

As a preparation, make hrtimer_nanosleep() accept a clock value in ktime
instead of timespec64.

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-17-dima@arista.com
2020-01-14 12:20:55 +01:00
Andrei Vagin
0b9b9a3b16 alarmtimer: Make nanosleep() time namespace aware
clock_nanosleep() accepts absolute values of expiration time when the
TIMER_ABSTIME flag is set. This absolute value is inside the task's
time namespace and has to be converted to the host's time.

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-16-dima@arista.com
2020-01-14 12:20:55 +01:00
Andrei Vagin
7da8b3a44b posix-timers: Make timer_settime() time namespace aware
Wire timer_settime() syscall into time namespace virtualization.

sys_timer_settime() calls the ktime->timer_set() callback. Right now,
common_timer_set() is the only implementation for the callback.

The user-supplied expiry value is converted from timespec64 to ktime and
then timens_ktime_to_host() can be used to convert namespace's time to the
host time.

Inside a time namespace kernel's time differs by a fixed offset from a
user-supplied time, but only absolute values (TIMER_ABSTIME) must be
converted.

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-15-dima@arista.com
2020-01-14 12:20:54 +01:00
Andrei Vagin
89dd8eecfe time: Add do_timens_ktime_to_host() helper
The helper subtracts namespace's clock offset from the given time
and ensures that the result is within [0, KTIME_MAX].

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-13-dima@arista.com
2020-01-14 12:20:53 +01:00
Andrei Vagin
5a590f35ad posix-clocks: Wire up clock_gettime() with timens offsets
Adjust monotonic and boottime clocks with per-timens offsets.  As the
result a process inside time namespace will see timers and clocks corrected
to offsets that were set when the namespace was created

Note that applications usually go through vDSO to get time, which is not
yet adjusted. Further changes will complete time namespace virtualisation
with vDSO support.

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-12-dima@arista.com
2020-01-14 12:20:52 +01:00
Andrei Vagin
198fa445d5 posix-timers: Use clock_get_ktime() in common_timer_get()
Now, when the clock_get_ktime() callback exists, the suboptimal
timespec64-based conversion can be removed from common_timer_get().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-11-dima@arista.com
2020-01-14 12:20:52 +01:00
Andrei Vagin
9c71a2e8a7 posix-clocks: Introduce clock_get_ktime() callback
The callsite in common_timer_get() has already a comment:
    /*
     * The timespec64 based conversion is suboptimal, but it's not
     * worth to implement yet another callback.
     */
    kc->clock_get(timr->it_clock, &ts64);
    now = timespec64_to_ktime(ts64);

The upcoming support for time namespaces requires to have access to:

 - The time in a task's time namespace for sys_clock_gettime()
 - The time in the root name space for common_timer_get()

That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-10-dima@arista.com
2020-01-14 12:20:51 +01:00
Andrei Vagin
2f58bf909a alarmtimer: Provide get_timespec() callback
The upcoming support for time namespaces requires to have access to:

  - The time in a task's time namespace for sys_clock_gettime()
  - The time in the root name space for common_timer_get()

Wire up alarm bases with get_timespec().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-9-dima@arista.com
2020-01-14 12:20:51 +01:00
Andrei Vagin
41b3b8dffc alarmtimer: Rename gettime() callback to get_ktime()
The upcoming support for time namespaces requires to have access to:

  - The time in a tasks time namespace for sys_clock_gettime()
  - The time in the root name space for common_timer_get()

struct alarm_base needs to follow the same naming convention, so rename
.gettime() callback into get_ktime() as a preparation for introducing
get_timespec().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-8-dima@arista.com
2020-01-14 12:20:50 +01:00
Andrei Vagin
eaf80194d0 posix-clocks: Rename .clock_get_timespec() callbacks accordingly
The upcoming support for time namespaces requires to have access to:

  - The time in a task's time namespace for sys_clock_gettime()
  - The time in the root name space for common_timer_get()

That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format in (struct k_clock).

As a preparation ground for introducing clock_get_ktime(), the original
callback clock_get() was renamed into clock_get_timespec().
Reflect the renaming into the callback implementations.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-7-dima@arista.com
2020-01-14 12:20:50 +01:00
Andrei Vagin
819a95fe3a posix-clocks: Rename the clock_get() callback to clock_get_timespec()
The upcoming support for time namespaces requires to have access to:

 - The time in a task's time namespace for sys_clock_gettime()
 - The time in the root name space for common_timer_get()

That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format, rather than in (struct timespec).

Rename the clock_get() callback to clock_get_timespec() as a preparation
for introducing clock_get_ktime().

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-6-dima@arista.com
2020-01-14 12:20:49 +01:00
Andrei Vagin
af993f58d6 time: Add timens_offsets to be used for tasks in time namespace
Introduce offsets for time namespace. They will contain an adjustment
needed to convert clocks to/from host's.

A new namespace is created with the same offsets as the time namespace
of the current process.

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-5-dima@arista.com
2020-01-14 12:20:49 +01:00
Andrei Vagin
769071ac9f ns: Introduce Time Namespace
Time Namespace isolates clock values.

The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.

CLOCK_REALTIME
      System-wide clock that measures real (i.e., wall-clock) time.

CLOCK_MONOTONIC
      Clock that cannot be set and represents monotonic time since
      some unspecified starting point.

CLOCK_BOOTTIME
      Identical to CLOCK_MONOTONIC, except it also includes any time
      that the system is suspended.

For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.

But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.

A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.

This scheme allows setting clock offsets for a namespace, before any
processes appear in it.

All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.

[ tglx: Adjusted paragraph about clone3() to reality and massaged the
  	changelog a bit. ]

Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
2020-01-14 12:20:48 +01:00
Masami Hiramatsu
3b42a4c83a tracing: trigger: Replace unneeded RCU-list traversals
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.

-----
  # dmesg -c > /dev/null
  # ./ftracetest test.d/trigger
  ...
  # dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
  kernel/trace/trace_events_hist.c:6070
  kernel/trace/trace_events_hist.c:1760
  kernel/trace/trace_events_hist.c:5911
  kernel/trace/trace_events_trigger.c:504
  kernel/trace/trace_events_hist.c:1810
  kernel/trace/trace_events_hist.c:3158
  kernel/trace/trace_events_hist.c:3105
  kernel/trace/trace_events_hist.c:5518
  kernel/trace/trace_events_hist.c:5998
  kernel/trace/trace_events_hist.c:6019
  kernel/trace/trace_events_hist.c:6044
  kernel/trace/trace_events_trigger.c:1500
  kernel/trace/trace_events_trigger.c:1540
  kernel/trace/trace_events_trigger.c:539
  kernel/trace/trace_events_trigger.c:584
-----

I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.

I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.

Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.

Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 15:59:11 -05:00
Sargun Dhillon
8649c322f7
pid: Implement pidfd_getfd syscall
This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is possible using ptrace, and
injection of parasitic code to inject code which leverages SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes with a high cost of requiring the process to be stopped,
and breaks debuggers. This does not require stopping the process under
manipulation.

One reason to use this is to allow sandboxers to take actions on file
descriptors on the behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port.

/* prototype */
  /* flags is currently reserved and should be set to 0 */
  int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);

/* testing */
Ran self-test suite on x86_64

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20200107175927.4558-3-sargun@sargun.me
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-13 21:49:36 +01:00
Masami Hiramatsu
fe1efe9252 tracing/boot: Add function tracer filter options
Add below function-tracer filter options to boot-time tracing.

 - ftrace.[instance.INSTANCE.]ftrace.filters
   This will take an array of tracing function filter rules

 - ftrace.[instance.INSTANCE.]ftrace.notraces
   This will take an array of NON-tracing function filter rules

Link: http://lkml.kernel.org/r/157867244841.17873.10933616628243103561.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:42 -05:00
Masami Hiramatsu
9d15dbbde1 tracing/boot: Add cpu_mask option support
Add ftrace.cpumask option support to boot-time tracing.
This sets cpumask for each instance.

 - ftrace.[instance.INSTANCE.]cpumask = CPUMASK;
   Set the trace cpumask. Note that the CPUMASK should be a string
   which <tracefs>/tracing_cpumask can accepts.

Link: http://lkml.kernel.org/r/157867243625.17873.13613922641273149372.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:42 -05:00
Masami Hiramatsu
4f712a4d04 tracing/boot: Add instance node support
Add instance node support to boot-time tracing. User can set
some options and event nodes under instance node.

 - ftrace.instance.INSTANCE[...]
   Add new INSTANCE instance. Some options and event nodes
   are acceptable for instance node.

Link: http://lkml.kernel.org/r/157867242413.17873.9814204526141500278.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:42 -05:00
Masami Hiramatsu
3fbe2d6e1f tracing/boot: Add synthetic event support
Add synthetic event node support to boot time tracing.
The synthetic event is a kind of event node, but the group
name is "synthetic".

 - ftrace.event.synthetic.EVENT.fields = FIELD[, FIELD2...]
   Defines new synthetic event with FIELDs. Each field should be
   "type varname".

The synthetic node requires "fields" string arraies, which defines
the fields as same as tracing/synth_events interface.

Link: http://lkml.kernel.org/r/157867241236.17873.12411615143321557709.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:42 -05:00
Masami Hiramatsu
4d655281eb tracing/boot Add kprobe event support
Add kprobe event support on event node to boot-time tracing.
If the group name of event is "kprobes", the boot-time tracing
defines new probe event according to "probes" values.

 - ftrace.event.kprobes.EVENT.probes = PROBE[, PROBE2...]
   Defines new kprobe event based on PROBEs. It is able to define
   multiple probes on one event, but those must have same type of
   arguments.

For example,

 ftrace.events.kprobes.myevent {
	probes = "vfs_read $arg1 $arg2";
	enable;
 }

This will add kprobes:myevent on vfs_read with the 1st and the 2nd
arguments.

Link: http://lkml.kernel.org/r/157867240104.17873.9712052065426433111.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:42 -05:00
Masami Hiramatsu
81a59555ff tracing/boot: Add per-event settings
Add per-event settings for boottime tracing. User can set filter,
actions and enable on each event on boot. The event entries are
under ftrace.event.GROUP.EVENT node (note that the option key
includes event's group name and event name.) This supports below
configs.

 - ftrace.event.GROUP.EVENT.enable
   Enables GROUP:EVENT tracing.

 - ftrace.event.GROUP.EVENT.filter = FILTER
   Set FILTER rule to the GROUP:EVENT.

 - ftrace.event.GROUP.EVENT.actions = ACTION[, ACTION2...]
   Set ACTIONs to the GROUP:EVENT.

For example,

  ftrace.event.sched.sched_process_exec {
                filter = "pid < 128"
		enable
  }

this will enable tracing "sched:sched_process_exec" event
with "pid < 128" filter.

Link: http://lkml.kernel.org/r/157867238942.17873.11177628789184546198.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:41 -05:00
Masami Hiramatsu
9c5b9d3d65 tracing/boot: Add boot-time tracing
Setup tracing options via extra boot config in addition to kernel
command line.

This adds following commands support. These are applied to
the global trace instance.

 - ftrace.options = OPT1[,OPT2...]
   Enable given ftrace options.

 - ftrace.trace_clock = CLOCK
   Set given CLOCK to ftrace's trace_clock.

 - ftrace.buffer_size = SIZE
   Configure ftrace buffer size to SIZE. You can use "KB" or "MB"
   for that SIZE.

 - ftrace.events = EVENT[, EVENT2...]
   Enable given events on boot. You can use a wild card in EVENT.

 - ftrace.tracer = TRACER
   Set TRACER to current tracer on boot. (e.g. function)

Note that this is NOT replacing the kernel parameters, because
this boot config based setting is later than that. If you want to
trace earlier boot events, you still need kernel parameters.

Link: http://lkml.kernel.org/r/157867237723.17873.17494943526320587488.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:41 -05:00
Masami Hiramatsu
48ac9488a5 tracing: Add NULL trace-array check in print_synth_event()
Add NULL trace-array check in print_synth_event(), because
if we enable tp_printk option, iter->tr can be NULL.

Link: http://lkml.kernel.org/r/157867236536.17873.12529350542460184019.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:41 -05:00
Masami Hiramatsu
b05e89ae7c tracing: Accept different type for synthetic event fields
Make the synthetic event accepts a different type field to record.
However, the size and signed flag must be same.

Link: http://lkml.kernel.org/r/157867235358.17873.61732996461602171.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:41 -05:00
Masami Hiramatsu
d8d4c6d0e7 tracing: kprobes: Register to dynevent earlier stage
Register kprobe event to dynevent in subsys_initcall level.
This will allow kernel to register new kprobe events in
fs_initcall level via trace_run_command.

Link: http://lkml.kernel.org/r/157867234213.17873.18039000024374948737.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:41 -05:00
Masami Hiramatsu
8cfcf15503 tracing: kprobes: Output kprobe event to printk buffer
Since kprobe-events use event_trigger_unlock_commit_regs() directly,
that events doesn't show up in printk buffer if "tp_printk" is set.

Use trace_event_buffer_commit() in kprobe events so that it can
invoke output_printk() as same as other trace events.

Link: http://lkml.kernel.org/r/157867233085.17873.5210928676787339604.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
[ Adjusted data var declaration placement in __kretprobe_trace_func() ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:40 -05:00
Masami Hiramatsu
d8d0c245a7 tracing: Apply soft-disabled and filter to tracepoints printk
Apply soft-disabled and the filter rule of the trace events to
the printk output of tracepoints (a.k.a. tp_printk kernel parameter)
as same as trace buffer output.

Link: http://lkml.kernel.org/r/157867231876.17873.15825819592284704068.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:40 -05:00
Steven Rostedt (VMware)
1329249437 tracing: Make struct ring_buffer less ambiguous
As there's two struct ring_buffers in the kernel, it causes some confusion.
The other one being the perf ring buffer. It was agreed upon that as neither
of the ring buffers are generic enough to be used globally, they should be
renamed as:

   perf's ring_buffer -> perf_buffer
   ftrace's ring_buffer -> trace_buffer

This implements the changes to the ring buffer that ftrace uses.

Link: https://lore.kernel.org/r/20191213140531.116b3200@gandalf.local.home

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:38 -05:00
Steven Rostedt (VMware)
1c5eb4481e tracing: Rename trace_buffer to array_buffer
As we are working to remove the generic "ring_buffer" name that is used by
both tracing and perf, the ring_buffer name for tracing will be renamed to
trace_buffer, and perf's ring buffer will be renamed to perf_buffer.

As there already exists a trace_buffer that is used by the trace_arrays, it
needs to be first renamed to array_buffer.

Link: https://lore.kernel.org/r/20191213153553.GE20583@krava

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:38 -05:00
Steven Rostedt (VMware)
56de4e8f91 perf: Make struct ring_buffer less ambiguous
eBPF requires needing to know the size of the perf ring buffer structure.
But it unfortunately has the same name as the generic ring buffer used by
tracing and oprofile. To make it less ambiguous, rename the perf ring buffer
structure to "perf_buffer".

As other parts of the ring buffer code has "perf_" as the prefix, it only
makes sense to give the ring buffer the "perf_" prefix as well.

Link: https://lore.kernel.org/r/20191213153553.GE20583@krava
Acked-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-13 13:19:38 -05:00
Linus Torvalds
606e9ad200 clone3-tls-v5.5-rc6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXhhtDQAKCRCRxhvAZXjc
 orQ3AQD7H2ovZbPIpWbwOnRIExBF4O8gPDfFc/J/RweZx40v/AD/QwfFnq0TpmUc
 UfS4zzLxJ4K+L4RYWId5v8MFHGIu8QQ=
 =LmmJ
 -----END PGP SIGNATURE-----

Merge tag 'clone3-tls-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fixes from Christian Brauner:
 "This contains a series of patches to fix CLONE_SETTLS when used with
  clone3().

  The clone3() syscall passes the tls argument through struct clone_args
  instead of a register. This means, all architectures that do not
  implement copy_thread_tls() but still support CLONE_SETTLS via
  copy_thread() expecting the tls to be located in a register argument
  based on clone() are currently unfortunately broken. Their tls value
  will be garbage.

  The patch series fixes this on all architectures that currently define
  __ARCH_WANT_SYS_CLONE3. It also adds a compile-time check to ensure
  that any architecture that enables clone3() in the future is forced to
  also implement copy_thread_tls().

  My ultimate goal is to get rid of the copy_thread()/copy_thread_tls()
  split and just have copy_thread_tls() at some point in the not too
  distant future (Maybe even renaming copy_thread_tls() back to simply
  copy_thread() once the old function is ripped from all arches). This
  is dependent now on all arches supporting clone3().

  While all relevant arches do that now there are still four missing:
  ia64, m68k, sh and sparc. They have the system call reserved, but not
  implemented. Once they all implement clone3() we can get rid of
  ARCH_WANT_SYS_CLONE3 and HAVE_COPY_THREAD_TLS.

  This series also includes a minor fix for the arm64 uapi headers which
  caused __NR_clone3 to be missing from the exported user headers.

  Unfortunately the series came in a little late especially given that
  it touches a range of architectures. Due to the holidays not all arch
  maintainers responded in time probably due to their backlog. Will and
  Arnd have thankfully acked the arm specific changes.

  Given that the changes are straightforward and rather minimal combined
  with the fact the that clone3() with CLONE_SETTLS is broken I decided
  to send them post rc3 nonetheless"

* tag 'clone3-tls-v5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  um: Implement copy_thread_tls
  clone3: ensure copy_thread_tls is implemented
  xtensa: Implement copy_thread_tls
  riscv: Implement copy_thread_tls
  parisc: Implement copy_thread_tls
  arm: Implement copy_thread_tls
  arm64: Implement copy_thread_tls
  arm64: Move __ARCH_WANT_SYS_CLONE3 definition to uapi headers
2020-01-11 15:33:48 -08:00
Thomas Gleixner
2e34d63d82 Merge branch 'timers/urgent' into timers/core
Pick up upstream VDSO fix before adding more VDSO changes.
2020-01-10 21:11:54 +01:00
Alexei Starovoitov
51c39bb1d5 bpf: Introduce function-by-function verification
New llvm and old llvm with libbpf help produce BTF that distinguish global and
static functions. Unlike arguments of static function the arguments of global
functions cannot be removed or optimized away by llvm. The compiler has to use
exactly the arguments specified in a function prototype. The argument type
information allows the verifier validate each global function independently.
For now only supported argument types are pointer to context and scalars. In
the future pointers to structures, sizes, pointer to packet data can be
supported as well. Consider the following example:

static int f1(int ...)
{
  ...
}

int f3(int b);

int f2(int a)
{
  f1(a) + f3(a);
}

int f3(int b)
{
  ...
}

int main(...)
{
  f1(...) + f2(...) + f3(...);
}

The verifier will start its safety checks from the first global function f2().
It will recursively descend into f1() because it's static. Then it will check
that arguments match for the f3() invocation inside f2(). It will not descend
into f3(). It will finish f2() that has to be successfully verified for all
possible values of 'a'. Then it will proceed with f3(). That function also has
to be safe for all possible values of 'b'. Then it will start subprog 0 (which
is main() function). It will recursively descend into f1() and will skip full
check of f2() and f3(), since they are global. The order of processing global
functions doesn't affect safety, since all global functions must be proven safe
based on their arguments only.

Such function by function verification can drastically improve speed of the
verification and reduce complexity.

Note that the stack limit of 512 still applies to the call chain regardless whether
functions were static or global. The nested level of 8 also still applies. The
same recursion prevention checks are in place as well.

The type information and static/global kind is preserved after the verification
hence in the above example global function f2() and f3() can be replaced later
by equivalent functions with the same types that are loaded and verified later
without affecting safety of this main() program. Such replacement (re-linking)
of global functions is a subject of future patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org
2020-01-10 17:20:07 +01:00
Colin Ian King
5c0e9de065 PM: hibernate: fix spelling mistake "shapshot" -> "snapshot"
There is a spelling mistake in a pr_info message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-01-10 12:15:30 +01:00
Alan Maguire
c475c77d5b kunit: allow kunit tests to be loaded as a module
As tests are added to kunit, it will become less feasible to execute
all built tests together.  By supporting modular tests we provide
a simple way to do selective execution on a running system; specifying

CONFIG_KUNIT=y
CONFIG_KUNIT_EXAMPLE_TEST=m

...means we can simply "insmod example-test.ko" to run the tests.

To achieve this we need to do the following:

o export the required symbols in kunit
o string-stream tests utilize non-exported symbols so for now we skip
  building them when CONFIG_KUNIT_TEST=m.
o drivers/base/power/qos-test.c contains a few unexported interface
  references, namely freq_qos_read_value() and freq_constraints_init().
  Both of these could be potentially defined as static inline functions
  in include/linux/pm_qos.h, but for now we simply avoid supporting
  module build for that test suite.
o support a new way of declaring test suites.  Because a module cannot
  do multiple late_initcall()s, we provide a kunit_test_suites() macro
  to declare multiple suites within the same module at once.
o some test module names would have been too general ("test-test"
  and "example-test" for kunit tests, "inode-test" for ext4 tests);
  rename these as appropriate ("kunit-test", "kunit-example-test"
  and "ext4-inode-test" respectively).

Also define kunit_test_suite() via kunit_test_suites()
as callers in other trees may need the old definition.

Co-developed-by: Knut Omang <knut.omang@oracle.com>
Signed-off-by: Knut Omang <knut.omang@oracle.com>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4 bits
Acked-by: David Gow <davidgow@google.com> # For list-test
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2020-01-09 16:42:29 -07:00
David S. Miller
a2d6d7ae59 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The ungrafting from PRIO bug fixes in net, when merged into net-next,
merge cleanly but create a build failure.  The resolution used here is
from Petr Machata.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 12:13:43 -08:00
Linus Torvalds
a5f48c7878 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Missing netns pointer init in arp_tables, from Florian Westphal.

 2) Fix normal tcp SACK being treated as D-SACK, from Pengcheng Yang.

 3) Fix divide by zero in sch_cake, from Wen Yang.

 4) Len passed to skb_put_padto() is wrong in qrtr code, from Carl
    Huang.

 5) cmd->obj.chunk is leaked in sctp code error paths, from Xin Long.

 6) cgroup bpf programs can be released out of order, fix from Roman
    Gushchin.

 7) Make sure stmmac debugfs entry name is changed when device name
    changes, from Jiping Ma.

 8) Fix memory leak in vlan_dev_set_egress_priority(), from Eric
    Dumazet.

 9) SKB leak in lan78xx usb driver, also from Eric Dumazet.

10) Ridiculous TCA_FQ_QUANTUM values configured can cause loops in fq
    packet scheduler, reject them. From Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  tipc: fix wrong connect() return code
  tipc: fix link overflow issue at socket shutdown
  netfilter: ipset: avoid null deref when IPSET_ATTR_LINENO is present
  netfilter: conntrack: dccp, sctp: handle null timeout argument
  atm: eni: fix uninitialized variable warning
  macvlan: do not assume mac_header is set in macvlan_broadcast()
  net: sch_prio: When ungrafting, replace with FIFO
  mlxsw: spectrum_qdisc: Ignore grafting of invisible FIFO
  MAINTAINERS: Remove myself as co-maintainer for qcom-ethqos
  gtp: fix bad unlock balance in gtp_encap_enable_socket
  pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM
  tipc: remove meaningless assignment in Makefile
  tipc: do not add socket.o to tipc-y twice
  net: stmmac: dwmac-sun8i: Allow all RGMII modes
  net: stmmac: dwmac-sunxi: Allow all RGMII modes
  net: usb: lan78xx: fix possible skb leak
  net: stmmac: Fixed link does not need MDIO Bus
  vlan: vlan_changelink() should propagate errors
  vlan: fix memory leak in vlan_dev_set_egress_priority
  stmmac: debugfs entry name is not be changed when udev rename device name.
  ...
2020-01-09 10:34:07 -08:00
Paul Cercueil
2707745533 time/sched_clock: Disable interrupts in sched_clock_register()
Instead of issueing a warning if sched_clock_register() is called from a
context where IRQs are enabled, the code now ensures that IRQs are indeed
disabled.

Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20200107010630.954648-1-paul@crapouillou.net
2020-01-09 18:50:18 +01:00
Arnd Bergmann
f35deaff1b time/posix-stubs: Provide compat itimer supoprt for alpha
Using compat_sys_getitimer and compat_sys_setitimer on alpha
causes a link failure in the Alpha tinyconfig and other configurations
that turn off CONFIG_POSIX_TIMERS.

Use the same #ifdef check for the stub version as well.

Fixes: 4c22ea2b91 ("y2038: use compat_{get,set}_itimer on alpha")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20191207191043.656328-1-arnd@arndb.de
2020-01-09 18:20:23 +01:00
Jules Irenge
099368bb10 genirq: Add missing __must_hold() sparse annotation
Add __must_hold() annotation to address the following sparse warning:

  warning: context imbalance in irq_wait_for_poll - unexpected unlock

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191216144208.29852-2-jbi.octave@gmail.com
2020-01-09 18:03:37 +01:00
Jules Irenge
8b3b54799b genirq: Add missing __releases() sparse annotation
Add __releases() annotation to address the following sparse warning:

  warning: context imbalance in __irq_put_desc_unlock() - unexpected unlock

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191216144208.29852-1-jbi.octave@gmail.com
2020-01-09 18:03:24 +01:00
Martin KaFai Lau
0baf26b0fc bpf: tcp: Support tcp_congestion_ops in bpf
This patch makes "struct tcp_congestion_ops" to be the first user
of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
in bpf.

The BPF implemented tcp_congestion_ops can be used like
regular kernel tcp-cc through sysctl and setsockopt.  e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic

There has been attempt to move the TCP CC to the user space
(e.g. CCP in TCP).   The common arguments are faster turn around,
get away from long-tail kernel versions in production...etc,
which are legit points.

BPF has been the continuous effort to join both kernel and
userspace upsides together (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.  It allows a faster turnaround for testing algorithm
in the production while leveraging the existing (and continue growing)
BPF feature/framework instead of building one specifically for
userspace TCP CC.

This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).

The optional "get_info" is unsupported now.  It can be added
later.  One possible way is to output the info with a btf-id
to describe the content.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Martin KaFai Lau
85d33df357 bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
is a kernel struct with its func ptr implemented in bpf prog.
This new map is the interface to register/unregister/introspect
a bpf implemented kernel struct.

The kernel struct is actually embedded inside another new struct
(or called the "value" struct in the code).  For example,
"struct tcp_congestion_ops" is embbeded in:
struct bpf_struct_ops_tcp_congestion_ops {
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
}
The map value is "struct bpf_struct_ops_tcp_congestion_ops".
The "bpftool map dump" will then be able to show the
state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
is created automatically by a macro.  Having a separate "value" struct
will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
"void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
initialization works before registering the struct_ops to the kernel
subsystem).  The libbpf will take care of finding and populating the
"struct bpf_struct_ops_XYZ" from "struct XYZ".

Register a struct_ops to a kernel subsystem:
1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
   set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
   running kernel.
   Instead of reusing the attr->btf_value_type_id,
   btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
   used as the "user" btf which could store other useful sysadmin/debug
   info that may be introduced in the furture,
   e.g. creation-date/compiler-details/map-creator...etc.
3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
   in the running kernel btf.  Populate the value of this object.
   The function ptr should be populated with the prog fds.
4. Call BPF_MAP_UPDATE with the object created in (3) as
   the map value.  The key is always "0".

During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
the specific struct_ops to do some final checks in "st_ops->init_member()"
(e.g. ensure all mandatory func ptrs are implemented).
If everything looks good, it will register this kernel struct
to the kernel subsystem.  The map will not allow further update
from this point.

Unregister a struct_ops from the kernel subsystem:
BPF_MAP_DELETE with key "0".

Introspect a struct_ops:
BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
have the prog _id_ populated as the func ptr.

The map value state (enum bpf_struct_ops_state) will transit from:
INIT (map created) =>
INUSE (map updated, i.e. reg) =>
TOBEFREE (map value deleted, i.e. unreg)

The kernel subsystem needs to call bpf_struct_ops_get() and
bpf_struct_ops_put() to manage the "refcnt" in the
"struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
for the purose of tracking the subsystem usage.  Another approach
is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
the subsystem's usage by doing map->refcnt - map->usercnt to filter out
the map-fd/pinned-map usage.  However, that will also tie down the
future semantics of map->refcnt and map->usercnt.

The very first subsystem's refcnt (during reg()) holds one
count to map->refcnt.  When the very last subsystem's refcnt
is gone, it will also release the map->refcnt.  All bpf_prog will be
freed when the map->refcnt reaches 0 (i.e. during map_free()).

Here is how the bpftool map command will look like:
[root@arch-fb-vm1 bpf]# bpftool map show
6: struct_ops  name dctcp  flags 0x0
	key 4B  value 256B  max_entries 1  memlock 4096B
	btf_id 6
[root@arch-fb-vm1 bpf]# bpftool map dump id 6
[{
        "value": {
            "refcnt": {
                "refs": {
                    "counter": 1
                }
            },
            "state": 1,
            "data": {
                "list": {
                    "next": 0,
                    "prev": 0
                },
                "key": 0,
                "flags": 2,
                "init": 24,
                "release": 0,
                "ssthresh": 25,
                "cong_avoid": 30,
                "set_state": 27,
                "cwnd_event": 28,
                "in_ack_event": 26,
                "undo_cwnd": 29,
                "pkts_acked": 0,
                "min_tso_segs": 0,
                "sndbuf_expand": 0,
                "cong_control": 0,
                "get_info": 0,
                "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                ],
                "owner": 0
            }
        }
    }
]

Misc Notes:
* bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
  It does an inplace update on "*value" instead returning a pointer
  to syscall.c.  Otherwise, it needs a separate copy of "zero" value
  for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

* The bpf_struct_ops_map_delete_elem() is also called without
  preempt_disable() from map_delete_elem().  It is because
  the "->unreg()" may requires sleepable context, e.g.
  the "tcp_unregister_congestion_control()".

* "const" is added to some of the existing "struct btf_func_model *"
  function arg to avoid a compiler warning caused by this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Martin KaFai Lau
27ae7997a6 bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS
This patch allows the kernel's struct ops (i.e. func ptr) to be
implemented in BPF.  The first use case in this series is the
"struct tcp_congestion_ops" which will be introduced in a
latter patch.

This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
of a kernel struct.  The attr->expected_attach_type is the member
"index" of that kernel struct.  The first member of a struct starts
with member index 0.  That will avoid ambiguity when a kernel struct
has multiple func ptrs with the same func signature.

For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
to implement the "init" func ptr of the "struct tcp_congestion_ops".
The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
of the _running_ kernel.  The attr->expected_attach_type is 3.

The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
by arch_prepare_bpf_trampoline that will be done in the next
patch when introducing BPF_MAP_TYPE_STRUCT_OPS.

"struct bpf_struct_ops" is introduced as a common interface for the kernel
struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
struct will need to implement an instance of the "struct bpf_struct_ops".

The supporting kernel struct also needs to implement a bpf_verifier_ops.
During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
bpf_verifier_ops by searching the attr->attach_btf_id.

A new "btf_struct_access" is also added to the bpf_verifier_ops such
that the supporting kernel struct can optionally provide its own specific
check on accessing the func arg (e.g. provide limited write access).

After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
to initialize some values (e.g. the btf id of the supporting kernel
struct) and it can only be done once the btf_vmlinux is available.

The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
if the return type of the prog->aux->attach_func_proto is "void".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Martin KaFai Lau
976aba002f bpf: Support bitfield read access in btf_struct_access
This patch allows bitfield access as a scalar.

It checks "off + size > t->size" to avoid accessing bitfield
end up accessing beyond the struct.  This check is done
outside of the loop since it is applicable to all access.

It also takes this chance to break early on the "off < moff" case.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003501.3855427-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Martin KaFai Lau
218b3f65f9 bpf: Add enum support to btf_ctx_access()
It allows bpf prog (e.g. tracing) to attach
to a kernel function that takes enum argument.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003459.3855366-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Martin KaFai Lau
275517ff45 bpf: Avoid storing modifier to info->btf_id
info->btf_id expects the btf_id of a struct, so it should
store the final result after skipping modifiers (if any).

It also takes this chanace to add a missing newline in one of the
bpf_log() messages.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003456.3855176-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Martin KaFai Lau
65726b5b7e bpf: Save PTR_TO_BTF_ID register state when spilling to stack
This patch makes the verifier save the PTR_TO_BTF_ID register state when
spilling to the stack.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003454.3854870-1-kafai@fb.com
2020-01-09 08:45:32 -08:00
Arnd Bergmann
dc8d37ed30 cpu/SMT: Fix x86 link error without CONFIG_SYSFS
When CONFIG_SYSFS is disabled, but CONFIG_HOTPLUG_SMT is enabled,
the kernel fails to link:

arch/x86/power/cpu.o: In function `hibernate_resume_nonboot_cpu_disable':
(.text+0x38d): undefined reference to `cpuhp_smt_enable'
arch/x86/power/hibernate.o: In function `arch_resume_nosmt':
hibernate.c:(.text+0x291): undefined reference to `cpuhp_smt_enable'
hibernate.c:(.text+0x29c): undefined reference to `cpuhp_smt_disable'

Move the exported functions out of the #ifdef section into its
own with the correct conditions.

The patch that caused this is marked for stable backports, so
this one may need to be backported as well.

Fixes: ec527c3180 ("x86/power: Fix 'nosmt' vs hibernation triple fault during resume")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20191210195614.786555-1-arnd@arndb.de
2020-01-09 17:31:45 +01:00
Luca Ceresoli
025af39b87 genirq: Show irq name in non-oneshot error message
Requesting a threaded IRQ with handler=NULL and !ONESHOT fails, but the
error message does not include the IRQ line name, which makes it harder to
find the offending driver.

Print the IRQ line name to clarify where the error comes from. Use the same
format as the other pr_err() above in the same function.

Signed-off-by: Luca Ceresoli <luca@lucaceresoli.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191105140854.27893-1-luca@lucaceresoli.net
2020-01-09 15:42:54 +01:00
Randy Dunlap
51bfb1d11d futex: Fix kernel-doc notation warning
Fix a kernel-doc warning in kernel/futex.c by adding notation
for @ret.

../kernel/futex.c:1187: warning: Function parameter or member 'ret' not described in 'wait_for_owner_exiting'

Fixes: 3ef240eaff ("futex: Prevent exit livelock")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/223be78c-f3c8-52df-836d-c5fb8e7907e9@infradead.org
2020-01-09 13:23:40 +01:00
Masami Hiramatsu
e4add24778 kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
optimize_kprobe() and unoptimize_kprobe() cancels if a given kprobe
is on the optimizing_list or unoptimizing_list already. However, since
the following commit:

  f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")

modified the update timing of the KPROBE_FLAG_OPTIMIZED, it doesn't
work as expected anymore.

The optimized_kprobe could be in the following states:

- [optimizing]: Before inserting jump instruction
  op.kp->flags has KPROBE_FLAG_OPTIMIZED and
  op->list is not empty.

- [optimized]: jump inserted
  op.kp->flags has KPROBE_FLAG_OPTIMIZED and
  op->list is empty.

- [unoptimizing]: Before removing jump instruction (including unused
  optprobe)
  op.kp->flags has KPROBE_FLAG_OPTIMIZED and
  op->list is not empty.

- [unoptimized]: jump removed
  op.kp->flags doesn't have KPROBE_FLAG_OPTIMIZED and
  op->list is empty.

Current code mis-expects [unoptimizing] state doesn't have
KPROBE_FLAG_OPTIMIZED, and that can cause incorrect results.

To fix this, introduce optprobe_queued_unopt() to distinguish [optimizing]
and [unoptimizing] states and fixes the logic in optimize_kprobe() and
unoptimize_kprobe().

[ mingo: Cleaned up the changelog and the code a bit. ]

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")
Link: https://lkml.kernel.org/r/157840814418.7181.13478003006386303481.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-09 12:40:13 +01:00
Pavel Tatashin
de68e4daea kexec: add machine_kexec_post_load()
It is the same as machine_kexec_prepare(), but is called after segments are
loaded. This way, can do processing work with already loaded relocation
segments. One such example is arm64: it has to have segments loaded in
order to create a page table, but it cannot do it during kexec time,
because at that time allocations won't be possible anymore.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-01-08 16:32:55 +00:00
Pavel Tatashin
d42cc530b1 kexec: quiet down kexec reboot
Here is a regular kexec command sequence and output:
=====
$ kexec --reuse-cmdline -i --load Image
$ kexec -e
[  161.342002] kexec_core: Starting new kernel

Welcome to Buildroot
buildroot login:
=====

Even when "quiet" kernel parameter is specified, "kexec_core: Starting
new kernel" is printed.

This message has  KERN_EMERG level, but there is no emergency, it is a
normal kexec operation, so quiet it down to appropriate KERN_NOTICE.

Machines that have slow console baud rate benefit from less output.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-01-08 16:32:55 +00:00
YueHaibing
f6d061d617 kernel/module: Fix memleak in module_add_modinfo_attrs()
In module_add_modinfo_attrs() if sysfs_create_file() fails
on the first iteration of the loop (so i = 0), we forget to
free the modinfo_attrs.

Fixes: bc6f2a757d ("kernel/module: Fix mem leak in module_add_modinfo_attrs")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-01-08 17:07:20 +01:00
Marco Elver
d47715f50e kcsan, ubsan: Make KCSAN+UBSAN work together
Context:
http://lkml.kernel.org/r/fb7e25d8-aba4-3dcf-7761-cb7ecb3ebb71@infradead.org

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-07 07:47:23 -08:00
Amanieu d'Antras
dd499f7a7e
clone3: ensure copy_thread_tls is implemented
copy_thread implementations handle CLONE_SETTLS by reading the TLS
value from the registers containing the syscall arguments for
clone. This doesn't work with clone3 since the TLS value is passed
in clone_args instead.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: <stable@vger.kernel.org> # 5.3.x
Link: https://lore.kernel.org/r/20200102172413.654385-8-amanieu@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-01-07 13:31:27 +01:00
Luigi Semenzato
7a7b99bf80 PM: hibernate: Add more logging on hibernation failure
Hibernation fails when the kernel cannot allocate enough memory
to copy all pages of RAM in use.

Ensure that the failure reason is clearly logged, and clearly
attributable to the hibernation module.

Signed-off-by: Luigi Semenzato <semenzato@google.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-01-07 13:31:12 +01:00
Wen Yang
809ed78a83 PM: hibernate: improve arithmetic division in preallocate_highmem_fraction()
do_div() does a 64-by-32 division. Use div64_u64() instead of
do_div() if the divisor is u64, to avoid truncation to 32-bit.

This change also cleans up code a tad.

Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-01-07 12:42:56 +01:00
Linus Torvalds
ae6088216c Various tracing fixes:
- kbuild found missing define of MCOUNT_INSN_SIZE for various build configs
  - Initialize variable to zero as gcc thinks it is used undefined
     (it really isn't but the code is subtle enough that this doesn't hurt)
  - Convert from do_div() to div64_ull() to prevent potential divide by zero
  - Unregister a trace point on error path in sched_wakeup tracer
  - Use signed offset for archs that can have stext not be first
  - A simple indentation fix (whitespace error)
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXhOj6xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qukzAQCMNfkAbMFA+C1uORMhr/jWhi4eshWN
 4jZ2u5X8zGuuXQD+PaQU4n8d0K4uCPF+lFD16DfFxXvCOXHfN3/zXmxGvw8=
 =djaW
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various tracing fixes:

   - kbuild found missing define of MCOUNT_INSN_SIZE for various build
     configs

   - Initialize variable to zero as gcc thinks it is used undefined (it
     really isn't but the code is subtle enough that this doesn't hurt)

   - Convert from do_div() to div64_ull() to prevent potential divide by
     zero

   - Unregister a trace point on error path in sched_wakeup tracer

   - Use signed offset for archs that can have stext not be first

   - A simple indentation fix (whitespace error)"

* tag 'trace-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix indentation issue
  kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail
  tracing: Change offset type to s32 in preempt/irq tracepoints
  ftrace: Avoid potential division by zero in function profiler
  tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined
  tracing: Define MCOUNT_INSN_SIZE when not defined without direct calls
  tracing: Initialize val to zero in parse_entry of inject code
2020-01-06 15:38:38 -08:00
Daniel Borkmann
6d4f151acf bpf: Fix passing modified ctx to ld/abs/ind instruction
Anatoly has been fuzzing with kBdysch harness and reported a KASAN
slab oob in one of the outcomes:

  [...]
  [   77.359642] BUG: KASAN: slab-out-of-bounds in bpf_skb_load_helper_8_no_cache+0x71/0x130
  [   77.360463] Read of size 4 at addr ffff8880679bac68 by task bpf/406
  [   77.361119]
  [   77.361289] CPU: 2 PID: 406 Comm: bpf Not tainted 5.5.0-rc2-xfstests-00157-g2187f215eba #1
  [   77.362134] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  [   77.362984] Call Trace:
  [   77.363249]  dump_stack+0x97/0xe0
  [   77.363603]  print_address_description.constprop.0+0x1d/0x220
  [   77.364251]  ? bpf_skb_load_helper_8_no_cache+0x71/0x130
  [   77.365030]  ? bpf_skb_load_helper_8_no_cache+0x71/0x130
  [   77.365860]  __kasan_report.cold+0x37/0x7b
  [   77.366365]  ? bpf_skb_load_helper_8_no_cache+0x71/0x130
  [   77.366940]  kasan_report+0xe/0x20
  [   77.367295]  bpf_skb_load_helper_8_no_cache+0x71/0x130
  [   77.367821]  ? bpf_skb_load_helper_8+0xf0/0xf0
  [   77.368278]  ? mark_lock+0xa3/0x9b0
  [   77.368641]  ? kvm_sched_clock_read+0x14/0x30
  [   77.369096]  ? sched_clock+0x5/0x10
  [   77.369460]  ? sched_clock_cpu+0x18/0x110
  [   77.369876]  ? bpf_skb_load_helper_8+0xf0/0xf0
  [   77.370330]  ___bpf_prog_run+0x16c0/0x28f0
  [   77.370755]  __bpf_prog_run32+0x83/0xc0
  [   77.371153]  ? __bpf_prog_run64+0xc0/0xc0
  [   77.371568]  ? match_held_lock+0x1b/0x230
  [   77.371984]  ? rcu_read_lock_held+0xa1/0xb0
  [   77.372416]  ? rcu_is_watching+0x34/0x50
  [   77.372826]  sk_filter_trim_cap+0x17c/0x4d0
  [   77.373259]  ? sock_kzfree_s+0x40/0x40
  [   77.373648]  ? __get_filter+0x150/0x150
  [   77.374059]  ? skb_copy_datagram_from_iter+0x80/0x280
  [   77.374581]  ? do_raw_spin_unlock+0xa5/0x140
  [   77.375025]  unix_dgram_sendmsg+0x33a/0xa70
  [   77.375459]  ? do_raw_spin_lock+0x1d0/0x1d0
  [   77.375893]  ? unix_peer_get+0xa0/0xa0
  [   77.376287]  ? __fget_light+0xa4/0xf0
  [   77.376670]  __sys_sendto+0x265/0x280
  [   77.377056]  ? __ia32_sys_getpeername+0x50/0x50
  [   77.377523]  ? lock_downgrade+0x350/0x350
  [   77.377940]  ? __sys_setsockopt+0x2a6/0x2c0
  [   77.378374]  ? sock_read_iter+0x240/0x240
  [   77.378789]  ? __sys_socketpair+0x22a/0x300
  [   77.379221]  ? __ia32_sys_socket+0x50/0x50
  [   77.379649]  ? mark_held_locks+0x1d/0x90
  [   77.380059]  ? trace_hardirqs_on_thunk+0x1a/0x1c
  [   77.380536]  __x64_sys_sendto+0x74/0x90
  [   77.380938]  do_syscall_64+0x68/0x2a0
  [   77.381324]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [   77.381878] RIP: 0033:0x44c070
  [...]

After further debugging, turns out while in case of other helper functions
we disallow passing modified ctx, the special case of ld/abs/ind instruction
which has similar semantics (except r6 being the ctx argument) is missing
such check. Modified ctx is impossible here as bpf_skb_load_helper_8_no_cache()
and others are expecting skb fields in original position, hence, add
check_ctx_reg() to reject any modified ctx. Issue was first introduced back
in f1174f77b5 ("bpf/verifier: rework value tracking").

Fixes: f1174f77b5 ("bpf/verifier: rework value tracking")
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200106215157.3553-1-daniel@iogearbox.net
2020-01-06 14:19:47 -08:00
Roman Gushchin
e10360f815 bpf: cgroup: prevent out-of-order release of cgroup bpf
Before commit 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
cgroup bpf structures were released with
corresponding cgroup structures. It guaranteed the hierarchical order
of destruction: children were always first. It preserved attached
programs from being released before their propagated copies.

But with cgroup auto-detachment there are no such guarantees anymore:
cgroup bpf is released as soon as the cgroup is offline and there are
no live associated sockets. It means that an attached program can be
detached and released, while its propagated copy is still living
in the cgroup subtree. This will obviously lead to an use-after-free
bug.

To reproduce the issue the following script can be used:

  #!/bin/bash

  CGROOT=/sys/fs/cgroup

  mkdir -p ${CGROOT}/A ${CGROOT}/B ${CGROOT}/A/C
  sleep 1

  ./test_cgrp2_attach ${CGROOT}/A egress &
  A_PID=$!
  ./test_cgrp2_attach ${CGROOT}/B egress &
  B_PID=$!

  echo $$ > ${CGROOT}/A/C/cgroup.procs
  iperf -s &
  S_PID=$!
  iperf -c localhost -t 100 &
  C_PID=$!

  sleep 1

  echo $$ > ${CGROOT}/B/cgroup.procs
  echo ${S_PID} > ${CGROOT}/B/cgroup.procs
  echo ${C_PID} > ${CGROOT}/B/cgroup.procs

  sleep 1

  rmdir ${CGROOT}/A/C
  rmdir ${CGROOT}/A

  sleep 1

  kill -9 ${S_PID} ${C_PID} ${A_PID} ${B_PID}

On the unpatched kernel the following stacktrace can be obtained:

[   33.619799] BUG: unable to handle page fault for address: ffffbdb4801ab002
[   33.620677] #PF: supervisor read access in kernel mode
[   33.621293] #PF: error_code(0x0000) - not-present page
[   33.622754] Oops: 0000 [#1] SMP NOPTI
[   33.623202] CPU: 0 PID: 601 Comm: iperf Not tainted 5.5.0-rc2+ #23
[   33.625545] RIP: 0010:__cgroup_bpf_run_filter_skb+0x29f/0x3d0
[   33.635809] Call Trace:
[   33.636118]  ? __cgroup_bpf_run_filter_skb+0x2bf/0x3d0
[   33.636728]  ? __switch_to_asm+0x40/0x70
[   33.637196]  ip_finish_output+0x68/0xa0
[   33.637654]  ip_output+0x76/0xf0
[   33.638046]  ? __ip_finish_output+0x1c0/0x1c0
[   33.638576]  __ip_queue_xmit+0x157/0x410
[   33.639049]  __tcp_transmit_skb+0x535/0xaf0
[   33.639557]  tcp_write_xmit+0x378/0x1190
[   33.640049]  ? _copy_from_iter_full+0x8d/0x260
[   33.640592]  tcp_sendmsg_locked+0x2a2/0xdc0
[   33.641098]  ? sock_has_perm+0x10/0xa0
[   33.641574]  tcp_sendmsg+0x28/0x40
[   33.641985]  sock_sendmsg+0x57/0x60
[   33.642411]  sock_write_iter+0x97/0x100
[   33.642876]  new_sync_write+0x1b6/0x1d0
[   33.643339]  vfs_write+0xb6/0x1a0
[   33.643752]  ksys_write+0xa7/0xe0
[   33.644156]  do_syscall_64+0x5b/0x1b0
[   33.644605]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by grabbing a reference to the bpf structure of each ancestor
on the initialization of the cgroup bpf structure, and dropping the
reference at the end of releasing the cgroup bpf structure.

This will restore the hierarchical order of cgroup bpf releasing,
without adding any operations on hot paths.

Thanks to Josef Bacik for the debugging and the initial analysis of
the problem.

Fixes: 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-01-06 14:00:30 -08:00
Ingo Molnar
31c7ac388a Linux 5.5-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4SYegeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG4m4H+QGCUN8SXN+2B+0/
 BfzOf7PFoKzAx3NwDbJQIZqhSl+Zfa4n3VGPEF8sXsvoQgdYvuJnS/5JiAZ9iRIH
 HAfFzegzQ3mCl8Du+SqCvQKs2Jt4OMCX62KGRebRBhpoKfZdwmN7n7pn9lWO771K
 9rxTpeItXhmK46jOFRbi5oyQfmkfSfyUN1b9CB53FXFS+ZDkDNA7QQiIYnKOD7SZ
 RrL7czhZ580QOC61qOlnz1GIhRzvU5SXg4OtuI3YfoOJRY5FKC3YtOgLReT0vPs+
 vEhAyP93upVXIhqm10WHNjd4t4a45Vy5ff64uFsQ9QV4nnqsC2C70YwWbVDdtz/W
 Lm0mvE8=
 =NECs
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc5' into locking/kcsan, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-06 07:51:15 +01:00
Shakeel Butt
84029fd04c memcg: account security cred as well to kmemcg
The cred_jar kmem_cache is already memcg accounted in the current kernel
but cred->security is not.  Account cred->security to kmemcg.

Recently we saw high root slab usage on our production and on further
inspection, we found a buggy application leaking processes.  Though that
buggy application was contained within its memcg but we observe much
more system memory overhead, couple of GiBs, during that period.  This
overhead can adversely impact the isolation on the system.

One source of high overhead we found was cred->security objects, which
have a lifetime of at least the life of the process which allocated
them.

Link: http://lkml.kernel.org/r/20191205223721.40034-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-04 13:55:09 -08:00
Colin Ian King
72879ee0c5 tracing: Fix indentation issue
There is a declaration that is indented one level too deeply, remove
the extraneous tab.

Link: http://lkml.kernel.org/r/20191221154825.33073-1-colin.king@canonical.com

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-03 15:20:46 -05:00
Linus Torvalds
d9c82fd8c8 for-linus-2020-01-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXg9C5wAKCRCRxhvAZXjc
 oiZXAPsGFXyDCWlKnShBpKufdFh6XugADlyZK0Si2ISWQoJJsgD/Ri1g3zg6V7YC
 HBG0sz8+vSk/Ys55yDQz+K1d1MTkdQ4=
 =8uQe
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fixes from Christian Brauner:
 "Here are two fixes:

   - Panic earlier when global init exits to generate useable coredumps.

     Currently, when global init and all threads in its thread-group
     have exited we panic via:

       do_exit()
       -> exit_notify()
          -> forget_original_parent()
             -> find_child_reaper()

     This makes it hard to extract a useable coredump for global init
     from a kernel crashdump because by the time we panic exit_mm() will
     have already released global init's mm. We now panic slightly
     earlier. This has been a problem in certain environments such as
     Android.

   - Fix a race in assigning and reading taskstats for thread-groups
     with more than one thread.

     This patch has been waiting for quite a while since people
     disagreed on what the correct fix was at first"

* tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  exit: panic before exit_mm() on global init exit
  taskstats: fix data-race
2020-01-03 11:17:14 -08:00
Kaitao Cheng
50f9ad607e kernel/trace: Fix do not unregister tracepoints when register sched_migrate_task fail
In the function, if register_trace_sched_migrate_task() returns error,
sched_switch/sched_wakeup_new/sched_wakeup won't unregister. That is
why fail_deprobe_sched_switch was added.

Link: http://lkml.kernel.org/r/20191231133530.2794-1-pilgrimtao@gmail.com

Cc: stable@vger.kernel.org
Fixes: 478142c39c ("tracing: do not grab lock in wakeup latency function tracing")
Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-03 11:43:03 -05:00
Wen Yang
e31f7939c1 ftrace: Avoid potential division by zero in function profiler
The ftrace_profile->counter is unsigned long and
do_div truncates it to 32 bits, which means it can test
non-zero and be truncated to zero for division.
Fix this issue by using div64_ul() instead.

Link: http://lkml.kernel.org/r/20200103030248.14516-1-wenyang@linux.alibaba.com

Cc: stable@vger.kernel.org
Fixes: e330b3bcd8 ("tracing: Show sample std dev in function profiling")
Fixes: 34886c8bc5 ("tracing: add average time in function to function profiler")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-02 22:14:57 -05:00
Steven Rostedt (VMware)
b8299d362d tracing: Have stack tracer compile when MCOUNT_INSN_SIZE is not defined
On some archs with some configurations, MCOUNT_INSN_SIZE is not defined, and
this makes the stack tracer fail to compile. Just define it to zero in this
case.

Link: https://lore.kernel.org/r/202001020219.zvE3vsty%lkp@intel.com

Cc: stable@vger.kernel.org
Fixes: 4df297129f ("tracing: Remove most or all of stack tracer stack size from stack_max_size")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-02 22:04:07 -05:00
Steven Rostedt (VMware)
d2ccbccb54 tracing: Define MCOUNT_INSN_SIZE when not defined without direct calls
In order to handle direct calls along side of function graph tracer, a check
is made to see if the address being traced by the function graph tracer is a
direct call or not. To get the address used by direct callers, the return
address is subtracted by MCOUNT_INSN_SIZE.

For some archs with certain configurations, MCOUNT_INSN_SIZE is undefined
here. But these should not be using direct calls anyway. Just define
MCOUNT_INSN_SIZE to zero in this case.

Link: https://lore.kernel.org/r/202001020219.zvE3vsty%lkp@intel.com

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: ff205766db ("ftrace: Fix function_graph tracer interaction with BPF trampoline")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-02 21:56:44 -05:00
Linus Torvalds
bf6dd9a58e Fixes for seccomp_notify_ioctl uapi sanity
- Fix samples and selftests to zero passed-in buffer (Sargun Dhillon)
 - Enforce zeroed buffer checking (Sargun Dhillon)
 - Verify buffer sanity check in selftest (Sargun Dhillon)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl4OX5wWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJtJZD/4iLG7mOUQNXdcPidjcIMO/tjST
 UzW+9Cb3buePgmCHO9v1TKGL29fVwP5TkuxdrBYDGrJ4rEYANSDX0aNmpHsO8/8M
 2/B/Lo/f9cxFgoKI4QLY2XZ1YR+zkH980mtIG7ZcpYjsNl5AwmT27m2lo6iE7J+x
 7rsaTRPFmUfgbblB6Z5gNwwATudrWJgq066lY2fg3GADP81s6lGQB+ul8rtu84ME
 mTvtb3w6piJb3E+DeYY8p4ykyiewDuYqZWDY+dvWi3kRDjNWX+yFJaPW0YNhM+yh
 HaMXnbuh6gDyCbeUHorC9ypQhJJKzEWCUW8e60BND+fOFCdKMa1AdCtlXWHjrXDQ
 x9hUgQ3UhEedYtQeYtYuoltf0W8Ft4wAapxKJJRegYPQ0RPOgcfdAg4UquusCaLo
 fWK2Hy4XFrxOwISqsFUczUVkBcXl+w0GGH59pSyTImgoQPlTpbVP6f7Axbl+qpKo
 pqOe4bO8curLGlZpdBN6syR5Ik0bizQK0kDZeo+wPmEClp/1zJWMJ4MTP4T80rxY
 74DiQyfNH2iHfsOkdfHCsJC3jM8nmdKk5wMqtrAiIoT8/vdTBgumHrnmkORWFf8c
 R/NHCCLVs9q9sKV0s+VUR3OM2RjqpG1Wo/EBjTlbDQnibC5qdha8X2uVJWIHiF61
 ZgwZ9BoKV/+mKSqTAQ==
 =WgBI
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fixes from Kees Cook:
 "Fixes for seccomp_notify_ioctl uapi sanity from Sargun Dhillon.

  The bulk of this is fixing the surrounding samples and selftests so
  that seccomp can correctly validate the seccomp_notify_ioctl buffer as
  being initially zeroed.

  Summary:

   - Fix samples and selftests to zero passed-in buffer

   - Enforce zeroed buffer checking

   - Verify buffer sanity check in selftest"

* tag 'seccomp-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: Catch garbage on SECCOMP_IOCTL_NOTIF_RECV
  seccomp: Check that seccomp_notif is zeroed out by the user
  selftests/seccomp: Zero out seccomp_notif
  samples/seccomp: Zero out members based on seccomp_notif_sizes
2020-01-02 16:42:10 -08:00
Steven Rostedt (VMware)
02f4e01ce7 tracing: Initialize val to zero in parse_entry of inject code
gcc produces a variable may be uninitialized warning for "val" in
parse_entry(). This is really a false positive, but the code is subtle
enough to just initialize val to zero and it's not a fast path to worry
about it.

Marked for stable to remove the warning in the stable trees as well.

Cc: stable@vger.kernel.org
Fixes: 6c3edaf9fd ("tracing: Introduce trace event injection")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-01-02 19:04:57 -05:00
Sargun Dhillon
2882d53c9c seccomp: Check that seccomp_notif is zeroed out by the user
This patch is a small change in enforcement of the uapi for
SECCOMP_IOCTL_NOTIF_RECV ioctl. Specifically, the datastructure which
is passed (seccomp_notif) must be zeroed out. Previously any of its
members could be set to nonsense values, and we would ignore it.

This ensures all fields are set to their zero value.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Tycho Andersen <tycho@tycho.ws>
Link: https://lore.kernel.org/r/20191229062451.9467-2-sargun@sargun.me
Fixes: 6a21cc50f0 ("seccomp: add a return code to trap to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-01-02 13:03:45 -08:00
John Ogness
def97da136 printk: fix exclusive_console replaying
Commit f92b070f2d ("printk: Do not miss new messages when replaying
the log") introduced a new variable @exclusive_console_stop_seq to
store when an exclusive console should stop printing. It should be
set to the @console_seq value at registration. However, @console_seq
is previously set to @syslog_seq so that the exclusive console knows
where to begin. This results in the exclusive console immediately
reactivating all the other consoles and thus repeating the messages
for those consoles.

Set @console_seq after @exclusive_console_stop_seq has stored the
current @console_seq value.

Fixes: f92b070f2d ("printk: Do not miss new messages when replaying the log")
Link: http://lkml.kernel.org/r/20191219115322.31160-1-john.ogness@linutronix.de
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-01-02 16:15:04 +01:00
David S. Miller
31d518f35e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Simple overlapping changes in bpf land wrt. bpf_helper_defs.h
handling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-31 13:37:13 -08:00
Vladis Dronov
a33121e548 ptp: fix the race between the release of ptp_clock and cdev
In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
device is removed, closing this file leads to a race. This reproduces
easily in a kvm virtual machine:

ts# cat openptp0.c
int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
ts# uname -r
5.5.0-rc3-46cf053e
ts# cat /proc/cmdline
... slub_debug=FZP
ts# modprobe ptp_kvm
ts# ./openptp0 &
[1] 670
opened /dev/ptp0, sleeping 10s...
ts# rmmod ptp_kvm
ts# ls /dev/ptp*
ls: cannot access '/dev/ptp*': No such file or directory
ts# ...woken up
[   48.010809] general protection fault: 0000 [#1] SMP
[   48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
[   48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
[   48.016270] RIP: 0010:module_put.part.0+0x7/0x80
[   48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
[   48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
[   48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
[   48.019470] ...                                              ^^^ a slub poison
[   48.023854] Call Trace:
[   48.024050]  __fput+0x21f/0x240
[   48.024288]  task_work_run+0x79/0x90
[   48.024555]  do_exit+0x2af/0xab0
[   48.024799]  ? vfs_write+0x16a/0x190
[   48.025082]  do_group_exit+0x35/0x90
[   48.025387]  __x64_sys_exit_group+0xf/0x10
[   48.025737]  do_syscall_64+0x3d/0x130
[   48.026056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   48.026479] RIP: 0033:0x7f53b12082f6
[   48.026792] ...
[   48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
[   48.045001] Fixing recursive fault but reboot is needed!

This happens in:

static void __fput(struct file *file)
{   ...
    if (file->f_op->release)
        file->f_op->release(inode, file); <<< cdev is kfree'd here
    if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
             !(mode & FMODE_PATH))) {
        cdev_put(inode->i_cdev); <<< cdev fields are accessed here

Namely:

__fput()
  posix_clock_release()
    kref_put(&clk->kref, delete_clock) <<< the last reference
      delete_clock()
        delete_ptp_clock()
          kfree(ptp) <<< cdev is embedded in ptp
  cdev_put
    module_put(p->owner) <<< *p is kfree'd, bang!

Here cdev is embedded in posix_clock which is embedded in ptp_clock.
The race happens because ptp_clock's lifetime is controlled by two
refcounts: kref and cdev.kobj in posix_clock. This is wrong.

Make ptp_clock's sysfs device a parent of cdev with cdev_device_add()
created especially for such cases. This way the parent device with its
ptp_clock is not released until all references to the cdev are released.
This adds a requirement that an initialized but not exposed struct
device should be provided to posix_clock_register() by a caller instead
of a simple dev_t.

This approach was adopted from the commit 72139dfa24 ("watchdog: Fix
the race between the release of watchdog_core_data and cdev"). See
details of the implementation in the commit 233ed09d7f ("chardev: add
helper function to register char devs with a struct device").

Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#u
Analyzed-by: Stephen Johnston <sjohnsto@redhat.com>
Analyzed-by: Vern Lovejoy <vlovejoy@redhat.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:19:27 -08:00
Ingo Molnar
28336be568 Linux 5.5-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4JNtkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGdN0H/3UI6LHOx1ol3/7L
 TwgMibg2pNxNU05bowDjQt92+Hgj9JM0TeFBsfr5hLaeKBgeVCPr5xK/vH09NlKu
 otVGbhBLpl9OAUu9znTfbt4bcqhJKlr/K0mS5e1vPsXvZ3wdHS27trwjgyu16/pP
 NJwkcs5/VRYVC/SrZay2NvheKN+DoGSd4+ZlJprwtAAVMdbEvoaGqRLGKLfLeDMc
 Z04w8AKhnKIxSkt+eEmuW9+pAQJUAkk4QVjixcJe8q0QpA1XIj965yvE8+XpjbLo
 eFxupmZq4S2JdCjsa+iBferJ5juR1FVhbHSbZtLsTtkPVegI9ug911WQ+KiCqErI
 VkiKUl8=
 =rNsn
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc4' into locking/kcsan, to resolve conflicts

Conflicts:
	init/main.c
	lib/Kconfig.debug

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-30 08:10:51 +01:00
David S. Miller
2bbc078f81 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-12-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 127 non-merge commits during the last 17 day(s) which contain
a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

There are three merge conflicts. Conflicts and resolution looks as follows:

1) Merge conflict in net/bpf/test_run.c:

There was a tree-wide cleanup c593642c8b ("treewide: Use sizeof_field() macro")
which gets in the way with b590cb5f80 ("bpf: Switch to offsetofend in
BPF_PROG_TEST_RUN"):

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                             sizeof_field(struct __sk_buff, priority),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
  >>>>>>> 7c8dce4b16

There are a few occasions that look similar to this. Always take the chunk with
offsetofend(). Note that there is one where the fields differ in here:

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                             sizeof_field(struct __sk_buff, tstamp),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
  >>>>>>> 7c8dce4b16

Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
850a88cc40 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

  <<<<<<< HEAD
          if (is_13b_check(off, insn))
                  return -1;
          emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
  =======
          emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
  >>>>>>> 7c8dce4b16

Result should look like:

          emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

3) Merge conflict in arch/riscv/include/asm/pgtable.h:

  <<<<<<< HEAD
  =======
  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  #define vmemmap         ((struct page *)VMEMMAP_START)

  >>>>>>> 7c8dce4b16

Only take the BPF_* defines from there and move them higher up in the
same file. Remove the rest from the chunk. The VMALLOC_* etc defines
got moved via 01f52e16b8 ("riscv: define vmemmap before pfn_to_page
calls"). Result:

  [...]
  #define __S101  PAGE_READ_EXEC
  #define __S110  PAGE_SHARED_EXEC
  #define __S111  PAGE_SHARED_EXEC

  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  [...]

Let me know if there are any other issues.

Anyway, the main changes are:

1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
   to a provided BPF object file. This provides an alternative, simplified API
   compared to standard libbpf interaction. Also, add libbpf extern variable
   resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
   generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
   add various BPF riscv JIT improvements, from Björn Töpel.

3) Extend bpftool to allow matching BPF programs and maps by name,
   from Paul Chaignon.

4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
   flag for allowing updates without service interruption, from Andrey Ignatov.

5) Cleanup and simplification of ring access functions for AF_XDP with a
   bonus of 0-5% performance improvement, from Magnus Karlsson.

6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
   audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.

7) Move and extend test_select_reuseport into BPF program tests under
   BPF selftests, from Jakub Sitnicki.

8) Various BPF sample improvements for xdpsock for customizing parameters
   to set up and benchmark AF_XDP, from Jay Jayatheerthan.

9) Improve libbpf to provide a ulimit hint on permission denied errors.
   Also change XDP sample programs to attach in driver mode by default,
   from Toke Høiland-Jørgensen.

10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

13) Minor misc improvements from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 14:20:10 -08:00
Ingo Molnar
46f5cfc13d Merge branch 'core/kprobes' into perf/core, to pick up a completed branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:43:08 +01:00
Waiman Long
d91f305726 locking/lockdep: Fix buffer overrun problem in stack_trace[]
If the lockdep code is really running out of the stack_trace entries,
it is likely that buffer overrun can happen and the data immediately
after stack_trace[] will be corrupted.

If there is less than LOCK_TRACE_SIZE_IN_LONGS entries left before
the call to save_trace(), the max_entries computation will leave it
with a very large positive number because of its unsigned nature. The
subsequent call to stack_trace_save() will then corrupt the data after
stack_trace[]. Fix that by changing max_entries to a signed integer
and check for negative value before calling stack_trace_save().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 12593b7467 ("locking/lockdep: Reduce space occupied by stack traces")
Link: https://lkml.kernel.org/r/20191220135128.14876-1-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:32 +01:00
Qais Yousef
804d402fb6 sched/rt: Make RT capacity-aware
Capacity Awareness refers to the fact that on heterogeneous systems
(like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence
when placing tasks we need to be aware of this difference of CPU
capacities.

In such scenarios we want to ensure that the selected CPU has enough
capacity to meet the requirement of the running task. Enough capacity
means here that capacity_orig_of(cpu) >= task.requirement.

The definition of task.requirement is dependent on the scheduling class.

For CFS, utilization is used to select a CPU that has >= capacity value
than the cfs_task.util.

	capacity_orig_of(cpu) >= cfs_task.util

DL isn't capacity aware at the moment but can make use of the bandwidth
reservation to implement that in a similar manner CFS uses utilization.
The following patchset implements that:

https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/

	capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime

For RT we don't have a per task utilization signal and we lack any
information in general about what performance requirement the RT task
needs. But with the introduction of uclamp, RT tasks can now control
that by setting uclamp_min to guarantee a minimum performance point.

ATM the uclamp value are only used for frequency selection; but on
heterogeneous systems this is not enough and we need to ensure that the
capacity of the CPU is >= uclamp_min. Which is what implemented here.

	capacity_orig_of(cpu) >= rt_task.uclamp_min

Note that by default uclamp.min is 1024, which means that RT tasks will
always be biased towards the big CPUs, which make for a better more
predictable behavior for the default case.

Must stress that the bias acts as a hint rather than a definite
placement strategy. For example, if all big cores are busy executing
other RT tasks we can't guarantee that a new RT task will be placed
there.

On non-heterogeneous systems the original behavior of RT should be
retained. Similarly if uclamp is not selected in the config.

[ mingo: Minor edits to comments. ]

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:10 +01:00
Valentin Schneider
1d42509e47 sched/fair: Make EAS wakeup placement consider uclamp restrictions
task_fits_capacity() has just been made uclamp-aware, and
find_energy_efficient_cpu() needs to go through the same treatment.

Things are somewhat different here however - using the task max clamp isn't
sufficient. Consider the following setup:

  The target runqueue, rq:
    rq.cpu_capacity_orig = 512
    rq.cfs.avg.util_avg = 200
    rq.uclamp.max = 768 // the max p.uclamp.max of all enqueued p's is 768

  The waking task, p (not yet enqueued on rq):
    p.util_est = 600
    p.uclamp.max = 100

Now, consider the following code which doesn't use the rq clamps:

  util = uclamp_task_util(p);
  // Does the task fit in the spare CPU capacity?
  cpu = cpu_of(rq);
  fits_capacity(util, cpu_capacity(cpu) - cpu_util(cpu))

This would lead to:

  util = 100;
  fits_capacity(100, 512 - 200)

fits_capacity() would return true. However, enqueuing p on that CPU *will*
cause it to become overutilized since rq clamp values are max-aggregated,
so we'd remain with

  rq.uclamp.max = 768

which comes from the other tasks already enqueued on rq. Thus, we could
select a high enough frequency to reach beyond 0.8 * 512 utilization
(== overutilized) after enqueuing p on rq. What find_energy_efficient_cpu()
needs here is uclamp_rq_util_with() which lets us peek at the future
utilization landscape, including rq-wide uclamp values.

Make find_energy_efficient_cpu() use uclamp_rq_util_with() for its
fits_capacity() check. This is in line with what compute_energy() ends up
using for estimating utilization.

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-6-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:09 +01:00
Valentin Schneider
a7008c07a5 sched/fair: Make task_fits_capacity() consider uclamp restrictions
task_fits_capacity() drives CPU selection at wakeup time, and is also used
to detect misfit tasks. Right now it does so by comparing task_util_est()
with a CPU's capacity, but doesn't take into account uclamp restrictions.

There's a few interesting uses that can come out of doing this. For
instance, a low uclamp.max value could prevent certain tasks from being
flagged as misfit tasks, so they could merrily remain on low-capacity CPUs.
Similarly, a high uclamp.min value would steer tasks towards high capacity
CPUs at wakeup (and, should that fail, later steered via misfit balancing),
so such "boosted" tasks would favor CPUs of higher capacity.

Introduce uclamp_task_util() and make task_fits_capacity() use it.

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-5-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:09 +01:00
Valentin Schneider
d2b58a286e sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
The current helper returns (CPU) rq utilization with uclamp restrictions
taken into account. A uclamp task utilization helper would be quite
helpful, but this requires some renaming.

Prepare the code for the introduction of a uclamp_task_util() by renaming
the existing uclamp_util_with() to uclamp_rq_util_with().

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:08 +01:00
Valentin Schneider
686516b55e sched/uclamp: Make uclamp util helpers use and return UL values
Vincent pointed out recently that the canonical type for utilization
values is 'unsigned long'. Internally uclamp uses 'unsigned int' values for
cache optimization, but this doesn't have to be exported to its users.

Make the uclamp helpers that deal with utilization use and return unsigned
long values.

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:08 +01:00
Valentin Schneider
59fe675248 sched/uclamp: Remove uclamp_util()
The sole user of uclamp_util(), schedutil_cpu_util(), was made to use
uclamp_util_with() instead in commit:

  af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")

From then on, uclamp_util() has remained unused. Being a simple wrapper
around uclamp_util_with(), we can get rid of it and win back a few lines.

Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211113851.24241-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:07 +01:00
Viresh Kumar
17346452b2 sched/fair: Make sched-idle CPU selection consistent throughout
There are instances where we keep searching for an idle CPU despite
already having a sched-idle CPU (in find_idlest_group_cpu(),
select_idle_smt() and select_idle_cpu() and then there are places where
we don't necessarily do that and return a sched-idle CPU as soon as we
find one (in select_idle_sibling()). This looks a bit inconsistent and
it may be worth having the same policy everywhere.

On the other hand, choosing a sched-idle CPU over a idle one shall be
beneficial from performance and power point of view as well, as we don't
need to get the CPU online from a deep idle state which wastes quite a
lot of time and energy and delays the scheduling of the newly woken up
task.

This patch tries to simplify code around sched-idle CPU selection and
make it consistent throughout.

Testing is done with the help of rt-app on hikey board (ARM64 octa-core,
2 clusters, 0-3 and 4-7). The cpufreq governor was set to performance to
avoid any side affects from CPU frequency. Following are the tests
performed:

Test 1: 1-cfs-task:

 A single SCHED_NORMAL task is pinned to CPU5 which runs for 2333 us
 out of 7777 us (so gives time for the cluster to go in deep idle
 state).

Test 2: 1-cfs-1-idle-task:

 A single SCHED_NORMAL task is pinned on CPU5 and single SCHED_IDLE
 task is pinned on CPU6 (to make sure cluster 1 doesn't go in deep idle
 state).

Test 3: 1-cfs-8-idle-task:

 A single SCHED_NORMAL task is pinned on CPU5 and eight SCHED_IDLE
 tasks are created which run forever (not pinned anywhere, so they run
 on all CPUs). Checked with kernelshark that as soon as NORMAL task
 sleeps, the SCHED_IDLE task starts running on CPU5.

And here are the results on mean latency (in us), using the "st" tool.

  $ st 1-cfs-task/rt-app-cfs_thread-0.log
  N       min     max     sum     mean    stddev
  642     90      592     197180  307.134 109.906

  $ st 1-cfs-1-idle-task/rt-app-cfs_thread-0.log
  N       min     max     sum     mean    stddev
  642     67      311     113850  177.336 41.4251

  $ st 1-cfs-8-idle-task/rt-app-cfs_thread-0.log
  N       min     max     sum     mean    stddev
  643     29      173     41364   64.3297 13.2344

The mean latency when we need to:

 - wakeup from deep idle state is 307 us.
 - wakeup from shallow idle state is 177 us.
 - preempt a SCHED_IDLE task is 64 us.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/b90cbcce608cef4e02a7bbfe178335f76d201bab.1573728344.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:07 +01:00
Qian Cai
53a23364b6 sched/core: Remove unused variable from set_user_nice()
This commit left behind an unused variable:

  5443a0be61 ("sched: Use fair:prio_changed() instead of ad-hoc implementation") left behind an unused variable.

  kernel/sched/core.c: In function 'set_user_nice':
  kernel/sched/core.c:4507:16: warning: variable 'delta' set but not used
    int old_prio, delta;
                ^~~~~

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 5443a0be61 ("sched: Use fair:prio_changed() instead of ad-hoc implementation")
Link: https://lkml.kernel.org/r/20191219140314.1252-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:42:06 +01:00
Ingo Molnar
1e5f8a3085 Linux 5.5-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4AEiYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGR3sH/ixrBBYUVyjRPOxS
 ce4iVoTqphGSoAzq/3FA1YZZOPQ/Ep0NXL4L2fTGxmoiqIiuy8JPp07/NKbHQjj1
 Rt6PGm6cw2pMJHaK9gRdlTH/6OyXkp06OkH1uHqKYrhPnpCWDnj+i2SHAX21Hr1y
 oBQh4/XKvoCMCV96J2zxRsLvw8OkQFE0ouWWfj6LbpXIsmWZ++s0OuaO1cVdP/oG
 j+j2Voi3B3vZNQtGgJa5W7YoZN5Qk4ZIj9bMPg7bmKRd3wNB228AiJH2w68JWD/I
 jCA+JcITilxC9ud96uJ6k7SMS2ufjQlnP0z6Lzd0El1yGtHYRcPOZBgfOoPU2Euf
 33WGSyI=
 =iEwx
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc3' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:41:37 +01:00
Mathieu Desnoyers
66528a4575 rseq: Reject unknown flags on rseq unregister
It is preferrable to reject unknown flags within rseq unregistration
rather than to ignore them. It is an oversight caused by the fact that
the check for unknown flags is after the rseq unregister flag check.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191211161713.4490-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:41:20 +01:00
Daniel Borkmann
f54c7898ed bpf: Fix precision tracking for unbounded scalars
Anatoly has been fuzzing with kBdysch harness and reported a hang in one
of the outcomes. Upon closer analysis, it turns out that precise scalar
value tracking is missing a few precision markings for unknown scalars:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = 0
  1: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (35) if r0 >= 0xf72e goto pc+0
  --> only follow fallthrough
  2: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  2: (35) if r0 >= 0x80fe0000 goto pc+0
  --> only follow fallthrough
  3: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  3: (14) w0 -= -536870912
  4: R0_w=invP536870912 R1=ctx(id=0,off=0,imm=0) R10=fp0
  4: (0f) r1 += r0
  5: R0_w=invP536870912 R1_w=inv(id=0) R10=fp0
  5: (55) if r1 != 0x104c1500 goto pc+0
  --> push other branch for later analysis
  R0_w=invP536870912 R1_w=inv273421568 R10=fp0
  6: R0_w=invP536870912 R1_w=inv273421568 R10=fp0
  6: (b7) r0 = 0
  7: R0=invP0 R1=inv273421568 R10=fp0
  7: (76) if w1 s>= 0xffffff00 goto pc+3
  --> only follow goto
  11: R0=invP0 R1=inv273421568 R10=fp0
  11: (95) exit
  6: R0_w=invP536870912 R1_w=inv(id=0) R10=fp0
  6: (b7) r0 = 0
  propagating r0
  7: safe
  processed 11 insns [...]

In the analysis of the second path coming after the successful exit above,
the path is being pruned at line 7. Pruning analysis found that both r0 are
precise P0 and both R1 are non-precise scalars and given prior path with
R1 as non-precise scalar succeeded, this one is therefore safe as well.

However, problem is that given condition at insn 7 in the first run, we only
followed goto and didn't push the other branch for later analysis, we've
never walked the few insns in there and therefore dead-code sanitation
rewrites it as goto pc-1, causing the hang depending on the skb address
hitting these conditions. The issue is that R1 should have been marked as
precise as well such that pruning enforces range check and conluded that new
R1 is not in range of old R1. In insn 4, we mark R1 (skb) as unknown scalar
via __mark_reg_unbounded() but not mark_reg_unbounded() and therefore
regs->precise remains as false.

Back in b5dc0163d8 ("bpf: precise scalar_value tracking"), this was not
the case since marking out of __mark_reg_unbounded() had this covered as well.
Once in both are set as precise in 4 as they should have been, we conclude
that given R1 was in prior fall-through path 0x104c1500 and now is completely
unknown, the check at insn 7 concludes that we need to continue walking.
Analysis after the fix:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = 0
  1: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (35) if r0 >= 0xf72e goto pc+0
  2: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  2: (35) if r0 >= 0x80fe0000 goto pc+0
  3: R0_w=invP0 R1=ctx(id=0,off=0,imm=0) R10=fp0
  3: (14) w0 -= -536870912
  4: R0_w=invP536870912 R1=ctx(id=0,off=0,imm=0) R10=fp0
  4: (0f) r1 += r0
  5: R0_w=invP536870912 R1_w=invP(id=0) R10=fp0
  5: (55) if r1 != 0x104c1500 goto pc+0
  R0_w=invP536870912 R1_w=invP273421568 R10=fp0
  6: R0_w=invP536870912 R1_w=invP273421568 R10=fp0
  6: (b7) r0 = 0
  7: R0=invP0 R1=invP273421568 R10=fp0
  7: (76) if w1 s>= 0xffffff00 goto pc+3
  11: R0=invP0 R1=invP273421568 R10=fp0
  11: (95) exit
  6: R0_w=invP536870912 R1_w=invP(id=0) R10=fp0
  6: (b7) r0 = 0
  7: R0_w=invP0 R1_w=invP(id=0) R10=fp0
  7: (76) if w1 s>= 0xffffff00 goto pc+3
  R0_w=invP0 R1_w=invP(id=0) R10=fp0
  8: R0_w=invP0 R1_w=invP(id=0) R10=fp0
  8: (a5) if r0 < 0x2007002a goto pc+0
  9: R0_w=invP0 R1_w=invP(id=0) R10=fp0
  9: (57) r0 &= -16316416
  10: R0_w=invP0 R1_w=invP(id=0) R10=fp0
  10: (a6) if w0 < 0x1201 goto pc+0
  11: R0_w=invP0 R1_w=invP(id=0) R10=fp0
  11: (95) exit
  11: R0=invP0 R1=invP(id=0) R10=fp0
  11: (95) exit
  processed 16 insns [...]

Fixes: 6754172c20 ("bpf: fix precision tracking in presence of bpf2bpf calls")
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191222223740.25297-1-daniel@iogearbox.net
2019-12-22 17:21:10 -08:00
Linus Torvalds
78bac77b52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso,
    including adding a missing ipv6 match description.

 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi
    Bhat.

 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet.

 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold.

 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien.

 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul
    Chaignon.

 7) Multicast MAC limit test is off by one in qede, from Manish Chopra.

 8) Fix established socket lookup race when socket goes from
    TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening
    RCU grace period. From Eric Dumazet.

 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet.

10) Fix active backup transition after link failure in bonding, from
    Mahesh Bandewar.

11) Avoid zero sized hash table in gtp driver, from Taehee Yoo.

12) Fix wrong interface passed to ->mac_link_up(), from Russell King.

13) Fix DSA egress flooding settings in b53, from Florian Fainelli.

14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost.

15) Fix double free in dpaa2-ptp code, from Ioana Ciornei.

16) Reject invalid MTU values in stmmac, from Jose Abreu.

17) Fix refcount leak in error path of u32 classifier, from Davide
    Caratti.

18) Fix regression causing iwlwifi firmware crashes on boot, from Anders
    Kaseorg.

19) Fix inverted return value logic in llc2 code, from Chan Shu Tak.

20) Disable hardware GRO when XDP is attached to qede, frm Manish
    Chopra.

21) Since we encode state in the low pointer bits, dst metrics must be
    at least 4 byte aligned, which is not necessarily true on m68k. Add
    annotations to fix this, from Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits)
  sfc: Include XDP packet headroom in buffer step size.
  sfc: fix channel allocation with brute force
  net: dst: Force 4-byte alignment of dst_metrics
  selftests: pmtu: fix init mtu value in description
  hv_netvsc: Fix unwanted rx_table reset
  net: phy: ensure that phy IDs are correctly typed
  mod_devicetable: fix PHY module format
  qede: Disable hardware gro when xdp prog is installed
  net: ena: fix issues in setting interrupt moderation params in ethtool
  net: ena: fix default tx interrupt moderation interval
  net/smc: unregister ib devices in reboot_event
  net: stmmac: platform: Fix MDIO init for platforms without PHY
  llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
  net: hisilicon: Fix a BUG trigered by wrong bytes_compl
  net: dsa: ksz: use common define for tag len
  s390/qeth: don't return -ENOTSUPP to userspace
  s390/qeth: fix promiscuous mode after reset
  s390/qeth: handle error due to unsupported transport mode
  cxgb4: fix refcount init for TC-MQPRIO offload
  tc-testing: initial tdc selftests for cls_u32
  ...
2019-12-22 09:54:33 -08:00
Linus Torvalds
b8e382a185 Various tracing fixes:
- Fix memory leak on error path of process_system_preds()
  - Lock inversion fix with updating tgid recording option
  - Fix histogram compare function on big endian machines
  - Fix histogram trigger function on big endian machines
  - Make trace_printk() irq sync on init for kprobe selftest correctness
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXf6MRxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlw6AQCny2YeASymmOjDqh9/G53UdhO539Y2
 oL/2nQ8B9T9KWgD6AmmohhbX+TS9l5Nwy2/bKmRgADZ7u+2XLM2f2mYR2Ag=
 =D7hI
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix memory leak on error path of process_system_preds()

 - Lock inversion fix with updating tgid recording option

 - Fix histogram compare function on big endian machines

 - Fix histogram trigger function on big endian machines

 - Make trace_printk() irq sync on init for kprobe selftest correctness

* tag 'trace-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix endianness bug in histogram trigger
  samples/trace_printk: Wait for IRQ work to finish
  tracing: Fix lock inversion in trace_event_enable_tgid_record()
  tracing: Have the histogram compare functions convert to u64 first
  tracing: Avoid memory leak in process_system_preds()
2019-12-21 15:16:56 -08:00
Sven Schnelle
fe6e096a5b tracing: Fix endianness bug in histogram trigger
At least on PA-RISC and s390 synthetic histogram triggers are failing
selftests because trace_event_raw_event_synth() always writes a 64 bit
values, but the reader expects a field->size sized value. On little endian
machines this doesn't hurt, but on big endian this makes the reader always
read zero values.

Link: http://lore.kernel.org/linux-trace-devel/20191218074427.96184-4-svens@linux.ibm.com

Cc: stable@vger.kernel.org
Fixes: 4b147936fa ("tracing: Add support for 'synthetic' events")
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-21 16:08:59 -05:00
Prateek Sood
3a53acf1d9 tracing: Fix lock inversion in trace_event_enable_tgid_record()
Task T2                             Task T3
trace_options_core_write()            subsystem_open()

 mutex_lock(trace_types_lock)           mutex_lock(event_mutex)

 set_tracer_flag()

   trace_event_enable_tgid_record()       mutex_lock(trace_types_lock)

    mutex_lock(event_mutex)

This gives a circular dependency deadlock between trace_types_lock and
event_mutex. To fix this invert the usage of trace_types_lock and
event_mutex in trace_options_core_write(). This keeps the sequence of
lock usage consistent.

Link: http://lkml.kernel.org/r/0101016eef175e38-8ca71caf-a4eb-480d-a1e6-6f0bbc015495-000000@us-west-2.amazonses.com

Cc: stable@vger.kernel.org
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-21 16:05:13 -05:00
Linus Torvalds
fd7a6d2b8f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes: a (rare) PSI crash fix, a CPU affinity related balancing
  fix, and a toning down of active migration attempts"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cfs: fix spurious active migration
  sched/fair: Fix find_idlest_group() to handle CPU affinity
  psi: Fix a division error in psi poll()
  sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
2019-12-21 10:52:10 -08:00
Linus Torvalds
c4ff10efe8 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes: a BTS fix, a PT NMI handling fix, a PMU sysfs fix and an
  SRCU annotation"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Add SRCU annotation for pmus list walk
  perf/x86/intel: Fix PT PMI handling
  perf/x86/intel/bts: Fix the use of page_private()
  perf/x86: Fix potential out-of-bounds access
2019-12-21 10:51:00 -08:00
chenqiwu
43cf75d964
exit: panic before exit_mm() on global init exit
Currently, when global init and all threads in its thread-group have exited
we panic via:
do_exit()
-> exit_notify()
   -> forget_original_parent()
      -> find_child_reaper()
This makes it hard to extract a useable coredump for global init from a
kernel crashdump because by the time we panic exit_mm() will have already
released global init's mm.
This patch moves the panic futher up before exit_mm() is called. As was the
case previously, we only panic when global init and all its threads in the
thread-group have exited.

Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
[christian.brauner@ubuntu.com: fix typo, rewrite commit message]
Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-12-21 16:48:01 +01:00
Alexandre Belloni
2a2ef473cc PM: sleep: Switch to rtc_time64_to_tm()/rtc_tm_to_time64()
Call the 64bit versions of rtc_tm time conversion to avoid the y2038 issue.

Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-20 09:58:08 +01:00
Andrey Ignatov
7dd68b3279 bpf: Support replacing cgroup-bpf program in MULTI mode
The common use-case in production is to have multiple cgroup-bpf
programs per attach type that cover multiple use-cases. Such programs
are attached with BPF_F_ALLOW_MULTI and can be maintained by different
people.

Order of programs usually matters, for example imagine two egress
programs: the first one drops packets and the second one counts packets.
If they're swapped the result of counting program will be different.

It brings operational challenges with updating cgroup-bpf program(s)
attached with BPF_F_ALLOW_MULTI since there is no way to replace a
program:

* One way to update is to detach all programs first and then attach the
  new version(s) again in the right order. This introduces an
  interruption in the work a program is doing and may not be acceptable
  (e.g. if it's egress firewall);

* Another way is attach the new version of a program first and only then
  detach the old version. This introduces the time interval when two
  versions of same program are working, what may not be acceptable if a
  program is not idempotent. It also imposes additional burden on
  program developers to make sure that two versions of their program can
  co-exist.

Solve the problem by introducing a "replace" mode in BPF_PROG_ATTACH
command for cgroup-bpf programs being attached with BPF_F_ALLOW_MULTI
flag. This mode is enabled by newly introduced BPF_F_REPLACE attach flag
and bpf_attr.replace_bpf_fd attribute to pass fd of the old program to
replace

That way user can replace any program among those attached with
BPF_F_ALLOW_MULTI flag without the problems described above.

Details of the new API:

* If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid
  descriptor of BPF program, BPF_PROG_ATTACH will return corresponding
  error (EINVAL or EBADF).

* If replace_bpf_fd has valid descriptor of BPF program but such a
  program is not attached to specified cgroup, BPF_PROG_ATTACH will
  return ENOENT.

BPF_F_REPLACE is introduced to make the user intent clear, since
replace_bpf_fd alone can't be used for this (its default value, 0, is a
valid fd). BPF_F_REPLACE also makes it possible to extend the API in the
future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed).

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Narkyiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com
2019-12-19 21:22:25 -08:00
Andrey Ignatov
9fab329d6a bpf: Remove unused new_flags in hierarchy_allows_attach()
new_flags is unused, remove it.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/2c49b30ab750f93cfef04a1e40b097d70c3a39a1.1576741281.git.rdna@fb.com
2019-12-19 21:22:25 -08:00
Andrey Ignatov
1020c1f24a bpf: Simplify __cgroup_bpf_attach
__cgroup_bpf_attach has a lot of identical code to handle two scenarios:
BPF_F_ALLOW_MULTI is set and unset.

Simplify it by splitting the two main steps:

* First, the decision is made whether a new bpf_prog_list entry should
  be allocated or existing entry should be reused for the new program.
  This decision is saved in replace_pl pointer;

* Next, replace_pl pointer is used to handle both possible states of
  BPF_F_ALLOW_MULTI flag (set / unset) instead of doing similar work for
  them separately.

This splitting, in turn, allows to make further simplifications:

* The check for attaching same program twice in BPF_F_ALLOW_MULTI mode
  can be done before allocating cgroup storage, so that if user tries to
  attach same program twice no alloc/free happens as it was before;

* pl_was_allocated becomes redundant so it's removed.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/c6193db6fe630797110b0d3ff06c125d093b834c.1576741281.git.rdna@fb.com
2019-12-19 21:22:25 -08:00
Björn Töpel
cdfafe98ca xdp: Make cpumap flush_list common for all map instances
The cpumap flush list is used to track entries that need to flushed
from via the xdp_do_flush_map() function. This list used to be
per-map, but there is really no reason for that. Instead make the
flush list global for all devmaps, which simplifies __cpu_map_flush()
and cpu_map_alloc().

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-7-bjorn.topel@gmail.com
2019-12-19 21:09:43 -08:00
Björn Töpel
96360004b8 xdp: Make devmap flush_list common for all map instances
The devmap flush list is used to track entries that need to flushed
from via the xdp_do_flush_map() function. This list used to be
per-map, but there is really no reason for that. Instead make the
flush list global for all devmaps, which simplifies __dev_map_flush()
and dev_map_init_map().

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-6-bjorn.topel@gmail.com
2019-12-19 21:09:43 -08:00
Björn Töpel
e312b9e706 xsk: Make xskmap flush_list common for all map instances
The xskmap flush list is used to track entries that need to flushed
from via the xdp_do_flush_map() function. This list used to be
per-map, but there is really no reason for that. Instead make the
flush list global for all xskmaps, which simplifies __xsk_map_flush()
and xsk_map_alloc().

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-5-bjorn.topel@gmail.com
2019-12-19 21:09:43 -08:00
Björn Töpel
fb5aacdf36 xdp: Fix graze->grace type-o in cpumap comments
Simple spelling fix.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-4-bjorn.topel@gmail.com
2019-12-19 21:09:43 -08:00
Björn Töpel
4bc188c7f2 xdp: Simplify cpumap cleanup
After the RCU flavor consolidation [1], call_rcu() and
synchronize_rcu() waits for preempt-disable regions (NAPI) in addition
to the read-side critical sections. As a result of this, the cleanup
code in cpumap can be simplified

* There is no longer a need to flush in __cpu_map_entry_free, since we
  know that this has been done when the call_rcu() callback is
  triggered.

* When freeing the map, there is no need to explicitly wait for a
  flush. It's guaranteed to be done after the synchronize_rcu() call
  in cpu_map_free().

[1] https://lwn.net/Articles/777036/

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-3-bjorn.topel@gmail.com
2019-12-19 21:09:43 -08:00
Björn Töpel
0536b85239 xdp: Simplify devmap cleanup
After the RCU flavor consolidation [1], call_rcu() and
synchronize_rcu() waits for preempt-disable regions (NAPI) in addition
to the read-side critical sections. As a result of this, the cleanup
code in devmap can be simplified

* There is no longer a need to flush in __dev_map_entry_free, since we
  know that this has been done when the call_rcu() callback is
  triggered.

* When freeing the map, there is no need to explicitly wait for a
  flush. It's guaranteed to be done after the synchronize_rcu() call
  in dev_map_free(). The rcu_barrier() is still needed, so that the
  map is not freed prior the elements.

[1] https://lwn.net/Articles/777036/

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-2-bjorn.topel@gmail.com
2019-12-19 21:09:43 -08:00
Steven Rostedt (VMware)
106f41f5a3 tracing: Have the histogram compare functions convert to u64 first
The compare functions of the histogram code would be specific for the size
of the value being compared (byte, short, int, long long). It would
reference the value from the array via the type of the compare, but the
value was stored in a 64 bit number. This is fine for little endian
machines, but for big endian machines, it would end up comparing zeros or
all ones (depending on the sign) for anything but 64 bit numbers.

To fix this, first derference the value as a u64 then convert it to the type
being compared.

Link: http://lkml.kernel.org/r/20191211103557.7bed6928@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Acked-by: Tom Zanussi <zanussi@kernel.org>
Reported-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-19 18:26:00 -05:00
Keita Suzuki
79e65c27f0 tracing: Avoid memory leak in process_system_preds()
When failing in the allocation of filter_item, process_system_preds()
goes to fail_mem, where the allocated filter is freed.

However, this leads to memory leak of filter->filter_string and
filter->prog, which is allocated before and in process_preds().
This bug has been detected by kmemleak as well.

Fix this by changing kfree to __free_fiter.

unreferenced object 0xffff8880658007c0 (size 32):
  comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
  hex dump (first 32 bytes):
    63 6f 6d 6d 6f 6e 5f 70 69 64 20 20 3e 20 31 30  common_pid  > 10
    00 00 00 00 00 00 00 00 65 73 00 00 00 00 00 00  ........es......
  backtrace:
    [<0000000067441602>] kstrdup+0x2d/0x60
    [<00000000141cf7b7>] apply_subsystem_event_filter+0x378/0x932
    [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
    [<0000000072da2bee>] vfs_write+0xe1/0x240
    [<000000004f14f473>] ksys_write+0xb4/0x150
    [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
    [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff888060c22d00 (size 64):
  comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 e8 d7 41 80 88 ff ff  ...........A....
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000b8c1b109>] process_preds+0x243/0x1820
    [<000000003972c7f0>] apply_subsystem_event_filter+0x3be/0x932
    [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
    [<0000000072da2bee>] vfs_write+0xe1/0x240
    [<000000004f14f473>] ksys_write+0xb4/0x150
    [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
    [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff888041d7e800 (size 512):
  comm "bash", pid 579, jiffies 4295096372 (age 17.752s)
  hex dump (first 32 bytes):
    70 bc 85 97 ff ff ff ff 0a 00 00 00 00 00 00 00  p...............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000001e04af34>] process_preds+0x71a/0x1820
    [<000000003972c7f0>] apply_subsystem_event_filter+0x3be/0x932
    [<000000009ca32334>] subsystem_filter_write+0x5a/0x90
    [<0000000072da2bee>] vfs_write+0xe1/0x240
    [<000000004f14f473>] ksys_write+0xb4/0x150
    [<00000000a968b4a0>] do_syscall_64+0x6d/0x1e0
    [<000000001a189f40>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Link: http://lkml.kernel.org/r/20191211091258.11310-1-keitasuzuki.park@sslab.ics.keio.ac.jp

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 404a3add43 ("tracing: Only add filter list when needed")
Signed-off-by: Keita Suzuki <keitasuzuki.park@sslab.ics.keio.ac.jp>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-19 18:24:17 -05:00
Daniel Borkmann
cc52d9140a bpf: Fix record_func_key to perform backtracking on r3
While testing Cilium with /unreleased/ Linus' tree under BPF-based NodePort
implementation, I noticed a strange BPF SNAT engine behavior from time to
time. In some cases it would do the correct SNAT/DNAT service translation,
but at a random point in time it would just stop and perform an unexpected
translation after SYN, SYN/ACK and stack would send a RST back. While initially
assuming that there is some sort of a race condition in BPF code, adding
trace_printk()s for debugging purposes at some point seemed to have resolved
the issue auto-magically.

Digging deeper on this Heisenbug and reducing the trace_printk() calls to
an absolute minimum, it turns out that a single call would suffice to
trigger / not trigger the seen RST issue, even though the logic of the
program itself remains unchanged. Turns out the single call changed verifier
pruning behavior to get everything to work. Reconstructing a minimal test
case, the incorrect JIT dump looked as follows:

  # bpftool p d j i 11346
  0xffffffffc0cba96c:
  [...]
    21:   movzbq 0x30(%rdi),%rax
    26:   cmp    $0xd,%rax
    2a:   je     0x000000000000003a
    2c:   xor    %edx,%edx
    2e:   movabs $0xffff89cc74e85800,%rsi
    38:   jmp    0x0000000000000049
    3a:   mov    $0x2,%edx
    3f:   movabs $0xffff89cc74e85800,%rsi
    49:   mov    -0x224(%rbp),%eax
    4f:   cmp    $0x20,%eax
    52:   ja     0x0000000000000062
    54:   add    $0x1,%eax
    57:   mov    %eax,-0x224(%rbp)
    5d:   jmpq   0xffffffffffff6911
    62:   mov    $0x1,%eax
  [...]

Hence, unexpectedly, JIT emitted a direct jump even though retpoline based
one would have been needed since in line 2c and 3a we have different slot
keys in BPF reg r3. Verifier log of the test case reveals what happened:

  0: (b7) r0 = 14
  1: (73) *(u8 *)(r1 +48) = r0
  2: (71) r0 = *(u8 *)(r1 +48)
  3: (15) if r0 == 0xd goto pc+4
   R0_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=ctx(id=0,off=0,imm=0) R10=fp0
  4: (b7) r3 = 0
  5: (18) r2 = 0xffff89cc74d54a00
  7: (05) goto pc+3
  11: (85) call bpf_tail_call#12
  12: (b7) r0 = 1
  13: (95) exit
  from 3 to 8: R0_w=inv13 R1=ctx(id=0,off=0,imm=0) R10=fp0
  8: (b7) r3 = 2
  9: (18) r2 = 0xffff89cc74d54a00
  11: safe
  processed 13 insns (limit 1000000) [...]

Second branch is pruned by verifier since considered safe, but issue is that
record_func_key() couldn't have seen the index in line 3a and therefore
decided that emitting a direct jump at this location was okay.

Fix this by reusing our backtracking logic for precise scalar verification
in order to prevent pruning on the slot key. This means verifier will track
content of r3 all the way backwards and only prune if both scalars were
unknown in state equivalence check and therefore poisoned in the first place
in record_func_key(). The range is [x,x] in record_func_key() case since
the slot always would have to be constant immediate. Correct verification
after fix:

  0: (b7) r0 = 14
  1: (73) *(u8 *)(r1 +48) = r0
  2: (71) r0 = *(u8 *)(r1 +48)
  3: (15) if r0 == 0xd goto pc+4
   R0_w=invP(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=ctx(id=0,off=0,imm=0) R10=fp0
  4: (b7) r3 = 0
  5: (18) r2 = 0x0
  7: (05) goto pc+3
  11: (85) call bpf_tail_call#12
  12: (b7) r0 = 1
  13: (95) exit
  from 3 to 8: R0_w=invP13 R1=ctx(id=0,off=0,imm=0) R10=fp0
  8: (b7) r3 = 2
  9: (18) r2 = 0x0
  11: (85) call bpf_tail_call#12
  12: (b7) r0 = 1
  13: (95) exit
  processed 15 insns (limit 1000000) [...]

And correct corresponding JIT dump:

  # bpftool p d j i 11
  0xffffffffc0dc34c4:
  [...]
    21:	  movzbq 0x30(%rdi),%rax
    26:	  cmp    $0xd,%rax
    2a:	  je     0x000000000000003a
    2c:	  xor    %edx,%edx
    2e:	  movabs $0xffff9928b4c02200,%rsi
    38:	  jmp    0x0000000000000049
    3a:	  mov    $0x2,%edx
    3f:	  movabs $0xffff9928b4c02200,%rsi
    49:	  cmp    $0x4,%rdx
    4d:	  jae    0x0000000000000093
    4f:	  and    $0x3,%edx
    52:	  mov    %edx,%edx
    54:	  cmp    %edx,0x24(%rsi)
    57:	  jbe    0x0000000000000093
    59:	  mov    -0x224(%rbp),%eax
    5f:	  cmp    $0x20,%eax
    62:	  ja     0x0000000000000093
    64:	  add    $0x1,%eax
    67:	  mov    %eax,-0x224(%rbp)
    6d:	  mov    0x110(%rsi,%rdx,8),%rax
    75:	  test   %rax,%rax
    78:	  je     0x0000000000000093
    7a:	  mov    0x30(%rax),%rax
    7e:	  add    $0x19,%rax
    82:   callq  0x000000000000008e
    87:   pause
    89:   lfence
    8c:   jmp    0x0000000000000087
    8e:   mov    %rax,(%rsp)
    92:   retq
    93:   mov    $0x1,%eax
  [...]

Also explicitly adding explicit env->allow_ptr_leaks to fixup_bpf_calls() since
backtracking is enabled under former (direct jumps as well, but use different
test). In case of only tracking different map pointers as in c93552c443 ("bpf:
properly enforce index mask to prevent out-of-bounds speculation"), pruning
cannot make such short-cuts, neither if there are paths with scalar and non-scalar
types as r3. mark_chain_precision() is only needed after we know that
register_is_const(). If it was not the case, we already poison the key on first
path and non-const key in later paths are not matching the scalar range in regsafe()
either. Cilium NodePort testing passes fine as well now. Note, released kernels
not affected.

Fixes: d2e4c1e6c2 ("bpf: Constant map key tracking for prog array pokes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/ac43ffdeb7386c5bd688761ed266f3722bb39823.1576789878.git.daniel@iogearbox.net
2019-12-19 13:39:22 -08:00
Aditya Pakki
5bf2fc1f9c bpf: Remove unnecessary assertion on fp_old
The two callers of bpf_prog_realloc - bpf_patch_insn_single and
bpf_migrate_filter dereference the struct fp_old, before passing
it to the function. Thus assertion to check fp_old is unnecessary
and can be removed.

Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191219175735.19231-1-pakki001@umn.edu
2019-12-19 22:24:15 +01:00
Linus Torvalds
5f096c0ecd Power management fix for 5.5-rc3
Fix a problem related to CPU offline/online and cpufreq governors
 that in some system configurations may lead to a system-wide
 deadlock during CPU online.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl37lO4SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxrUoP+wfiXQ8k3GncyD8NXY1/GhEmqB95v/f4
 clbn0xNu2WaQB3UdO/LkouL0+IaVw/i8PAt0cdeuEjKSgbPT8HHCkN28J0oia02H
 HD7JzdiUZh7ONG1eq9Z/7ckSXBflZaUIjzTi6C1axX8reEzGVVuy5LNhc+0iWjsh
 +mr9hRymgsRcGHPTN+CKi8Qhb29PPvVRt4YbghL0moQUDYewYENb/JBYJIjhgChG
 vWpHX6Kra99uveTMkAN5GVcgZP5b/RiM5E+cCpLEZDTSUnCIuTPM38ATGDTpadpW
 DSDuu+vEEmFu7RHO/lheN92n2fnTgjGpl5d6L5qwGCSzm0GeYZNo84RDEFCWwXZh
 5sY8oz+1wA2MIXV3f1bXYTDMWWQSitSVQ3A9OeKLlprGcZhG/66T2QB7aTut/D/R
 devyNt+xjMoqKcA7AaeVZ6XqUSHMTSCak88okXbKapJq6qkA6QkVsga+LArlRa0c
 xdA6lma2ICPG7Q2ta2G4nHekHd9mDSaR7aFkcKoApOkIDKUY9j47pI3KWSgVFCu3
 D6by7F7CCWHfp0Vw22eGuCQokBsLvhMsa7qwFlxKoxC6iJADANzBVkRzaH70wu2w
 QP2Xu9+WndyRJrrmIQS5iTrClUfgverOgXTJ5OH2jFm+Oi4r6quTKF83rturnDBr
 J8OK4odeh6E9
 =+MQE
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix a problem related to CPU offline/online and cpufreq governors that
  in some system configurations may lead to a system-wide deadlock
  during CPU online"

* tag 'pm-5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: Avoid leaving stale IRQ work items during CPU offline
2019-12-19 08:09:43 -08:00
Arnd Bergmann
4f9fbd893f y2038: rename itimerval to __kernel_old_itimerval
Take the renaming of timeval and timespec one level further,
also renaming itimerval to __kernel_old_itimerval, to avoid
namespace conflicts with the user-space structure that may
use 64-bit time_t members.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:33 +01:00
Arnd Bergmann
751addac78 y2038: remove obsolete jiffies conversion functions
Now that the last user of timespec_to_jiffies() is gone, these
can just be removed, everything else is using ktime_t or timespec64
already.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:33 +01:00
Arnd Bergmann
352c912b0a tsacct: add 64-bit btime field
As there is only a 32-bit ac_btime field in taskstat and
we should handle dates after the overflow, add a new field
with the same information but 64-bit width that can hold
a full time64_t.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:31 +01:00
Arnd Bergmann
2d602bf283 acct: stop using get_seconds()
In 'struct acct', 'struct acct_v3', and 'struct taskstats' we have
a 32-bit 'ac_btime' field containing an absolute time value, which
will overflow in year 2106.

There are two possible ways to deal with it:

a) let it overflow and have user space code deal with reconstructing
   the data based on the current time, or
b) truncate the times based on the range of the u32 type.

Neither of them solves the actual problem. Pick the second
one to best document what the issue is, and have someone
fix it in a future version.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-12-18 18:07:31 +01:00
Linus Torvalds
9e8a0d5ff8 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Tone down mutex debugging complaints, and annotate/fix spinlock
  debugging data accesses for KCSAN"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "locking/mutex: Complain upon mutex API misuse in IRQ contexts"
  locking/spinlock/debug: Fix various data races
2019-12-17 11:00:46 -08:00
Daniel Borkmann
e47304232b bpf: Fix cgroup local storage prog tracking
Recently noticed that we're tracking programs related to local storage maps
through their prog pointer. This is a wrong assumption since the prog pointer
can still change throughout the verification process, for example, whenever
bpf_patch_insn_single() is called.

Therefore, the prog pointer that was assigned via bpf_cgroup_storage_assign()
is not guaranteed to be the same as we pass in bpf_cgroup_storage_release()
and the map would therefore remain in busy state forever. Fix this by using
the prog's aux pointer which is stable throughout verification and beyond.

Fixes: de9cbbaadb ("bpf: introduce cgroup storage maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/1471c69eca3022218666f909bc927a92388fd09e.1576580332.git.daniel@iogearbox.net
2019-12-17 08:58:02 -08:00
Yangtao Li
a5e37de90e stop_machine: remove try_stop_cpus helper
try_stop_cpus is not used after this:

commit c190c3b16c ("rcu: Switch synchronize_sched_expedited() to
stop_one_cpu()")

So remove it.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191214195107.26480-1-tiny.windzz@gmail.com
2019-12-17 13:32:51 +01:00
Peng Wang
d040e0734f schied/fair: Skip calculating @contrib without load
Because of the:

	if (!load)
		runnable = running = 0;

clause in ___update_load_sum(), all the actual users of @contrib in
accumulate_sum():

	if (load)
		sa->load_sum += load * contrib;
	if (runnable)
		sa->runnable_load_sum += runnable * contrib;
	if (running)
		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;

don't happen, and therefore we don't care what @contrib actually is and
calculating it is pointless.

If we count the times when @load equals zero and not as below:

	if (load) {
		load_is_not_zero_count++;
		contrib = __accumulate_pelt_segments(periods,
				1024 - sa->period_contrib,delta);
	} else
		load_is_zero_count++;

As we can see, load_is_zero_count is much bigger than
load_is_zero_count, and the gap is gradually widening:

	load_is_zero_count:            6016044 times
	load_is_not_zero_count:         244316 times
	19:50:43 up 1 min,  1 user,  load average: 0.09, 0.06, 0.02

	load_is_zero_count:            7956168 times
	load_is_not_zero_count:         261472 times
	19:51:42 up 2 min,  1 user,  load average: 0.03, 0.05, 0.01

	load_is_zero_count:           10199896 times
	load_is_not_zero_count:         278364 times
	19:52:51 up 3 min,  1 user,  load average: 0.06, 0.05, 0.01

	load_is_zero_count:           14333700 times
	load_is_not_zero_count:         318424 times
	19:54:53 up 5 min,  1 user,  load average: 0.01, 0.03, 0.00

Perhaps we can gain some performance advantage by saving these
unnecessary calculation.

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot < vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1576208740-35609-1-git-send-email-rocking@linux.alibaba.com
2019-12-17 13:32:51 +01:00
Cheng Jian
60588bfa22 sched/fair: Optimize select_idle_cpu
select_idle_cpu() will scan the LLC domain for idle CPUs,
it's always expensive. so the next commit :

	1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")

introduces a way to limit how many CPUs we scan.

But it consume some CPUs out of 'nr' that are not allowed
for the task and thus waste our attempts. The function
always return nr_cpumask_bits, and we can't find a CPU
which our task is allowed to run.

Cpumask may be too big, similar to select_idle_core(), use
per_cpu_ptr 'select_idle_mask' to prevent stack overflow.

Fixes: 1ad3aaf3fc ("sched/core: Implement new approach to scale select_idle_cpu()")
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
2019-12-17 13:32:51 +01:00
Peter Zijlstra
45178ac0ce cpu/hotplug, stop_machine: Fix stop_machine vs hotplug order
Paul reported a very sporadic, rcutorture induced, workqueue failure.
When the planets align, the workqueue rescuer's self-migrate fails and
then triggers a WARN for running a work on the wrong CPU.

Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call
could be ignored! When stopper->enabled is false, stop_machine will
insta complete the work, without actually doing the work. Worse, it
will not WARN about this (we really should fix this).

It turns out there is a small window where a freshly online'ed CPU is
marked 'online' but doesn't yet have the stopper task running:

	BP				AP

	bringup_cpu()
	  __cpu_up(cpu, idle)	 -->	start_secondary()
					...
					cpu_startup_entry()
	  bringup_wait_for_ap()
	    wait_for_ap_thread() <--	  cpuhp_online_idle()
					  while (1)
					    do_idle()

					... available to run kthreads ...

	    stop_machine_unpark()
	      stopper->enable = true;

Close this by moving the stop_machine_unpark() into
cpuhp_online_idle(), such that the stopper thread is ready before we
start the idle loop and schedule.

Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Debugged-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
2019-12-17 13:32:50 +01:00
Oleg Nesterov
cde6519450 sched/wait: fix ___wait_var_event(exclusive)
init_wait_var_entry() forgets to initialize wq_entry->flags.

Currently not a problem, we don't have wait_var_event_exclusive().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20191210191902.GB14449@redhat.com
2019-12-17 13:32:50 +01:00
Frederic Weisbecker
5443a0be61 sched: Use fair:prio_changed() instead of ad-hoc implementation
set_user_nice() implements its own version of fair::prio_changed() and
therefore misses a specific optimization towards nohz_full CPUs that
avoid sending an resched IPI to a reniced task running alone. Use the
proper callback instead.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-3-frederic@kernel.org
2019-12-17 13:32:50 +01:00
Frederic Weisbecker
7c2e8bbd87 sched: Spare resched IPI when prio changes on a single fair task
The runqueue of a fair task being remotely reniced is going to get a
resched IPI in order to reassess which task should be the current
running on the CPU. However that evaluation is useless if the fair task
is running alone, in which case we can spare that IPI, preventing
nohz_full CPUs from being disturbed.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191203160106.18806-2-frederic@kernel.org
2019-12-17 13:32:50 +01:00
Vincent Guittot
6cf82d559e sched/cfs: fix spurious active migration
The load balance can fail to find a suitable task during the periodic check
because  the imbalance is smaller than half of the load of the waiting
tasks. This results in the increase of the number of failed load balance,
which can end up to start an active migration. This active migration is
useless because the current running task is not a better choice than the
waiting ones. In fact, the current task was probably not running but
waiting for the CPU during one of the previous attempts and it had already
not been selected.

When load balance fails too many times to migrate a task, we should relax
the contraint on the maximum load of the tasks that can be migrated
similarly to what is done with cache hotness.

Before the rework, load balance used to set the imbalance to the average
load_per_task in order to mitigate such situation. This increased the
likelihood of migrating a task but also of selecting a larger task than
needed while more appropriate ones were in the list.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1575036287-6052-1-git-send-email-vincent.guittot@linaro.org
2019-12-17 13:32:48 +01:00
Vincent Guittot
7ed735c331 sched/fair: Fix find_idlest_group() to handle CPU affinity
Because of CPU affinity, the local group can be skipped which breaks the
assumption that statistics are always collected for local group. With
uninitialized local_sgs, the comparison is meaningless and the behavior
unpredictable. This can even end up to use local pointer which is to
NULL in this case.

If the local group has been skipped because of CPU affinity, we return
the idlest group.

Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: John Stultz <john.stultz@linaro.org>
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: mingo@redhat.com
Cc: mgorman@suse.de
Cc: juri.lelli@redhat.com
Cc: dietmar.eggemann@arm.com
Cc: bsegall@google.com
Cc: qais.yousef@arm.com
Link: https://lkml.kernel.org/r/1575483700-22153-1-git-send-email-vincent.guittot@linaro.org
2019-12-17 13:32:48 +01:00
Johannes Weiner
c3466952ca psi: Fix a division error in psi poll()
The psi window size is a u64 an can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
2019-12-17 13:32:48 +01:00
Johannes Weiner
3dfbe25c27 sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
Jingfeng reports rare div0 crashes in psi on systems with some uptime:

[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330

The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.

We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.

The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.

Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.

Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
2019-12-17 13:32:47 +01:00
Sebastian Andrzej Siewior
9f0bff1180 perf/core: Add SRCU annotation for pmus list walk
Since commit
   28875945ba ("rcu: Add support for consolidated-RCU reader checking")

there is an additional check to ensure that a RCU related lock is held
while the RCU list is iterated.
This section holds the SRCU reader lock instead.

Add annotation to list_for_each_entry_rcu() that pmus_srcu must be
acquired during the list traversal.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lkml.kernel.org/r/20191119121429.zhcubzdhm672zasg@linutronix.de
2019-12-17 13:32:46 +01:00
Daniel Borkmann
a2ea07465c bpf: Fix missing prog untrack in release_maps
Commit da765a2f59 ("bpf: Add poke dependency tracking for prog array
maps") wrongly assumed that in case of prog load errors, we're cleaning
up all program tracking via bpf_free_used_maps().

However, it can happen that we're still at the point where we didn't copy
map pointers into the prog's aux section such that env->prog->aux->used_maps
is still zero, running into a UAF. In such case, the verifier has similar
release_maps() helper that drops references to used maps from its env.

Consolidate the release code into __bpf_free_used_maps() and call it from
all sides to fix it.

Fixes: da765a2f59 ("bpf: Add poke dependency tracking for prog array maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/1c2909484ca524ae9f55109b06f22b6213e76376.1576514756.git.daniel@iogearbox.net
2019-12-16 10:59:29 -08:00
Linus Torvalds
22ff311af9 treewide conversion from FIELD_SIZEOF() to sizeof_field()
-----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3umDgWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlvsD/49R12HK7UzTxNTrcpvbadJ4t7j
 j/qJvjMerW7iVNAPOoNAOePUa21+y3rI1AZPvoPyzIqp1Bf2eOICf5SdisG2cG+O
 X0A8EKWvS0SSQWSKaT6udUKJ3nBJItwvOvQ5B58KQzcOj3S4X7B9iVBWgieMHrzz
 urkZm7pqowrZB3wuF8keRtli5IZaoiCwzApy48Qrn70G3OeXymknFbpHTDwIAiGw
 RiE5Xh0R4EzQdsYyCgjR8U56gBchadAmj8BUJU0ppMnOFMyIAG670hNLrs0L3roP
 8TOIeyb993ZC5GZaMlnR8mz0jfibfkPa3Z85VAsVyQSPaOQldwc9j8TGBqD5Gfat
 1PjOU5RVwma0pH5xTPOeevWPQpIK9KovQpQYqMMN9GMxOEx96IOUjwTrnNK2xWoN
 UGyOVlESFGoniClhCiKYzPSrYOjlIBk5ovf15PdTe+bwyUDMfyfy5CZV88OS2DHz
 ZBZvpLrH/EMW9zJ+FqMTp0C4s4wa2Ioid3bSh6XuNUTtltKSjp71eUja8ZEz+2sd
 5AGstCC+hYqxaEk+6/851pfkQ9sbBjwuGtNrtX+pqreiLUvWLhQ0yUj6cLXlEQNH
 aucjCukCjI+4lMzofeaQ2LbNhtff4YsfO4b1Ye8maoDdHjzUVL57n3bTOxKhdzbt
 y6FM3lApOjk3OyaTJQ==
 =YU4A
 -----END PGP SIGNATURE-----

Merge tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull FIELD_SIZEOF conversion from Kees Cook:
 "A mostly mechanical treewide conversion from FIELD_SIZEOF() to
  sizeof_field(). This avoids the redundancy of having 2 macros
  (actually 3) doing the same thing, and consolidates on sizeof_field().
  While "field" is not an accurate name, it is the common name used in
  the kernel, and doesn't result in any unintended innuendo.

  As there are still users of FIELD_SIZEOF() in -next, I will clean up
  those during this coming development cycle and send the final old
  macro removal patch at that time"

* tag 'sizeof_field-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use sizeof_field() macro
  MIPS: OCTEON: Replace SIZEOF_FIELD() macro
2019-12-13 14:02:12 -08:00
Björn Töpel
7e6897f959 bpf, xdp: Start using the BPF dispatcher for XDP
This commit adds a BPF dispatcher for XDP. The dispatcher is updated
from the XDP control-path, dev_xdp_install(), and used when an XDP
program is run via bpf_prog_run_xdp().

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-4-bjorn.topel@gmail.com
2019-12-13 13:09:32 -08:00
Björn Töpel
75ccbef636 bpf: Introduce BPF dispatcher
The BPF dispatcher is a multi-way branch code generator, mainly
targeted for XDP programs. When an XDP program is executed via the
bpf_prog_run_xdp(), it is invoked via an indirect call. The indirect
call has a substantial performance impact, when retpolines are
enabled. The dispatcher transform indirect calls to direct calls, and
therefore avoids the retpoline. The dispatcher is generated using the
BPF JIT, and relies on text poking provided by bpf_arch_text_poke().

The dispatcher hijacks a trampoline function it via the __fentry__ nop
of the trampoline. One dispatcher instance currently supports up to 64
dispatch points. A user creates a dispatcher with its corresponding
trampoline with the DEFINE_BPF_DISPATCHER macro.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-3-bjorn.topel@gmail.com
2019-12-13 13:09:32 -08:00
Björn Töpel
98e8627efc bpf: Move trampoline JIT image allocation to a function
Refactor the image allocation in the BPF trampoline code into a
separate function, so it can be shared with the BPF dispatcher in
upcoming commits.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191213175112.30208-2-bjorn.topel@gmail.com
2019-12-13 13:09:32 -08:00
Paul E. McKenney
c30fe54189 rcu: Mark non-global functions and variables as static
Each of rcu_state, rcu_rnp_online_cpus(), rcu_dynticks_curr_cpu_in_eqs(),
and rcu_dynticks_snap() are used only in the kernel/rcu/tree.o translation
unit, and may thus be marked static.  This commit therefore makes this
change.

Reported-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-12-12 10:24:52 -08:00
Rafael J. Wysocki
85572c2c4a cpufreq: Avoid leaving stale IRQ work items during CPU offline
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.

Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.

If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point.  Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever.  In particular, a system-wide
deadlock may occur during CPU online as a result of that.

The failing scenario is as follows.  CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu).  It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline.  The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above.  Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3.  When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.

To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.

Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.

Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.

Fixes: 99d14d0e16 ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang <anson.huang@nxp.com>
Tested-by: Anson Huang <anson.huang@nxp.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-12 17:59:43 +01:00
Daniel Borkmann
81c22041d9 bpf, x86, arm64: Enable jit by default when not built as always-on
After Spectre 2 fix via 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON
config") most major distros use BPF_JIT_ALWAYS_ON configuration these days
which compiles out the BPF interpreter entirely and always enables the
JIT. Also given recent fix in e1608f3fa8 ("bpf: Avoid setting bpf insns
pages read-only when prog is jited"), we additionally avoid fragmenting
the direct map for the BPF insns pages sitting in the general data heap
since they are not used during execution. Latter is only needed when run
through the interpreter.

Since both x86 and arm64 JITs have seen a lot of exposure over the years,
are generally most up to date and maintained, there is more downside in
!BPF_JIT_ALWAYS_ON configurations to have the interpreter enabled by default
rather than the JIT. Add a ARCH_WANT_DEFAULT_BPF_JIT config which archs can
use to set the bpf_jit_{enable,kallsyms} to 1. Back in the days the
bpf_jit_kallsyms knob was set to 0 by default since major distros still
had /proc/kallsyms addresses exposed to unprivileged user space which is
not the case anymore. Hence both knobs are set via BPF_JIT_DEFAULT_ON which
is set to 'y' in case of BPF_JIT_ALWAYS_ON or ARCH_WANT_DEFAULT_BPF_JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/f78ad24795c2966efcc2ee19025fa3459f622185.1575903816.git.daniel@iogearbox.net
2019-12-11 16:16:01 -08:00
Alexei Starovoitov
b91e014f07 bpf: Make BPF trampoline use register_ftrace_direct() API
Make BPF trampoline attach its generated assembly code to kernel functions via
register_ftrace_direct() API. It helps ftrace-based tracers co-exist with BPF
trampoline on the same kernel function. It also switches attaching logic from
arch specific text_poke to generic ftrace that is available on many
architectures. text_poke is still necessary for bpf-to-bpf attach and for
bpf_tail_call optimization.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191209000114.1876138-3-ast@kernel.org
2019-12-11 15:18:08 -08:00
Linus Torvalds
6674fdb25a This contains 3 changes:
- Removal of code I accidentally applied when doing a minor fix up
    to a patch, and then using "git commit -a --amend", which pulled
    in some other changes I was playing with.
 
  - Remove an used variable in trace_events_inject code
 
  - Fix to function graph tracer when it traces a ftrace direct function.
    It will now ignore tracing a function that has a ftrace direct
    tramploine attached. This is needed for eBPF to use the ftrace direct
    code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXfD/thQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoo2AP4j7ONw7BTmMyo+GdYqPPntBeDnClHK
 vfMKrgK1j5BxYgEA7LgkwuUT9bcyLjfJVcyfeW67rB2PtmovKTWnKihFOwI=
 =DZ6N
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Remove code I accidentally applied when doing a minor fix up to a
   patch, and then using "git commit -a --amend", which pulled in some
   other changes I was playing with.

 - Remove an used variable in trace_events_inject code

 - Fix function graph tracer when it traces a ftrace direct function.
   It will now ignore tracing a function that has a ftrace direct
   tramploine attached. This is needed for eBPF to use the ftrace direct
   code.

* tag 'trace-v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix function_graph tracer interaction with BPF trampoline
  tracing: remove set but not used variable 'buffer'
  module: Remove accidental change of module_enable_x()
2019-12-11 12:22:38 -08:00
Daniel Borkmann
bae141f54b bpf: Emit audit messages upon successful prog load and unload
Allow for audit messages to be emitted upon BPF program load and
unload for having a timeline of events. The load itself is in
syscall context, so additional info about the process initiating
the BPF prog creation can be logged and later directly correlated
to the unload event.

The only info really needed from BPF side is the globally unique
prog ID where then audit user space tooling can query / dump all
info needed about the specific BPF program right upon load event
and enrich the record, thus these changes needed here can be kept
small and non-intrusive to the core.

Raw example output:

  # auditctl -D
  # auditctl -a always,exit -F arch=x86_64 -S bpf
  # ausearch --start recent -m 1334
  ...
  ----
  time->Wed Nov 27 16:04:13 2019
  type=PROCTITLE msg=audit(1574867053.120:84664): proctitle="./bpf"
  type=SYSCALL msg=audit(1574867053.120:84664): arch=c000003e syscall=321   \
    success=yes exit=3 a0=5 a1=7ffea484fbe0 a2=70 a3=0 items=0 ppid=7477    \
    pid=12698 auid=1001 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001    \
    egid=1001 sgid=1001 fsgid=1001 tty=pts2 ses=4 comm="bpf"                \
    exe="/home/jolsa/auditd/audit-testsuite/tests/bpf/bpf"                  \
    subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
  type=UNKNOWN[1334] msg=audit(1574867053.120:84664): prog-id=76 op=LOAD
  ----
  time->Wed Nov 27 16:04:13 2019
  type=UNKNOWN[1334] msg=audit(1574867053.120:84665): prog-id=76 op=UNLOAD
  ...

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/bpf/20191206214934.11319-1-jolsa@kernel.org
2019-12-11 17:41:09 +01:00
Arnd Bergmann
4c80c7bc58 bpf: Fix build in minimal configurations, again
Building with -Werror showed another failure:

kernel/bpf/btf.c: In function 'btf_get_prog_ctx_type.isra.31':
kernel/bpf/btf.c:3508:63: error: array subscript 0 is above array bounds of 'u8[0]' {aka 'unsigned char[0]'} [-Werror=array-bounds]
  ctx_type = btf_type_member(conv_struct) + bpf_ctx_convert_map[prog_type] * 2;

I don't actually understand why the array is empty, but a similar
fix has addressed a related problem, so I suppose we can do the
same thing here.

Fixes: ce27709b81 ("bpf: Fix build in minimal configurations")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191210203553.2941035-1-arnd@arndb.de
2019-12-11 13:57:26 +01:00
Daniel Jordan
bfcdcef8c8 padata: update documentation
Remove references to unused functions, standardize language, update to
reflect new functionality, migrate to rst format, and fix all kernel-doc
warnings.

Fixes: 815613da6a ("kernel/padata.c: removed unused code")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:37:02 +08:00
Daniel Jordan
3facced7ae padata: remove reorder_objects
reorder_objects is unused since the rework of padata's flushing, so
remove it.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:37:02 +08:00
Daniel Jordan
91a71d6121 padata: remove cpumask change notifier
Since commit 63d3578892 ("crypto: pcrypt - remove padata cpumask
notifier") this feature is unused, so get rid of it.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:37:02 +08:00
Daniel Jordan
38228e8848 padata: always acquire cpu_hotplug_lock before pinst->lock
lockdep complains when padata's paths to update cpumasks via CPU hotplug
and sysfs are both taken:

  # echo 0 > /sys/devices/system/cpu/cpu1/online
  # echo ff > /sys/kernel/pcrypt/pencrypt/parallel_cpumask

  ======================================================
  WARNING: possible circular locking dependency detected
  5.4.0-rc8-padata-cpuhp-v3+ #1 Not tainted
  ------------------------------------------------------
  bash/205 is trying to acquire lock:
  ffffffff8286bcd0 (cpu_hotplug_lock.rw_sem){++++}, at: padata_set_cpumask+0x2b/0x120

  but task is already holding lock:
  ffff8880001abfa0 (&pinst->lock){+.+.}, at: padata_set_cpumask+0x26/0x120

  which lock already depends on the new lock.

padata doesn't take cpu_hotplug_lock and pinst->lock in a consistent
order.  Which should be first?  CPU hotplug calls into padata with
cpu_hotplug_lock already held, so it should have priority.

Fixes: 6751fb3c0e ("padata: Use get_online_cpus/put_online_cpus")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:37:02 +08:00
Daniel Jordan
894c9ef978 padata: validate cpumask without removed CPU during offline
Configuring an instance's parallel mask without any online CPUs...

  echo 2 > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
  echo 0 > /sys/devices/system/cpu/cpu1/online

...makes tcrypt mode=215 crash like this:

  divide error: 0000 [#1] SMP PTI
  CPU: 4 PID: 283 Comm: modprobe Not tainted 5.4.0-rc8-padata-doc-v2+ #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191013_105130-anatol 04/01/2014
  RIP: 0010:padata_do_parallel+0x114/0x300
  Call Trace:
   pcrypt_aead_encrypt+0xc0/0xd0 [pcrypt]
   crypto_aead_encrypt+0x1f/0x30
   do_mult_aead_op+0x4e/0xdf [tcrypt]
   test_mb_aead_speed.constprop.0.cold+0x226/0x564 [tcrypt]
   do_test+0x28c2/0x4d49 [tcrypt]
   tcrypt_mod_init+0x55/0x1000 [tcrypt]
   ...

cpumask_weight() in padata_cpu_hash() returns 0 because the mask has no
CPUs.  The problem is __padata_remove_cpu() checks for valid masks too
early and so doesn't mark the instance PADATA_INVALID as expected, which
would have made padata_do_parallel() return error before doing the
division.

Fix by introducing a second padata CPU hotplug state before
CPUHP_BRINGUP_CPU so that __padata_remove_cpu() sees the online mask
without @cpu.  No need for the second argument to padata_replace() since
@cpu is now already missing from the online mask.

Fixes: 33e5445068 ("padata: Handle empty padata cpumasks")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:37:02 +08:00
Herbert Xu
bbefa1dd6a crypto: pcrypt - Avoid deadlock by using per-instance padata queues
If the pcrypt template is used multiple times in an algorithm, then a
deadlock occurs because all pcrypt instances share the same
padata_instance, which completes requests in the order submitted.  That
is, the inner pcrypt request waits for the outer pcrypt request while
the outer request is already waiting for the inner.

This patch fixes this by allocating a set of queues for each pcrypt
instance instead of using two global queues.  In order to maintain
the existing user-space interface, the pinst structure remains global
so any sysfs modifications will apply to every pcrypt instance.

Note that when an update occurs we have to allocate memory for
every pcrypt instance.  Should one of the allocations fail we
will abort the update without rolling back changes already made.

The new per-instance data structure is called padata_shell and is
essentially a wrapper around parallel_data.

Reproducer:

	#include <linux/if_alg.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main()
	{
		struct sockaddr_alg addr = {
			.salg_type = "aead",
			.salg_name = "pcrypt(pcrypt(rfc4106-gcm-aesni))"
		};
		int algfd, reqfd;
		char buf[32] = { 0 };

		algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
		bind(algfd, (void *)&addr, sizeof(addr));
		setsockopt(algfd, SOL_ALG, ALG_SET_KEY, buf, 20);
		reqfd = accept(algfd, 0, 0);
		write(reqfd, buf, 32);
		read(reqfd, buf, 16);
	}

Reported-by: syzbot+56c7151cad94eec37c521f0e47d2eee53f9361c4@syzkaller.appspotmail.com
Fixes: 5068c7a883 ("crypto: pcrypt - Add pcrypt crypto parallelization wrapper")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:36:45 +08:00
Herbert Xu
13380a1471 padata: Remove unused padata_remove_cpu
The function padata_remove_cpu was supposed to have been removed
along with padata_add_cpu but somehow it remained behind.  Let's
kill it now as it doesn't even have a prototype anymore.

Fixes: 815613da6a ("kernel/padata.c: removed unused code")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:34:45 +08:00
Herbert Xu
07928d9bfc padata: Remove broken queue flushing
The function padata_flush_queues is fundamentally broken because
it cannot force padata users to complete the request that is
underway.  IOW padata has to passively wait for the completion
of any outstanding work.

As it stands flushing is used in two places.  Its use in padata_stop
is simply unnecessary because nothing depends on the queues to
be flushed afterwards.

The other use in padata_replace is more substantial as we depend
on it to free the old pd structure.  This patch instead uses the
pd->refcnt to dynamically free the pd structure once all requests
are complete.

Fixes: 2b73b07ab8 ("padata: Flush the padata queues actively")
Cc: <stable@vger.kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-12-11 16:34:44 +08:00
Al Viro
a3d1e7eb5a simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems
two requirements: no file creations in IS_DEADDIR and no cross-directory
renames whatsoever.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-10 22:29:58 -05:00
Davidlohr Bueso
c571b72e2b Revert "locking/mutex: Complain upon mutex API misuse in IRQ contexts"
This ended up causing some noise in places such as rxrpc running in softirq.

The warning is misleading in this case as the mutex trylock and unlock
operations are done within the same context; and therefore we need not
worry about the PI-boosting issues that comes along with no single-owner
lock guarantees.

While we don't want to support this in mutexes, there is no way out of
this yet; so lets get rid of the WARNs for now, as it is only fair to
code that has historically relied on non-preemptible softirq guarantees.
In addition, changing the lock type is also unviable: exclusive rwsems
have the same issue (just not the WARN_ON) and counting semaphores
would introduce a performance hit as mutexes are a lot more optimized.

This reverts:

    a0855d24fc: ("locking/mutex: Complain upon mutex API misuse in IRQ contexts")

Fixes: a0855d24fc: ("locking/mutex: Complain upon mutex API misuse in IRQ contexts")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Tested-by: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-afs@lists.infradead.org
Cc: linux-fsdevel@vger.kernel.org
Cc: will@kernel.org
Link: https://lkml.kernel.org/r/20191210220523.28540-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-11 00:27:43 +01:00
Alexei Starovoitov
ff205766db ftrace: Fix function_graph tracer interaction with BPF trampoline
Depending on type of BPF programs served by BPF trampoline it can call original
function. In such case the trampoline will skip one stack frame while
returning. That will confuse function_graph tracer and will cause crashes with
bad RIP. Teach graph tracer to skip functions that have BPF trampoline attached.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-10 13:53:59 -05:00
YueHaibing
a61f810567 tracing: remove set but not used variable 'buffer'
kernel/trace/trace_events_inject.c: In function trace_inject_entry:
kernel/trace/trace_events_inject.c:20:22: warning: variable buffer set but not used [-Wunused-but-set-variable]

It is never used, so remove it.

Link: http://lkml.kernel.org/r/20191207034409.25668-1-yuehaibing@huawei.com

Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-10 13:53:51 -05:00
Steven Rostedt (VMware)
af74262337 module: Remove accidental change of module_enable_x()
When pulling in Divya Indi's patch, I made a minor fix to remove unneeded
braces. I commited my fix up via "git commit -a --amend". Unfortunately, I
didn't realize I had some changes I was testing in the module code, and
those changes were applied to Divya's patch as well.

This reverts the accidental updates to the module code.

Cc: Jessica Yu <jeyu@kernel.org>
Cc: Divya Indi <divya.indi@oracle.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Fixes: e585e6469d ("tracing: Verify if trace array exists before destroying it.")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-10 13:53:43 -05:00
Ingo Molnar
2040cf9f59 Linux 5.5-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3tf/0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGlKwH/3fTToujuJfTx5E5
 mrARAP65J1L/DxpEKvKRt2bNZo6w13mNd8g7ZPmYChz90bYGvXQSG8hYTU9iAw3O
 yimSTJlNXDhVAluB53XnDdUxIWC4HUZsNxWJNCeXMuiMcGNsTGX+v3f+x7oHCT0P
 jI1RSIsFGjgr0RWqZ8U5aJckQo2xABC1TfYw53K66Oc/JLZpSFJFwMgjf1fD5diU
 HGDA8E2p0u1TQIyNzr86iqMvnlSRYBQwBQn6OgEKCG4Z0NLtXfDF4mqnxsXgLmIH
 oQoFfxaMKXyGWds7ZxwcGWntALCF41ThfpiJWDIyxjWxFEty4bqTCbDPwwyp7ip0
 iuASmTI=
 =YqO2
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc1' into core/kprobes, to resolve conflicts

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-10 10:11:00 +01:00
Paul E. McKenney
5155be9994 rcutorture: Dynamically allocate rcu_fwds structure
This commit switches from static structure to dynamic allocation
for rcu_fwds as another step towards providing multiple call_rcu()
forward-progress kthreads.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 13:00:29 -08:00
Paul E. McKenney
6764100bd2 rcutorture: Complete threading rcu_fwd pointers through functions
This commit threads pointers to rcu_fwd structures through the remaining
functions using rcu_fwds directly, namely rcu_torture_fwd_prog_cbfree(),
rcutorture_oom_notify() and rcu_torture_fwd_prog_init().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 13:00:28 -08:00
Paul E. McKenney
7beba0c06b rcutorture: Move to dynamic initialization of rcu_fwds
In order to add multiple call_rcu() forward-progress kthreads, it will
be necessary to dynamically allocate and initialize.  This commit
therefore moves the initialization from compile time to instead
immediately precede thread-creation time.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 13:00:28 -08:00
Paul E. McKenney
6b1b832546 rcutorture: Thread rcu_fwd pointer through forward-progress functions
In order to add multiple kthreads, it will be necessary to allow
the various functions to operate on a pointer to their kthread's
rcu_fwd structure.  This commit therefore starts the process of
adding the needed "struct rcu_fwd" parameters and arguments to the
various callback forward-progress functions.

Note that rcutorture_oom_notify() and rcu_torture_fwd_cb_hist() will
eventually need to iterate over all kthreads' rcu_fwd structures.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 13:00:28 -08:00
Paul E. McKenney
a289e608b3 rcutorture: Pull callback forward-progress data into rcu_fwd struct
Now that RCU behaves reasonably well with the current single-kthread
call_rcu() forward-progress testing, it is time to add more kthreads.
This commit takes a first step towards that goal by wrapping what
will be the per-kthread data into a new rcu_fwd structure.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 13:00:27 -08:00
Sebastian Andrzej Siewior
90326f0521 rcu: Use CONFIG_PREEMPTION where appropriate
The config option `CONFIG_PREEMPT' is used for the preemption model
"Low-Latency Desktop". The config option `CONFIG_PREEMPTION' is enabled
when kernel preemption is enabled which is true for the preemption model
`CONFIG_PREEMPT' and `CONFIG_PREEMPT_RT'.

Use `CONFIG_PREEMPTION' if it applies to both preemption models and not
just to `CONFIG_PREEMPT'.

Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:37:51 -08:00
Lai Jiangshan
b3e627d3d5 rcu: Make PREEMPT_RCU be a modifier to TREE_RCU
Currently PREEMPT_RCU and TREE_RCU are mutually exclusive Kconfig
options.  But PREEMPT_RCU actually specifies a kind of TREE_RCU,
namely a preemptible TREE_RCU. This commit therefore makes PREEMPT_RCU
be a modifer to the TREE_RCU Kconfig option.  This has the benefit of
simplifying several of the #if expressions that formerly needed to
check both, but now need only check one or the other.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:37:51 -08:00
Paul E. McKenney
03bd2983d7 rcu: Use lockdep rather than comment to enforce lock held
The rcu_preempt_check_blocked_tasks() function has a comment
that states that the rcu_node structure's ->lock must be held,
which might be informative, but which carries little weight if
not read.  This commit therefore removes this comment in favor of
raw_lockdep_assert_held_rcu_node(), which will complain quite
visibly if the required lock is not held.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:37:50 -08:00
Eric Dumazet
6935c3983b rcu: Avoid data-race in rcu_gp_fqs_check_wake()
The rcu_gp_fqs_check_wake() function uses rcu_preempt_blocked_readers_cgp()
to read ->gp_tasks while other cpus might overwrite this field.

We need READ_ONCE()/WRITE_ONCE() pairs to avoid compiler
tricks and KCSAN splats like the following :

BUG: KCSAN: data-race in rcu_gp_fqs_check_wake / rcu_preempt_deferred_qs_irqrestore

write to 0xffffffff85a7f190 of 8 bytes by task 7317 on cpu 0:
 rcu_preempt_deferred_qs_irqrestore+0x43d/0x580 kernel/rcu/tree_plugin.h:507
 rcu_read_unlock_special+0xec/0x370 kernel/rcu/tree_plugin.h:659
 __rcu_read_unlock+0xcf/0xe0 kernel/rcu/tree_plugin.h:394
 rcu_read_unlock include/linux/rcupdate.h:645 [inline]
 __ip_queue_xmit+0x3b0/0xa40 net/ipv4/ip_output.c:533
 ip_queue_xmit+0x45/0x60 include/net/ip.h:236
 __tcp_transmit_skb+0xdeb/0x1cd0 net/ipv4/tcp_output.c:1158
 __tcp_send_ack+0x246/0x300 net/ipv4/tcp_output.c:3685
 tcp_send_ack+0x34/0x40 net/ipv4/tcp_output.c:3691
 tcp_cleanup_rbuf+0x130/0x360 net/ipv4/tcp.c:1575
 tcp_recvmsg+0x633/0x1a30 net/ipv4/tcp.c:2179
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1864 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414

read to 0xffffffff85a7f190 of 8 bytes by task 10 on cpu 1:
 rcu_gp_fqs_check_wake kernel/rcu/tree.c:1556 [inline]
 rcu_gp_fqs_check_wake+0x93/0xd0 kernel/rcu/tree.c:1546
 rcu_gp_fqs_loop+0x36c/0x580 kernel/rcu/tree.c:1611
 rcu_gp_kthread+0x143/0x220 kernel/rcu/tree.c:1768
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10 Comm: rcu_preempt Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
[ paulmck:  Added another READ_ONCE() for RCU CPU stall warnings. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:37:50 -08:00
Stefan Reiter
610dea36d3 rcu/nocb: Fix dump_tree hierarchy print always active
Commit 18cd8c93e6 ("rcu/nocb: Print gp/cb kthread hierarchy if
dump_tree") added print statements to rcu_organize_nocb_kthreads for
debugging, but incorrectly guarded them, causing the function to always
spew out its message.

This patch fixes it by guarding both pr_alert statements with dump_tree,
while also changing the second pr_alert to a pr_cont, to print the
hierarchy in a single line (assuming that's how it was supposed to
work).

Fixes: 18cd8c93e6 ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree")
Signed-off-by: Stefan Reiter <stefan@pimaker.at>
[ paulmck: Make single-nocbs-CPU GP kthreads look less erroneous. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:37:50 -08:00
Paul E. McKenney
df1e849ae4 rcu: Enable tick for nohz_full CPUs slow to provide expedited QS
An expedited grace period can be stalled by a nohz_full CPU looping
in kernel context.  This possibility is currently handled by some
carefully crafted checks in rcu_read_unlock_special() that enlist help
from ksoftirqd when permitted by the scheduler.  However, it is exactly
these checks that require the scheduler avoid holding any of its rq or
pi locks across rcu_read_unlock() without also having held them across
the entire RCU read-side critical section.

It would therefore be very nice if expedited grace periods could
handle nohz_full CPUs looping in kernel context without such checks.
This commit therefore adds code to the expedited grace period's wait
and cleanup code that forces the scheduler-clock interrupt on for CPUs
that fail to quickly supply a quiescent state.  "Quickly" is currently
a hard-coded single-jiffy delay.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:32:59 -08:00
Paul E. McKenney
28f0361fdf rcu: Replace synchronize_sched_expedited_wait() "_sched" with "_rcu"
After RCU flavor consolidation, synchronize_sched_expedited_wait() does
both RCU-preempt and RCU-sched, whichever happens to have been built into
the running kernel.  This commit therefore changes this function's name
to synchronize_rcu_expedited_wait() to reflect its new generic nature.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:24:59 -08:00
Paul E. McKenney
de8cd0a533 rcu: Update tree_exp.h function-header comments
The function-header comments in kernel/rcu/tree_exp.h have gotten a bit
out of date, so this commit updates a number of them.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:24:58 -08:00
Paul E. McKenney
6c7d7dbf5b rcu: Rename sync_rcu_preempt_exp_done() to sync_rcu_exp_done()
Now that the RCU flavors have been consolidated, there is one common
function for checking to see if an expedited RCU grace period has
completed, namely sync_rcu_preempt_exp_done().  Because this function is
no longer specific to RCU-preempt, this commit removes the "_preempt" from
its name.  This commit also changes sync_rcu_preempt_exp_done_unlocked()
to sync_rcu_exp_done_unlocked() for the same reason.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:24:58 -08:00
Neeraj Upadhyay
4bc6b745e5 rcu: Allow only one expedited GP to run concurrently with wakeups
The current expedited RCU grace-period code expects that a task
requesting an expedited grace period cannot awaken until that grace
period has reached the wakeup phase.  However, it is possible for a long
preemption to result in the waiting task never sleeping.  For example,
consider the following sequence of events:

1.	Task A starts an expedited grace period by invoking
	synchronize_rcu_expedited().  It proceeds normally up to the
	wait_event() near the end of that function, and is then preempted
	(or interrupted or whatever).

2.	The expedited grace period completes, and a kworker task starts
	the awaken phase, having incremented the counter and acquired
	the rcu_state structure's .exp_wake_mutex.  This kworker task
	is then preempted or interrupted or whatever.

3.	Task A resumes and enters wait_event(), which notes that the
	expedited grace period has completed, and thus doesn't sleep.

4.	Task B starts an expedited grace period exactly as did Task A,
	complete with the preemption (or whatever delay) just before
	the call to wait_event().

5.	The expedited grace period completes, and another kworker
	task starts the awaken phase, having incremented the counter.
	However, it blocks when attempting to acquire the rcu_state
	structure's .exp_wake_mutex because step 2's kworker task has
	not yet released it.

6.	Steps 4 and 5 repeat, resulting in overflow of the rcu_node
	structure's ->exp_wq[] array.

In theory, this is harmless.  Tasks waiting on the various ->exp_wq[]
array will just be spuriously awakened, but they will just sleep again
on noting that the rcu_state structure's ->expedited_sequence value has
not advanced far enough.

In practice, this wastes CPU time and is an accident waiting to happen.
This commit therefore moves the rcu_exp_gp_seq_end() call that officially
ends the expedited grace period (along with associate tracing) until
after the ->exp_wake_mutex has been acquired.  This prevents Task A from
awakening prematurely, thus preventing more than one expedited grace
period from being in flight during a previous expedited grace period's
wakeup phase.

Fixes: 3b5f668e71 ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
[ paulmck: Added updated comment. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:24:57 -08:00
Neeraj Upadhyay
fd6bc19d76 rcu: Fix missed wakeup of exp_wq waiters
Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:

1.	Tasks A and B both attempt to start an expedited grace
	period at about the same time.	This grace period will have
	completed when the lower four bits of the rcu_state structure's
	->expedited_sequence field are 0b'0100', for example, when the
	initial value of this counter is zero.	Task A wins, and thus
	does the actual work of starting the grace period, including
	acquiring the rcu_state structure's .exp_mutex and sets the
	counter to 0b'0001'.

2.	Because task B lost the race to start the grace period, it
	waits on ->expedited_sequence to reach 0b'0100' inside of
	exp_funnel_lock(). This task therefore blocks on the rcu_node
	structure's ->exp_wq[1] field, keeping in mind that the
	end-of-grace-period value of ->expedited_sequence (0b'0100')
	is shifted down two bits before indexing the ->exp_wq[] field.

3.	Task C attempts to start another expedited grace period,
	but blocks on ->exp_mutex, which is still held by Task A.

4.	The aforementioned expedited grace period completes, so that
	->expedited_sequence now has the value 0b'0100'.  A kworker task
	therefore acquires the rcu_state structure's ->exp_wake_mutex
	and starts awakening any tasks waiting for this grace period.

5.	One of the first tasks awakened happens to be Task A.  Task A
	therefore releases the rcu_state structure's ->exp_mutex,
	which allows Task C to start the next expedited grace period,
	which causes the lower four bits of the rcu_state structure's
	->expedited_sequence field to become 0b'0101'.

6.	Task C's expedited grace period completes, so that the lower four
	bits of the rcu_state structure's ->expedited_sequence field now
	become 0b'1000'.

7.	The kworker task from step 4 above continues its wakeups.
	Unfortunately, the wake_up_all() refetches the rcu_state
	structure's .expedited_sequence field:

	wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

	This results in the wakeup being applied to the rcu_node
	structure's ->exp_wq[2] field, which is unfortunate given that
	Task B is instead waiting on ->exp_wq[1].

On a busy system, no harm is done (or at least no permanent harm is done).
Some later expedited grace period will redo the wakeup.  But on a quiet
system, such as many embedded systems, it might be a good long time before
there was another expedited grace period.  On such embedded systems,
this situation could therefore result in a system hang.

This issue manifested as DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:

	schedule()
	synchronize_rcu_expedited()
	synchronize_rcu()
	scsi_device_quiesce()
	scsi_bus_suspend()
	dpm_run_callback()
	__device_suspend()

This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.

Fixes: 3b5f668e71 ("rcu: Overlap wakeups with next expedited grace period")
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:24:57 -08:00
Paul E. McKenney
aca2991a25 rcu: Substitute lookup for bit-twiddling in sync_rcu_exp_select_node_cpus()
The code in sync_rcu_exp_select_node_cpus() calculates the current
CPU's mask within its rcu_node structure's bitmasks, but this has
already been computed in the ->grpmask field of that CPU's rcu_data
structure.  This commit therefore just uses this ->grpmask field.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:24:57 -08:00
Marco Elver
6cf539a87a rcu: Fix data-race due to atomic_t copy-by-value
This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.

This data-race was found with KCSAN:
==================================================================
BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
 atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
 rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
 dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
 force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
 rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
 rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
 rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
 kthread+0x1b5/0x200 kernel/kthread.c:255
 <snip>

read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
 rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
 rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
 irq_enter+0x5/0x50 kernel/softirq.c:347
 <snip>

Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
==================================================================

Signed-off-by: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-12-09 12:24:56 -08:00
Boqun Feng
9f08cf0886 rcu: Avoid modifying mask_ofl_ipi in sync_rcu_exp_select_node_cpus()
The "mask_ofl_ipi" is used to track which CPUs get IPIed, however
in the IPI sending loop, "mask_ofl_ipi" along with another variable
"mask_ofl_test" might also get modified to record which CPUs' quiesent
states must be reported by the sync_rcu_exp_select_node_cpus() at
the end of sync_rcu_exp_select_node_cpus().  This overlap of roles
can be confusing, so this patch cleans things a little by using
"mask_ofl_ipi" solely for determining which CPUs must be IPIed  and
"mask_ofl_test" for solely determining on behalf of  which CPUs
sync_rcu_exp_select_node_cpus() must report a quiscent state.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Marco Elver <elver@google.com>
2019-12-09 12:24:56 -08:00
Paul E. McKenney
15c7c972cd rcu: Use *_ONCE() to protect lockless ->expmask accesses
The rcu_node structure's ->expmask field is accessed locklessly when
starting a new expedited grace period and when reporting an expedited
RCU CPU stall warning.  This commit therefore handles the former by
taking a snapshot of ->expmask while the lock is held and the latter
by applying READ_ONCE() to lockless reads and WRITE_ONCE() to the
corresponding updates.

Link: https://lore.kernel.org/lkml/CANpmjNNmSOagbTpffHr4=Yedckx9Rm2NuGqC9UqE+AOz5f1-ZQ@mail.gmail.com
Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Marco Elver <elver@google.com>
2019-12-09 12:24:56 -08:00
Amol Grover
cb5172d96d audit: Add __rcu annotation to RCU pointer
Add __rcu annotation to RCU-protected global pointer auditd_conn.

auditd_conn is an RCU-protected global pointer,i.e., accessed
via RCU methods rcu_dereference() and rcu_assign_pointer(),
hence it must be annotated with __rcu for sparse to report
warnings/errors correctly.

Fix multiple instances of the sparse error:
error: incompatible types in comparison expression
(different address spaces)

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Amol Grover <frextrite@gmail.com>
[PM: tweak subject line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-12-09 15:19:03 -05:00
Linus Torvalds
184b8f7f91 pr_warning() removal for 5.5
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl3uHtsACgkQUqAMR0iA
 lPKN3g/9HG6k7NIVtArCy/0kdxLOCr1JZp0EhOCexvCFLSOXInJ82izQVblOA+XE
 +1AceBqF5Akg23u/oLH9tSXBjMULemTyhm7Pnnopn1+bX/bfF+nN6027ltH8ncPY
 25oOP22ulUko3smI2yj2/gmRG5J6z/bUxe/4BMBATsj5YKJs3XGOHtfj2rgVd2qc
 HzPPirH82uOj8sBYxTMeq/+QpCzfJ5HdXXkcJGDuzEW078vb8eRgpcXIqscTBB5Z
 pTsG+Io5/RzfKIDWLs7Eqgg9qSzcGrpXZxPsWpCXXL9nArdeA78ZBLIreDSZIpEt
 QBQMT8Yxy34oW1npw1tFgulxZFjF6np2eg+3a6VX3vqo7DIIAU0gqlTRCde30tNW
 RedFSZIKrlfKfVUJXcBV9sNU57vUx7WwURPnlQxyMmbp0ryf6vhxT3ybPhf8yFXk
 WIKf7PngxLxMRtkRL5rZRDAA6z3/SPg6WkWFzDa/jZKCRRob/uM/35GxWzBk4xR0
 MhCesVDCpM1oB+qgJkJRhyHhddzu3nafxtpjBnrKOUHke+qF5u36BXFDSZZHUSkz
 VXshmaaPcWfHOl2DBGL8SJmYartr/ASvd5TPUbfnqYM1h7+wkB1gZkX8MgSHzc9b
 EYPfmipWh9LsZ1OjSJoFkMIB5bOIaYefOdSUyPDlJwhRuliboWE=
 =d5HL
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.5-pr-warning-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull pr_warning() removal from Petr Mladek.

 - Final removal of the unused pr_warning() alias.

You're supposed to use just "pr_warn()" in the kernel.

* tag 'printk-for-5.5-pr-warning-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  checkpatch: Drop pr_warning check
  printk: Drop pr_warning definition
  Fix up for "printk: Drop pr_warning definition"
  workqueue: Use pr_warn instead of pr_warning
2019-12-09 11:48:21 -08:00
Pankaj Bharadiya
c593642c8b treewide: Use sizeof_field() macro
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().

This patch is generated using following script:

EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do

	if [[ "$file" =~ $EXCLUDE_FILES ]]; then
		continue
	fi
	sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done

Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
2019-12-09 10:36:44 -08:00
Masami Hiramatsu
bf08949cc8 modules: lockdep: Suppress suspicious RCU usage warning
While running kprobe module test, find_module_all() caused
a suspicious RCU usage warning.

-----
 =============================
 WARNING: suspicious RCU usage
 5.4.0-next-20191202+ #63 Not tainted
 -----------------------------
 kernel/module.c:619 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by rmmod/642:
  #0: ffffffff8227da80 (module_mutex){+.+.}, at: __x64_sys_delete_module+0x9a/0x230

 stack backtrace:
 CPU: 0 PID: 642 Comm: rmmod Not tainted 5.4.0-next-20191202+ #63
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
 Call Trace:
  dump_stack+0x71/0xa0
  find_module_all+0xc1/0xd0
  __x64_sys_delete_module+0xac/0x230
  ? do_syscall_64+0x12/0x1f0
  do_syscall_64+0x50/0x1f0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x4b6d49
-----

This is because list_for_each_entry_rcu(modules) is called
without rcu_read_lock(). This is safe because the module_mutex
is locked.

Pass lockdep_is_held(&module_mutex) to the list_for_each_entry_rcu()
to suppress this warning, This also fixes similar issue in
mod_find() and each_symbol_section().

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-12-09 10:51:23 +01:00
Aleksa Sarai
ce623f8987 nsfs: clean-up ns_get_path() signature to return int
ns_get_path() and ns_get_path_cb() only ever return either NULL or an
ERR_PTR. It is far more idiomatic to simply return an integer, and it
makes all of the callers of ns_get_path() more straightforward to read.

Fixes: e149ed2b80 ("take the targets of /proc/*/ns/* symlinks to separate fs")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-08 19:09:37 -05:00
Linus Torvalds
95e6ba5133 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) More jumbo frame fixes in r8169, from Heiner Kallweit.

 2) Fix bpf build in minimal configuration, from Alexei Starovoitov.

 3) Use after free in slcan driver, from Jouni Hogander.

 4) Flower classifier port ranges don't work properly in the HW offload
    case, from Yoshiki Komachi.

 5) Use after free in hns3_nic_maybe_stop_tx(), from Yunsheng Lin.

 6) Out of bounds access in mqprio_dump(), from Vladyslav Tarasiuk.

 7) Fix flow dissection in dsa TX path, from Alexander Lobakin.

 8) Stale syncookie timestampe fixes from Guillaume Nault.

[ Did an evil merge to silence a warning introduced by this pull - Linus ]

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
  r8169: fix rtl_hw_jumbo_disable for RTL8168evl
  net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add()
  r8169: add missing RX enabling for WoL on RTL8125
  vhost/vsock: accept only packets with the right dst_cid
  net: phy: dp83867: fix hfs boot in rgmii mode
  net: ethernet: ti: cpsw: fix extra rx interrupt
  inet: protect against too small mtu values.
  gre: refetch erspan header from skb->data after pskb_may_pull()
  pppoe: remove redundant BUG_ON() check in pppoe_pernet
  tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
  tcp: tighten acceptance of ACKs not matching a child socket
  tcp: fix rejected syncookies due to stale timestamps
  lpc_eth: kernel BUG on remove
  tcp: md5: fix potential overestimation of TCP option space
  net: sched: allow indirect blocks to bind to clsact in TC
  net: core: rename indirect block ingress cb function
  net-sysfs: Call dev_hold always in netdev_queue_add_kobject
  net: dsa: fix flow dissection on Tx path
  net/tls: Fix return values to avoid ENOTSUPP
  net: avoid an indirect call in ____sys_recvmsg()
  ...
2019-12-08 13:28:11 -08:00
Sebastian Andrzej Siewior
025f50f386 sched/rt, workqueue: Use PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Update the comment to use PREEMPTION because it is true for both
preemption models.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20191015191821.11479-35-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-08 14:37:37 +01:00
Sebastian Andrzej Siewior
1b40cd56f3 sched/rt, locking: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the Kconfig dependency to use CONFIG_PREEMPTION.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20191015191821.11479-32-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-08 14:37:36 +01:00
Ingo Molnar
4f797f56c3 Merge branch 'linus' into sched/urgent, to pick up the latest before merging new patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-08 14:37:10 +01:00
Stephen Rothwell
ee19545220 Fix up for "printk: Drop pr_warning definition"
Link: http://lkml.kernel.org/r/20191206092503.303d6a57@canb.auug.org.au
Cc: Linux Next Mailing List <linux-next@vger.kernel.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-12-06 10:02:35 +01:00
Kefeng Wang
1d9a6159bd workqueue: Use pr_warn instead of pr_warning
Use pr_warn() instead of the remaining pr_warning() calls.

Link: http://lkml.kernel.org/r/20191128004752.35268-2-wangkefeng.wang@huawei.com
To: joe@perches.com
To: linux-kernel@vger.kernel.org
Cc: gregkh@linuxfoundation.org
Cc: tj@kernel.org
Cc: arnd@arndb.de
Cc: sergey.senozhatsky@gmail.com
Cc: rostedt@goodmis.org
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-12-06 09:59:30 +01:00
Linus Torvalds
0f13741624 Modules updates for v5.5
Summary of modules changes for the 5.5 merge window:
 
 - Refactor include/linux/export.h and remove code duplication between
   EXPORT_SYMBOL and EXPORT_SYMBOL_NS to make it more readable. The most
   notable change is that no namespace is represented by an empty string ""
   rather than NULL.
 
 - Fix a module load/unload race where waiter(s) trying to load the same
   module weren't being woken up when a module finally goes away.
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJd6V3yAAoJEMBFfjjOO8FyeGEP/0Ue4uNehrDYQ6wHaLJOeSA3
 GEcraILbsT4v/9HqVbIaMH2idwwCI5xY6XlyDADaimYkEvs3jLOSsVEpjTvnjt0s
 DnNFR2vm+JsWVmS4jsmij2T6RgVfZq062RWJA1HvxtCsZWHFOttQe3gh9s/ycFAv
 UwGk0FUr4E78pUYNj+zQ35j4/L/C3Va2vC3VwSV4ND0kVTBrqcVHV6g3K409vgb8
 /ZD8/cFwVvOvGSK47M4r+Xt2X/57A/Cb0RgjvKHRfvONfyranKv9WlqM6Y6DXlZ0
 Su7eIo5kAH40/LUR2ludTSHLNcr/PWM4W2q8q81+gqF4h3KitYXXARWjKSLDwo/8
 nEq/rxJzEDX0bIgnSyU3t+ZqK2JonAF0a1D53otPPaSvTMPe1Gz48//cD6TGc3np
 xxLDZEPne/vbNUy3z2K1tXoWbxdThAhtCb8qOilVZBitPtnQpmUt2eyn1/2snoBR
 uerB/S8B48YI1TGxuK6Ksy5QIuJk9DG2o33nD5PPHe5dKEZQPAmSJDEwVaLzpW9b
 t9JoHo+H6BefKj0Sexf+1jlK9WKJEwGpqhZqfRkosACelxJJ3Ap3nuMsdNuZY+6U
 rAG8N322HV5x50weIId+t8AP8cdS+vRfh8PgvpvHY8YYXXeagOK49+snkJLAMgw0
 9Px3j20sNSmFfYloNUzW
 =XDci
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:
 "Summary of modules changes for the 5.5 merge window:

   - Refactor include/linux/export.h and remove code duplication between
     EXPORT_SYMBOL and EXPORT_SYMBOL_NS to make it more readable.

     The most notable change is that no namespace is represented by an
     empty string "" rather than NULL.

   - Fix a module load/unload race where waiter(s) trying to load the
     same module weren't being woken up when a module finally goes away"

* tag 'modules-for-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  kernel/module.c: wakeup processes in module_wq on module unload
  moduleparam: fix parameter description mismatch
  export: avoid code duplication in include/linux/export.h
2019-12-05 12:27:16 -08:00
Linus Torvalds
fb3da48a86 Merge branch 'thermal/next' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux
Pull thermal management updates from Zhang Rui:

 - Fix a deadlock regression in thermal core framework, which was
   introduced in 5.3 (Wei Wang)

 - Initialize thermal control framework earlier to enable thermal
   mitigation during boot (Amit Kucheria)

 - Convert the Intelligent Power Allocator (IPA) thermal governor to
   follow the generic PM_EM instead of its own Energy Model (Quentin
   Perret)

 - Introduce a new Amlogic soc thermal driver (Guillaume La Roque)

 - Add interrupt support for tsens thermal driver (Amit Kucheria)

 - Add support for MSM8956/8976 in tsens thermal driver
   (AngeloGioacchino Del Regno)

 - Add support for r8a774b1 in rcar thermal driver (Biju Das)

 - Add support for Thermal Monitor Unit v2 in qoriq thermal driver
   (Yuantian Tang)

 - Some other fixes/cleanups on thermal core framework and soc thermal
   drivers (Colin Ian King, Daniel Lezcano, Hsin-Yi Wang, Tian Tao)

* 'thermal/next' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (32 commits)
  thermal: Fix deadlock in thermal thermal_zone_device_check
  thermal: cpu_cooling: Migrate to using the EM framework
  thermal: cpu_cooling: Make the power-related code depend on IPA
  PM / EM: Declare EM data types unconditionally
  arm64: defconfig: Enable CONFIG_ENERGY_MODEL
  drivers: thermal: tsens: fix potential integer overflow on multiply
  thermal: cpu_cooling: Reorder the header file
  thermal: cpu_cooling: Remove pointless dependency on CONFIG_OF
  thermal: no need to set .owner when using module_platform_driver
  thermal: qcom: tsens-v1: Fix kfree of a non-pointer value
  cpufreq: qcom-hw: Move driver initialization earlier
  clk: qcom: Initialize clock drivers earlier
  cpufreq: Initialize cpufreq-dt driver earlier
  cpufreq: Initialize the governors in core_initcall
  thermal: Initialize thermal subsystem earlier
  thermal: Remove netlink support
  dt: thermal: tsens: Document compatible for MSM8976/56
  thermal: qcom: tsens-v1: Add support for MSM8956 and MSM8976
  MAINTAINERS: add entry for Amlogic Thermal driver
  thermal: amlogic: Add thermal driver to support G12 SoCs
  ...
2019-12-05 11:21:24 -08:00
Linus Torvalds
5ecc9d15f7 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "Most of the rest of MM and various other things. Some Kconfig rework
  still awaits merges of dependent trees from linux-next.

  Subsystems affected by this patch series: mm/hotfixes, mm/memcg,
  mm/vmstat, mm/thp, procfs, sysctl, misc, notifiers, core-kernel,
  bitops, lib, checkpatch, epoll, binfmt, init, rapidio, uaccess, kcov,
  ubsan, ipc, bitmap, mm/pagemap"

* akpm: (86 commits)
  mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h
  um: add support for folded p4d page tables
  um: remove unused pxx_offset_proc() and addr_pte() functions
  sparc32: use pgtable-nopud instead of 4level-fixup
  parisc/hugetlb: use pgtable-nopXd instead of 4level-fixup
  parisc: use pgtable-nopXd instead of 4level-fixup
  nds32: use pgtable-nopmd instead of 4level-fixup
  microblaze: use pgtable-nopmd instead of 4level-fixup
  m68k: mm: use pgtable-nopXd instead of 4level-fixup
  m68k: nommu: use pgtable-nopud instead of 4level-fixup
  c6x: use pgtable-nopud instead of 4level-fixup
  arm: nommu: use pgtable-nopud instead of 4level-fixup
  alpha: use pgtable-nopud instead of 4level-fixup
  gpio: pca953x: tighten up indentation
  gpio: pca953x: convert to use bitmap API
  gpio: pca953x: use input from regs structure in pca953x_irq_pending()
  gpio: pca953x: remove redundant variable and check in IRQ handler
  lib/bitmap: introduce bitmap_replace() helper
  lib/test_bitmap: fix comment about this file
  lib/test_bitmap: move exp1 and exp2 upper for others to use
  ...
2019-12-05 09:46:26 -08:00
Yonghong Song
e9eeec58c9 bpf: Fix a bug when getting subprog 0 jited image in check_attach_btf_id
For jited bpf program, if the subprogram count is 1, i.e.,
there is no callees in the program, prog->aux->func will be NULL
and prog->bpf_func points to image address of the program.

If there is more than one subprogram, prog->aux->func is populated,
and subprogram 0 can be accessed through either prog->bpf_func or
prog->aux->func[0]. Other subprograms should be accessed through
prog->aux->func[subprog_id].

This patch fixed a bug in check_attach_btf_id(), where
prog->aux->func[subprog_id] is used to access any subprogram which
caused a segfault like below:
  [79162.619208] BUG: kernel NULL pointer dereference, address:
  0000000000000000
  ......
  [79162.634255] Call Trace:
  [79162.634974]  ? _cond_resched+0x15/0x30
  [79162.635686]  ? kmem_cache_alloc_trace+0x162/0x220
  [79162.636398]  ? selinux_bpf_prog_alloc+0x1f/0x60
  [79162.637111]  bpf_prog_load+0x3de/0x690
  [79162.637809]  __do_sys_bpf+0x105/0x1740
  [79162.638488]  do_syscall_64+0x5b/0x180
  [79162.639147]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ......

Fixes: 5b92a28aae ("bpf: Support attaching tracing BPF program to other BPF programs")
Reported-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191205010606.177774-1-yhs@fb.com
2019-12-04 21:20:07 -08:00
Andrey Konovalov
eec028c938 kcov: remote coverage support
Patch series " kcov: collect coverage from usb and vhost", v3.

This patchset extends kcov to allow collecting coverage from backgound
kernel threads.  This extension requires custom annotations for each of
the places where coverage collection is desired.  This patchset
implements this for hub events in the USB subsystem and for vhost
workers.  See the first patch description for details about the kcov
extension.  The other two patches apply this kcov extension to USB and
vhost.

Examples of other subsystems that might potentially benefit from this
when custom annotations are added (the list is based on
process_one_work() callers for bugs recently reported by syzbot):

1. fs: writeback wb_workfn() worker,
2. net: addrconf_dad_work()/addrconf_verify_work() workers,
3. net: neigh_periodic_work() worker,
4. net/p9: p9_write_work()/p9_read_work() workers,
5. block: blk_mq_run_work_fn() worker.

These patches have been used to enable coverage-guided USB fuzzing with
syzkaller for the last few years, see the details here:

  https://github.com/google/syzkaller/blob/master/docs/linux/external_fuzzing_usb.md

This patchset has been pushed to the public Linux kernel Gerrit
instance:

  https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/1524

This patch (of 3):

Add background thread coverage collection ability to kcov.

With KCOV_ENABLE coverage is collected only for syscalls that are issued
from the current process.  With KCOV_REMOTE_ENABLE it's possible to
collect coverage for arbitrary parts of the kernel code, provided that
those parts are annotated with kcov_remote_start()/kcov_remote_stop().

This allows to collect coverage from two types of kernel background
threads: the global ones, that are spawned during kernel boot in a
limited number of instances (e.g.  one USB hub_event() worker thread is
spawned per USB HCD); and the local ones, that are spawned when a user
interacts with some kernel interface (e.g.  vhost workers).

To enable collecting coverage from a global background thread, a unique
global handle must be assigned and passed to the corresponding
kcov_remote_start() call.  Then a userspace process can pass a list of
such handles to the KCOV_REMOTE_ENABLE ioctl in the handles array field
of the kcov_remote_arg struct.  This will attach the used kcov device to
the code sections, that are referenced by those handles.

Since there might be many local background threads spawned from
different userspace processes, we can't use a single global handle per
annotation.  Instead, the userspace process passes a non-zero handle
through the common_handle field of the kcov_remote_arg struct.  This
common handle gets saved to the kcov_handle field in the current
task_struct and needs to be passed to the newly spawned threads via
custom annotations.  Those threads should in turn be annotated with
kcov_remote_start()/kcov_remote_stop().

Internally kcov stores handles as u64 integers.  The top byte of a
handle is used to denote the id of a subsystem that this handle belongs
to, and the lower 4 bytes are used to denote the id of a thread instance
within that subsystem.  A reserved value 0 is used as a subsystem id for
common handles as they don't belong to a particular subsystem.  The
bytes 4-7 are currently reserved and must be zero.  In the future the
number of bytes used for the subsystem or handle ids might be increased.

When a particular userspace process collects coverage by via a common
handle, kcov will collect coverage for each code section that is
annotated to use the common handle obtained as kcov_handle from the
current task_struct.  However non common handles allow to collect
coverage selectively from different subsystems.

Link: http://lkml.kernel.org/r/e90e315426a384207edbec1d6aa89e43008e4caf.1572366574.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: David Windsor <dwindsor@gmail.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:14 -08:00
Huang Shijie
964975ac66 lib/genalloc.c: rename addr_in_gen_pool to gen_pool_has_addr
Follow the kernel conventions, rename addr_in_gen_pool to
gen_pool_has_addr.

[sjhuang@iluvatar.ai: fix Documentation/ too]
 Link: http://lkml.kernel.org/r/20181229015914.5573-1-sjhuang@iluvatar.ai
Link: http://lkml.kernel.org/r/20181228083950.20398-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:13 -08:00
Joe Perches
5e1aada08c kernel/sys.c: avoid copying possible padding bytes in copy_to_user
Initialization is not guaranteed to zero padding bytes so use an
explicit memset instead to avoid leaking any kernel content in any
possible padding bytes.

Link: http://lkml.kernel.org/r/dfa331c00881d61c8ee51577a082d8bebd61805c.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:12 -08:00
Nathan Chancellor
ef70eff9de kernel/profile.c: use cpumask_available to check for NULL cpumask
When building with clang + -Wtautological-pointer-compare, these
instances pop up:

  kernel/profile.c:339:6: warning: comparison of array 'prof_cpu_mask' not equal to a null pointer is always true [-Wtautological-pointer-compare]
          if (prof_cpu_mask != NULL)
              ^~~~~~~~~~~~~    ~~~~
  kernel/profile.c:376:6: warning: comparison of array 'prof_cpu_mask' not equal to a null pointer is always true [-Wtautological-pointer-compare]
          if (prof_cpu_mask != NULL)
              ^~~~~~~~~~~~~    ~~~~
  kernel/profile.c:406:26: warning: comparison of array 'prof_cpu_mask' not equal to a null pointer is always true [-Wtautological-pointer-compare]
          if (!user_mode(regs) && prof_cpu_mask != NULL &&
                                ^~~~~~~~~~~~~    ~~~~
  3 warnings generated.

This can be addressed with the cpumask_available helper, introduced in
commit f7e30f01a9 ("cpumask: Add helper cpumask_available()") to fix
warnings like this while keeping the code the same.

Link: https://github.com/ClangBuiltLinux/linux/issues/747
Link: http://lkml.kernel.org/r/20191022191957.9554-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:12 -08:00
Xiaoming Ni
260a2679e5 kernel/notifier.c: remove blocking_notifier_chain_cond_register()
blocking_notifier_chain_cond_register() does not consider system_booting
state, which is the only difference between this function and
blocking_notifier_cain_register().  This can be a bug and is a piece of
duplicate code.

Delete blocking_notifier_chain_cond_register()

Link: http://lkml.kernel.org/r/1568861888-34045-4-git-send-email-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:12 -08:00
Xiaoming Ni
5adaabb65a kernel/notifier.c: remove notifier_chain_cond_register()
The only difference between notifier_chain_cond_register() and
notifier_chain_register() is the lack of warning hints for duplicate
registrations.  Use notifier_chain_register() instead of
notifier_chain_cond_register() to avoid duplicate code

Link: http://lkml.kernel.org/r/1568861888-34045-3-git-send-email-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:12 -08:00
Xiaoming Ni
1a50cb80f2 kernel/notifier.c: intercept duplicate registrations to avoid infinite loops
Registering the same notifier to a hook repeatedly can cause the hook
list to form a ring or lose other members of the list.

  case1: An infinite loop in notifier_chain_register() can cause soft lockup
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test2);

  case2: An infinite loop in notifier_chain_register() can cause soft lockup
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_call_chain(&test_notifier_list, 0, NULL);

  case3: lose other hook test2
          atomic_notifier_chain_register(&test_notifier_list, &test1);
          atomic_notifier_chain_register(&test_notifier_list, &test2);
          atomic_notifier_chain_register(&test_notifier_list, &test1);

  case4: Unregister returns 0, but the hook is still in the linked list,
         and it is not really registered. If you call
         notifier_call_chain after ko is unloaded, it will trigger oops.

If the system is configured with softlockup_panic and the same hook is
repeatedly registered on the panic_notifier_list, it will cause a loop
panic.

Add a check in notifier_chain_register(), intercepting duplicate
registrations to avoid infinite loops

Link: http://lkml.kernel.org/r/1568861888-34045-2-git-send-email-nixiaoming@huawei.com
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Sam Protsenko <semen.protsenko@linaro.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:12 -08:00
Linus Torvalds
2f13437b89 Two fixes and one patch that was missed:
Fixes:
 
   - Missing __print_hex_dump undef for processing new function in trace events
   - Stop WARN_ON messages when lockdown disables tracing on boot up
 
  Enhancement:
 
   - Debug option to inject trace events from userspace (for rasdaemon)
 
 The enhancement has its own config option and is non invasive. It's been
 discussed for sever months and should have been added to my original
 push, but I never pulled it into my queue.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXehlhRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlGgAQCszcDuNyVllj0VwWi4i+0FAQcI12Ad
 W0NGZm0wObGExwD8CDR/CdHq9ulizFQjJfopG6b5Uc3Z4NNJ+QGnMxzBuwo=
 =k31z
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull more tracing updates from Steven Rostedt:
 "Two fixes and one patch that was missed:

  Fixes:

   - Missing __print_hex_dump undef for processing new function in trace
     events

   - Stop WARN_ON messages when lockdown disables tracing on boot up

  Enhancement:

   - Debug option to inject trace events from userspace (for rasdaemon)"

The enhancement has its own config option and is non invasive. It's been
discussed for sever months and should have been added to my original
push, but I never pulled it into my queue.

* tag 'trace-v5.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Do not create directories if lockdown is in affect
  tracing: Introduce trace event injection
  tracing: Fix __print_hex_dump scope
2019-12-04 19:13:52 -08:00
Linus Torvalds
ef867c12f3 Additional power management updates for 5.5-rc1
- Avoid a race condition in the ACPI EC driver that may cause
    systems to be unable to leave suspend-to-idle (Rafael Wysocki).
 
  - Drop the "disabled" field, which is redundant, from struct
    cpuidle_state (Rafael Wysocki).
 
  - Reintroduce device PM QoS frequency constraints (temporarily
    introduced and than dropped during the 5.4 cycle) in preparation
    for adding QoS support to devfreq (Leonard Crestez).
 
  - Clean up indentation (in multiple places) and the cpuidle drivers
    help text in Kconfig (Krzysztof Kozlowski, Randy Dunlap).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl3nhpQSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxQj4P/2HbVROWMON7q9iWhgO59qABEbqU8M7L
 DaJ2gu+bDe3FQ9Ek6Y2EObfGw3nl9riyGbZH/jVmcOkbuXE+aQXv/j7eEnM9G35+
 8+JSfhucVsohaHVxT2ROMv+7YD+pLyWK1ivuVK/dNcvmxQaC9CKrmn3GF2ujkqNR
 ahdRRzZobGeC6mc8tms3GYpWkd1R5zd74ALGVsw9i/eB3P/YgrlS8HaQynpbaflZ
 qhRKZgsTf8QD6+OG+6HQhWpOfAlG36dsJnvuk0Oa0Cpnw+Zfj6WoR1jpL9ufNWBM
 Re1faTfppy6Hnyxr62Ytkbq2pYozTVAnQM+TKNIGoqxA4OIXvhgQpBqApmuJXpRx
 ZFBfr943f7I2jmAAznHeiW9l3n+4h725rpoxKapnlO3OMRDwCTqxbMahiS+CDULd
 gSu4prnoBdd9WrwiR7M1PA4X2Eb2M0kYFQUr7BltlTgjLHjQy47Mnazh9WxYBAv8
 p1tip39QHeZcdO3rdW1O21ljNekEIOFAi5bVVECsR6RyA+KR+vHgFP9pMUWyCpgU
 +rde+MdGKIL3sw/szNhTTDfQ49vz/ObcipJg3/rakq6jXeFL4n5NwMy5jYrquPlx
 xxHx3Yp1PCBEZ1TXS6+JjznvQBU/G/7YvoWobpqwN/IL1wa55rWOX8Ah1+YnfLzF
 fGzh0EvPJKyM
 =KAyd
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull additional power management updates from Rafael Wysocki:
 "These fix an ACPI EC driver bug exposed by the recent rework of the
  suspend-to-idle code flow, reintroduce frequency constraints into
  device PM QoS (in preparation for adding QoS support to devfreq), drop
  a redundant field from struct cpuidle_state and clean up Kconfig in
  some places.

  Specifics:

   - Avoid a race condition in the ACPI EC driver that may cause systems
     to be unable to leave suspend-to-idle (Rafael Wysocki)

   - Drop the "disabled" field, which is redundant, from struct
     cpuidle_state (Rafael Wysocki)

   - Reintroduce device PM QoS frequency constraints (temporarily
     introduced and than dropped during the 5.4 cycle) in preparation
     for adding QoS support to devfreq (Leonard Crestez)

   - Clean up indentation (in multiple places) and the cpuidle drivers
     help text in Kconfig (Krzysztof Kozlowski, Randy Dunlap)"

* tag 'pm-5.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PM: s2idle: Rework ACPI events synchronization
  ACPI: EC: Rework flushing of pending work
  PM / devfreq: Add missing locking while setting suspend_freq
  PM / QoS: Restore DEV_PM_QOS_MIN/MAX_FREQUENCY
  PM / QoS: Reorder pm_qos/freq_qos/dev_pm_qos structs
  PM / QoS: Initial kunit test
  PM / QoS: Redefine FREQ_QOS_MAX_DEFAULT_VALUE to S32_MAX
  power: avs: Fix Kconfig indentation
  cpufreq: Fix Kconfig indentation
  cpuidle: minor Kconfig help text fixes
  cpuidle: Drop disabled field from struct cpuidle_state
  cpuidle: Fix Kconfig indentation
2019-12-04 10:48:09 -08:00
Christian Brauner
0b8d616fb5
taskstats: fix data-race
When assiging and testing taskstats in taskstats_exit() there's a race
when setting up and reading sig->stats when a thread-group with more
than one thread exits:

write to 0xffff8881157bbe10 of 8 bytes by task 7951 on cpu 0:
 taskstats_tgid_alloc kernel/taskstats.c:567 [inline]
 taskstats_exit+0x6b7/0x717 kernel/taskstats.c:596
 do_exit+0x2c2/0x18e0 kernel/exit.c:864
 do_group_exit+0xb4/0x1c0 kernel/exit.c:983
 get_signal+0x2a2/0x1320 kernel/signal.c:2734
 do_signal+0x3b/0xc00 arch/x86/kernel/signal.c:815
 exit_to_usermode_loop+0x250/0x2c0 arch/x86/entry/common.c:159
 prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
 do_syscall_64+0x2d7/0x2f0 arch/x86/entry/common.c:299
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

read to 0xffff8881157bbe10 of 8 bytes by task 7949 on cpu 1:
 taskstats_tgid_alloc kernel/taskstats.c:559 [inline]
 taskstats_exit+0xb2/0x717 kernel/taskstats.c:596
 do_exit+0x2c2/0x18e0 kernel/exit.c:864
 do_group_exit+0xb4/0x1c0 kernel/exit.c:983
 __do_sys_exit_group kernel/exit.c:994 [inline]
 __se_sys_exit_group kernel/exit.c:992 [inline]
 __x64_sys_exit_group+0x2e/0x30 kernel/exit.c:992
 do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by using smp_load_acquire() and smp_store_release().

Reported-by: syzbot+c5d03165a1bd1dead0c1@syzkaller.appspotmail.com
Fixes: 34ec12349c ("taskstats: cleanup ->signal->stats allocation")
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Will Deacon <will@kernel.org>
Reviewed-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20191009114809.8643-1-christian.brauner@ubuntu.com
2019-12-04 15:18:39 +01:00
Steven Rostedt (VMware)
a356646a56 tracing: Do not create directories if lockdown is in affect
If lockdown is disabling tracing on boot up, it prevents the tracing files
from even bering created. But when that happens, there's several places that
will give a warning that the files were not created as that is usually a
sign of a bug.

Add in strategic locations where a check is made to see if tracing is
disabled by lockdown, and if it is, do not go further, and fail silently
(but print that tracing is disabled by lockdown, without doing a WARN_ON()).

Cc: Matthew Garrett <mjg59@google.com>
Fixes: 17911ff38a ("tracing: Add locked_down checks to the open calls of files created for tracefs")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-04 08:27:15 -05:00
Linus Torvalds
043cf46825 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "The main changes in the timer code in this cycle were:

   - Clockevent updates:

      - timer-of framework cleanups. (Geert Uytterhoeven)

      - Use timer-of for the renesas-ostm and the device name to prevent
        name collision in case of multiple timers. (Geert Uytterhoeven)

      - Check if there is an error after calling of_clk_get in asm9260
        (Chuhong Yuan)

   - ABI fix: Zero out high order bits of nanoseconds on compat
     syscalls. This got broken a year ago, with apparently no side
     effects so far.

     Since the kernel would use random data otherwise I don't think we'd
     have other options but to fix the bug, even if there was a side
     effect to applications (Dmitry Safonov)

   - Optimize ns_to_timespec64() on 32-bit systems: move away from
     div_s64_rem() which can be slow, to div_u64_rem() which is faster
     (Arnd Bergmann)

   - Annotate KCSAN-reported false positive data races in
     hrtimer_is_queued() users by moving timer->state handling over to
     the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses
     (Eric Dumazet)

   - Misc cleanups and small fixes"

[ I undid the "ABI fix" and updated the comments instead. The reason
  there were apparently no side effects is that the fix was a no-op.

  The updated comment is to say _why_ it was a no-op.    - Linus ]

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time: Zero the upper 32-bits in __kernel_timespec on 32-bit
  time: Rename tsk->real_start_time to ->start_boottime
  hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ
  time: Fix spelling mistake in comment
  time: Optimize ns_to_timespec64()
  hrtimer: Annotate lockless access to timer->state
  clocksource/drivers/asm9260: Add a check for of_clk_get
  clocksource/drivers/renesas-ostm: Use unique device name instead of ostm
  clocksource/drivers/renesas-ostm: Convert to timer_of
  clocksource/drivers/timer-of: Use unique device name instead of timer
  clocksource/drivers/timer-of: Convert last full_name to %pOF
2019-12-03 12:20:25 -08:00
Linus Torvalds
b22bfea7f1 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Ingo Molnar:
 "Most of the IRQ subsystem changes in this cycle were irq-chip driver
  updates:

   - Qualcomm PDC wakeup interrupt support

   - Layerscape external IRQ support

   - Broadcom bcm7038 PM and wakeup support

   - Ingenic driver cleanup and modernization

   - GICv3 ITS preparation for GICv4.1 updates

   - GICv4 fixes

  There's also the series from Frederic Weisbecker that fixes memory
  ordering bugs for the irq-work logic, whose primary fix is to turn
  work->irq_work.flags into an atomic variable and then convert the
  complex (and buggy) atomic_cmpxchg() loop in irq_work_claim() into a
  much simpler atomic_fetch_or() call.

  There are also various smaller cleanups"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  pinctrl/sdm845: Add PDC wakeup interrupt map for GPIOs
  pinctrl/msm: Setup GPIO chip in hierarchy
  irqchip/qcom-pdc: Add irqchip set/get state calls
  irqchip/qcom-pdc: Add irqdomain for wakeup capable GPIOs
  irqchip/qcom-pdc: Do not toggle IRQ_ENABLE during mask/unmask
  irqchip/qcom-pdc: Update max PDC interrupts
  of/irq: Document properties for wakeup interrupt parent
  genirq: Introduce irq_chip_get/set_parent_state calls
  irqdomain: Add bus token DOMAIN_BUS_WAKEUP
  genirq: Fix function documentation of __irq_alloc_descs()
  irq_work: Fix IRQ_WORK_BUSY bit clearing
  irqchip/ti-sci-inta: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
  irq_work: Slightly simplify IRQ_WORK_PENDING clearing
  irq_work: Fix irq_work_claim() memory ordering
  irq_work: Convert flags to atomic_t
  irqchip: Ingenic: Add process for more than one irq at the same time.
  irqchip: ingenic: Alloc generic chips from IRQ domain
  irqchip: ingenic: Get virq number from IRQ domain
  irqchip: ingenic: Error out if IRQ domain creation failed
  irqchip: ingenic: Drop redundant irq_suspend / irq_resume functions
  ...
2019-12-03 09:29:50 -08:00
Linus Torvalds
76bb8b0596 Kbuild updates for v5.5
- remove unneeded asm headers from hexagon, ia64
 
  - add 'dir-pkg' target, which works like 'tar-pkg' but skips archiving
 
  - add 'helpnewconfig' target, which shows help for new CONFIG options
 
  - support 'make nsdeps' for external modules
 
  - make rebuilds faster by deleting $(wildcard $^) checks
 
  - remove compile tests for kernel-space headers
 
  - refactor modpost to simplify modversion handling
 
  - make single target builds faster
 
  - optimize and clean up scripts/kallsyms.c
 
  - refactor various Makefiles and scripts
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl3lKCUVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGu9sP/iTW/RjDxbAsu0aP8jFqzLK/xKB/
 NQn/+dD76TjEmjgew9AXszf2rJL+ixKVymGM08FV59Bbguvi8XmAB/QXK21Sjb5j
 rVl3N97TWNkvXM+QJyly23G2UtbubRSPo3g+e70BZrw3lcmrsK+sAmTOL5KtIrNX
 9BHM803JwqsMJyvBwTBBw3UFeeBqb38Qx6gmigfSihuDf6pvjoVDKskpsDno3wX7
 rdiXYxAsKQLQ/P2ym/bV/Oqe90RqRtV/2/WCpLshlwHkiM9huflv6GjgCkkbAx5H
 N3TSptlS7l/2B/XKHgA5ALjHjUlxTGBzLLoevarCd8loKcQXFlgx+vd3nM/WJlHJ
 x9UpTklDwGP9eUBsa9W980tEyUVsFGMAC8EcTdW6NN2IRtuCOSA5N2FYYt8/SDd0
 2b3PhElTJIp4pTWSYN6JZxB1R8n/YBgxLqOJ6N2U6B9CdKFUCHlwGH23QfN89km/
 WEMP85bsaab/dnyxbwelkoYYYyPgUHsC13AbpkHdrDxMbAGO+G1PwpHxC6ErF2en
 wRGrcUxWTfHRykO5aJIQtCB9b1fv73134mTzB5fTYd6GtjepGBSBCO9xb2Iy4sc9
 Y+nHVVDUrihvSOpJgqh677PcLDutOZR8fFCoc1ZMDAbBsDvrb0Qsee6oEidj98xc
 5kXp9YZh/tdh/tdo
 =zUaB
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - remove unneeded asm headers from hexagon, ia64

 - add 'dir-pkg' target, which works like 'tar-pkg' but skips archiving

 - add 'helpnewconfig' target, which shows help for new CONFIG options

 - support 'make nsdeps' for external modules

 - make rebuilds faster by deleting $(wildcard $^) checks

 - remove compile tests for kernel-space headers

 - refactor modpost to simplify modversion handling

 - make single target builds faster

 - optimize and clean up scripts/kallsyms.c

 - refactor various Makefiles and scripts

* tag 'kbuild-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (59 commits)
  MAINTAINERS: update Kbuild/Kconfig maintainer's email address
  scripts/kallsyms: remove redundant initializers
  scripts/kallsyms: put check_symbol_range() calls close together
  scripts/kallsyms: make check_symbol_range() void function
  scripts/kallsyms: move ignored symbol types to is_ignored_symbol()
  scripts/kallsyms: move more patterns to the ignored_prefixes array
  scripts/kallsyms: skip ignored symbols very early
  scripts/kallsyms: add const qualifiers where possible
  scripts/kallsyms: make find_token() return (unsigned char *)
  scripts/kallsyms: replace prefix_underscores_count() with strspn()
  scripts/kallsyms: add sym_name() to mitigate cast ugliness
  scripts/kallsyms: remove unneeded length check for prefix matching
  scripts/kallsyms: remove redundant is_arm_mapping_symbol()
  scripts/kallsyms: set relative_base more effectively
  scripts/kallsyms: shrink table before sorting it
  scripts/kallsyms: fix definitely-lost memory leak
  scripts/kallsyms: remove unneeded #ifndef ARRAY_SIZE
  kbuild: make single target builds even faster
  modpost: respect the previous export when 'exported twice' is warned
  modpost: do not set ->preloaded for symbols from Module.symvers
  ...
2019-12-02 17:35:04 -08:00
David S. Miller
734c7022ad Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-12-02

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 6 day(s) which contain
a total of 10 files changed, 60 insertions(+), 51 deletions(-).

The main changes are:

1) Fix vmlinux BTF generation for binutils pre v2.25, from Stanislav Fomichev.

2) Fix libbpf global variable relocation to take symbol's st_value offset
   into account, from Andrii Nakryiko.

3) Fix libbpf build on powerpc where check_abi target fails due to different
   readelf output format, from Aurelien Jarno.

4) Don't set BPF insns RO for the case when they are JITed in order to avoid
   fragmenting the direct map, from Daniel Borkmann.

5) Fix static checker warning in btf_distill_func_proto() as well as a build
   error due to empty enum when BPF is compiled out, from Alexei Starovoitov.

6) Fix up generation of bpf_helper_defs.h for perf, from Arnaldo Carvalho de Melo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-02 10:50:29 -08:00
Cong Wang
6c3edaf9fd tracing: Introduce trace event injection
We have been trying to use rasdaemon to monitor hardware errors like
correctable memory errors. rasdaemon uses trace events to monitor
various hardware errors. In order to test it, we have to inject some
hardware errors, unfortunately not all of them provide error
injections. MCE does provide a way to inject MCE errors, but errors
like PCI error and devlink error don't, it is not easy to add error
injection to each of them. Instead, it is relatively easier to just
allow users to inject trace events in a generic way so that all trace
events can be injected.

This patch introduces trace event injection, where a new 'inject' is
added to each tracepoint directory. Users could write into this file
with key=value pairs to specify the value of each fields of the trace
event, all unspecified fields are set to zero values by default.

For example, for the net/net_dev_queue tracepoint, we can inject:

  INJECT=/sys/kernel/debug/tracing/events/net/net_dev_queue/inject
  echo "" > $INJECT
  echo "name='test'" > $INJECT
  echo "name='test' len=1024" > $INJECT
  cat /sys/kernel/debug/tracing/trace
  ...
   <...>-614   [000] ....    36.571483: net_dev_queue: dev= skbaddr=00000000fbf338c2 len=0
   <...>-614   [001] ....   136.588252: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=0
   <...>-614   [001] .N..   208.431878: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=1024

Triggers could be triggered as usual too:

  echo "stacktrace if len == 1025" > /sys/kernel/debug/tracing/events/net/net_dev_queue/trigger
  echo "len=1025" > $INJECT
  cat /sys/kernel/debug/tracing/trace
  ...
      bash-614   [000] ....    36.571483: net_dev_queue: dev= skbaddr=00000000fbf338c2 len=0
      bash-614   [001] ....   136.588252: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=0
      bash-614   [001] .N..   208.431878: net_dev_queue: dev=test skbaddr=00000000fbf338c2 len=1024
      bash-614   [001] .N.1   284.236349: <stack trace>
 => event_inject_write
 => vfs_write
 => ksys_write
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

The only thing that can't be injected is string pointers as they
require constant string pointers, this can't be done at run time.

Link: http://lkml.kernel.org/r/20191130045218.18979-1-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-12-02 11:07:00 -05:00
Linus Torvalds
596cf45cbf Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "Incoming:

   - a small number of updates to scripts/, ocfs2 and fs/buffer.c

   - most of MM

  I still have quite a lot of material (mostly not MM) staged after
  linux-next due to -next dependencies. I'll send those across next week
  as the preprequisites get merged up"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (135 commits)
  mm/page_io.c: annotate refault stalls from swap_readpage
  mm/Kconfig: fix trivial help text punctuation
  mm/Kconfig: fix indentation
  mm/memory_hotplug.c: remove __online_page_set_limits()
  mm: fix typos in comments when calling __SetPageUptodate()
  mm: fix struct member name in function comments
  mm/shmem.c: cast the type of unmap_start to u64
  mm: shmem: use proper gfp flags for shmem_writepage()
  mm/shmem.c: make array 'values' static const, makes object smaller
  userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
  fs/userfaultfd.c: wp: clear VM_UFFD_MISSING or VM_UFFD_WP during userfaultfd_register()
  userfaultfd: wrap the common dst_vma check into an inlined function
  userfaultfd: remove unnecessary WARN_ON() in __mcopy_atomic_hugetlb()
  userfaultfd: use vma_pagesize for all huge page size calculation
  mm/madvise.c: use PAGE_ALIGN[ED] for range checking
  mm/madvise.c: replace with page_size() in madvise_inject_error()
  mm/mmap.c: make vma_merge() comment more easy to understand
  mm/hwpoison-inject: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fops
  autonuma: reduce cache footprint when scanning page tables
  autonuma: fix watermark checking in migrate_balanced_pgdat()
  ...
2019-12-01 20:36:41 -08:00
Linus Torvalds
ceb3074745 y2038: syscall implementation cleanups
This is a series of cleanups for the y2038 work, mostly intended
 for namespace cleaning: the kernel defines the traditional
 time_t, timeval and timespec types that often lead to y2038-unsafe
 code. Even though the unsafe usage is mostly gone from the kernel,
 having the types and associated functions around means that we
 can still grow new users, and that we may be missing conversions
 to safe types that actually matter.
 
 There are still a number of driver specific patches needed to
 get the last users of these types removed, those have been
 submitted to the respective maintainers.
 
 Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJd3D+wAAoJEJpsee/mABjZfdcQAJvl6e+4ddKoDMIVJqVCE25N
 meFRgA7S8jy6BefEVeUgI8TxK+amGO36szMBUEnZxSSxq9u+gd13m5bEK6Xq/ov7
 4KTAiA3Irm/W5FBTktu1zc5ROIra1Xj7jLdubf8wEC3viSXIXB3+68Y28iBN7D2O
 k9kSpwINC5lWeC8guZy2I+2yc4ywUEXao9nVh8C/J+FQtU02TcdLtZop9OhpAa8u
 U19VVH3WHkQI7ZfLvBTUiYK6tlYTiYCnpr8l6sm850CnVv1fzBW+DzmVhPJ6FdFd
 4m5staC0sQ6gVqtjVMBOtT5CdzREse6hpwbKo2GRWFroO5W9tljMOJJXHvv/f6kz
 DxrpUmj37JuRbqAbr8KDmQqPo6M2CRkxFxjol1yh5ER63u1xMwLm/PQITZIMDvPO
 jrFc2C2SdM2E9bKP/RMCVoKSoRwxCJ5IwJ2AF237rrU0sx/zB2xsrOGssx5CWEgc
 3bbk6tDQujJJubnCfgRy1tTxpLZOHEEKw8YhFLLbR2LCtA9pA/0rfLLad16cjA5e
 5jIHxfsFc23zgpzrJeB7kAF/9xgu1tlA5BotOs3VBE89LtWOA9nK5dbPXng6qlUe
 er3xLCfS38ovhUw6DusQpaYLuaYuLM7DKO4iav9kuTMcY9GkbPk7vDD3KPGh2goy
 hY5cSM8+kT1q/THLnUBH
 =Bdbv
 -----END PGP SIGNATURE-----

Merge tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull y2038 cleanups from Arnd Bergmann:
 "y2038 syscall implementation cleanups

  This is a series of cleanups for the y2038 work, mostly intended for
  namespace cleaning: the kernel defines the traditional time_t, timeval
  and timespec types that often lead to y2038-unsafe code. Even though
  the unsafe usage is mostly gone from the kernel, having the types and
  associated functions around means that we can still grow new users,
  and that we may be missing conversions to safe types that actually
  matter.

  There are still a number of driver specific patches needed to get the
  last users of these types removed, those have been submitted to the
  respective maintainers"

Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/

* tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits)
  y2038: alarm: fix half-second cut-off
  y2038: ipc: fix x32 ABI breakage
  y2038: fix typo in powerpc vdso "LOPART"
  y2038: allow disabling time32 system calls
  y2038: itimer: change implementation to timespec64
  y2038: move itimer reset into itimer.c
  y2038: use compat_{get,set}_itimer on alpha
  y2038: itimer: compat handling to itimer.c
  y2038: time: avoid timespec usage in settimeofday()
  y2038: timerfd: Use timespec64 internally
  y2038: elfcore: Use __kernel_old_timeval for process times
  y2038: make ns_to_compat_timeval use __kernel_old_timeval
  y2038: socket: use __kernel_old_timespec instead of timespec
  y2038: socket: remove timespec reference in timestamping
  y2038: syscalls: change remaining timeval to __kernel_old_timeval
  y2038: rusage: use __kernel_old_timeval
  y2038: uapi: change __kernel_time_t to __kernel_old_time_t
  y2038: stat: avoid 'time_t' in 'struct stat'
  y2038: ipc: remove __kernel_time_t reference from headers
  y2038: vdso: powerpc: avoid timespec references
  ...
2019-12-01 14:00:59 -08:00
Linus Torvalds
ad0b314e00 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull sysctl system call removal from Eric Biederman:
 "As far as I can tell we have reached the point where no one enables
  the sysctl system call anymore. It still is enabled in a few
  defconfigs but they are mostly the rarely used one and in asking
  people about that it was more cut & paste enabled than anything else.

  This is single commit that just deletes code. Leaving just enough code
  so that the deprecated sysctl warning continues to be printed. If my
  analysis turns out to be wrong and someone actually cares it will be
  easy to revert this commit and have the system call again.

  There was one new xtensa defconfig in linux-next that enabled the
  system call this cycle and when asked about it the maintainer of the
  code replied that it was not enabled on purpose. As of today's
  linux-next tree that defconfig no longer enables the system call.

  What we saw in the review discussion was that if we go a step farther
  than my patch and mess with uapi headers there are pieces of code that
  won't compile, but nothing minds the system call actually disappearing
  from the kernel"

Link: https://lore.kernel.org/lkml/201910011140.EA0181F13@keescook/

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sysctl: Remove the sysctl system call
2019-12-01 13:26:18 -08:00
Johannes Weiner
204cb79ad4 kernel: sysctl: make drop_caches write-only
Currently, the drop_caches proc file and sysctl read back the last value
written, suggesting this is somehow a stateful setting instead of a
one-time command.  Make it write-only, like e.g.  compact_memory.

While mitigating a VM problem at scale in our fleet, there was confusion
about whether writing to this file will permanently switch the kernel into
a non-caching mode.  This influences the decision making in a tense
situation, where tens of people are trying to fix tens of thousands of
affected machines: Do we need a rollback strategy?  What are the
performance implications of operating in a non-caching state for several
days?  It also caused confusion when the kernel team said we may need to
write the file several times to make sure it's effective ("But it already
reads back 3?").

Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:07 -08:00
Daniel Axtens
eafb149ed7 fork: support VMAP_STACK with KASAN_VMALLOC
Supporting VMAP_STACK with KASAN_VMALLOC is straightforward:

 - clear the shadow region of vmapped stacks when swapping them in
 - tweak Kconfig to allow VMAP_STACK to be turned on with KASAN

Link: http://lkml.kernel.org/r/20191031093909.9228-4-dja@axtens.net
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:05 -08:00
Gaowei Pu
ff68dac6d6 mm/mmap.c: use IS_ERR_VALUE to check return value of get_unmapped_area
get_unmapped_area() returns an address or -errno on failure.  Historically
we have checked for the failure by offset_in_page() which is correct but
quite hard to read.  Newer code started using IS_ERR_VALUE which is much
easier to read.  Convert remaining users of offset_in_page as well.

[mhocko@suse.com: rewrite changelog]
[mhocko@kernel.org: fix mremap.c and uprobes.c sites also]
Link: http://lkml.kernel.org/r/20191012102512.28051-1-pugaowei@gmail.com
Signed-off-by: Gaowei Pu <pugaowei@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:19 -08:00
Linus Torvalds
b94ae8ad9f seccomp updates for v5.5
- implement SECCOMP_USER_NOTIF_FLAG_CONTINUE (Christian Brauner)
 - fixes to selftests (Christian Brauner)
 - remove secure_computing() argument (Christian Brauner)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl3dT/kWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJg7eD/9PFh0xAgk7swWIOnkv/Ckj6pqR
 lcnVaugsap2sp99P+QxVPoqKoBsHF/OZ96OqJcokljdWO77ElBMG4Xxgjho/mPPU
 Yzhsd9/Q0j4zYIe/Gy+4LxZ+wSudBxv7ls4l86fst1GWg880VkLk32/1N0BUjFAp
 uyBBaEuDoXcnkru8ojKH1xgp0Cd1KjyO1KEAQdkSt2GROo3nhROh9955Hrrxuanr
 0sjWLYe8E8P3hPugRI/3WRZu4VqdIn47pm+/UMPwGpC80kI+mSL1jtidszqC022w
 u0H5yoedEhZCan7uHWtEY1TXfwgktUKMZOzMP8LSoZ9cNPAFyKXsFqN7Jzf/1Edr
 9Zsc+9gc3lfBr6YYBSHUC4XYGzZ2fy0itK/yRTvZdUGO/XETrE61fR/wyVjQttRS
 OR1tAtmd9/3iZqe1jh1l3Rw4bJh1w/hS768sWpp8qAMunCGF5gQvFdqGFAxjIS5c
 Ddd0gjxK/NV72+iUzCSL0qUXcYjNYPT4cUapywBuQ4H1i4hl5EM3nGyCbLFbpqkp
 L2fzeAdRGSZIzZ35emTWhvSLZ36Ty64zEViNbAOP9o/+j6/SR5TjL1aNDkz69Eca
 GM1XiDeg4AoamtPR38+DzS+EnzBWfOD6ujsKNFgjAJbVIaa414Vql9utrq7fSvf2
 OIJjAD8PZKN93t1qaw==
 =igQG
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:
 "Mostly this is implementing the new flag SECCOMP_USER_NOTIF_FLAG_CONTINUE,
  but there are cleanups as well.

   - implement SECCOMP_USER_NOTIF_FLAG_CONTINUE (Christian Brauner)

   - fixes to selftests (Christian Brauner)

   - remove secure_computing() argument (Christian Brauner)"

* tag 'seccomp-v5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: rework define for SECCOMP_USER_NOTIF_FLAG_CONTINUE
  seccomp: fix SECCOMP_USER_NOTIF_FLAG_CONTINUE test
  seccomp: simplify secure_computing()
  seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE
  seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
  seccomp: avoid overflow in implicit constant conversion
2019-11-30 17:23:16 -08:00
Linus Torvalds
3b805ca177 audit/stable-5.5 PR 20191126
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl3dbM4UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMagw/+MiOlIHFykjK0NGOyUbXR7AwjdrHz
 1Fgkh7VZCAHMfCcdlljIIkYe7P+ybfdK51E1QLBiTPGh353JJRAvrjFbDcIyT4kf
 u7AVddUeT0QQefFs39ZFWTV0mvJjDBfjFkmiL3cdY9ulx4ZX8V426qjyl8KIwTHe
 YkQF3pYMmO28G1SfZu298zTmrFRA10FezxCUbBRZTTE9FcD7mnGQRB7w/wY9t0H1
 ebIDgDA0EBh3oznGD3qxD63b5ULdSrImTzvSmEfhzZZsoZB4XGO5MOnLRqgBwFbT
 qfRSRfHVl+1JtDHCU43zOUlzScAsff5mvfVNdDMLAFtGGSjDGxgxKNyLwp8+5wmH
 GzvB99QQECvCN+gbaedm6adWBGzi7vpoCcgfqY0UPLYvCqsNFbZw4U/iu5ONKWy/
 cEGVUGzHf0pWofugDIJfGQuRt6iS2XT9Ode6+QMx8OiC3auZcluhTuJSxMQIhFZ1
 5XmoHOQddBwlalmIx8fsIHSo6xsAjNFwTOikEFVzUB+ECR2Urs8eHvQj+d94xz6e
 q9LrNkt/eVIDiI+PeP+UBD5IJlfmRSoyUd/mIDFMfmqMBubBSQl70Eyt3dUdt1Bz
 0PBs6xjYztpgk3Q7s35TMn8EvDcEksG0WEoFM5fEohYPLWQf8tJ5BbyYJJqc0sod
 ExlXu1lPfg9ppKc=
 =fc7R
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20191126' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Audit is back for v5.5, albeit with only two patches:

   - Allow for the auditing of suspicious O_CREAT usage via the new
     AUDIT_ANOM_CREAT record.

   - Remove a redundant if-conditional check found during code analysis.
     It's a minor change, but when the pull request is only two patches
     long, you need filler in the pull request email"

[ Heh on the pull request filler. I wish more people tried to write
  better pull request messages, even if maybe it's not worth it for the
  trivial cases ;^)   - Linus ]

* tag 'audit-pr-20191126' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: remove redundant condition check in kauditd_thread()
  audit: Report suspicious O_CREAT usage
2019-11-30 17:01:48 -08:00
Linus Torvalds
8a99117f6e kgdb patches for 5.5-rc1
The major change here is the work from Douglas Anderson that
 reworks the way kdb stack traces are handled on SMP systems.
 The effect is to allow all CPUs to issue their stack trace which
 reduced the need for architecture specific code to support stack
 tracing.
 
 Also included are general of clean ups from Doug and myself:
 
  * Remove some unused variables or arguments.
  * Tidy up the kdb escape handling code and fix a couple of odd
    corner cases.
  - Better ignore escape characters that do not form part of an
    escape sequence. This mostly benefits vi users since they are most
    likely to press escape as a nervous habit but it won't harm anyone
    else.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl3dPzwACgkQfOMlXTn3
 iKG2sxAAkGTTmKKlu8cAEILD7ONXM3kB0lfsTxJ2aBdrFhkZxOmVIO5fAaTxLRh5
 bmznv1bzA6FulSxS2d0aGa8Oh3QE8z7fV2fngsW409ikUf3uu43K13R2yQGnOdZY
 n+dMR+C/H8LWvmUDK1rZtNf91uhmD+DNxpoI6U7H4mIVMC1RRP8XtMyf3m9qRbJE
 Bud0JAdOHB5eSH9a/97elRIhIUCWUSkeFG950RIMT08kdsyIAaobg+4NmlmTZsl2
 zVmXaIftBjiDAkEDtk/7p9N+3U42e0aWA2YSxq4lYgNfgsbJTGP8GskNTOG+egOJ
 N03xHqHR7NhzkCKjneocEba95uKct7t50+epC6nAT8GF4COV6aLNUcm+vNhcVmLI
 kbJO0ZcWp+iBr0O5GO53ZaGEoD3GAT7l3tDGqXkcJN1OGc6gjiEih8FRFoMa6cIJ
 GdqziWsooOlHgGgu9lsRL1a0pvrFJFkd9ha7XEKWIq8CEiHmKSbhPJF3SyaX2XJA
 NTrthitANWGWC4EIapV+jhSZ/8tOKfT5ehCvFEtnouKJ0pHFyynDJaveUJ3561Bl
 qr7noViXcIidDgceagGSZz7fQxBZeG3MNL1D5YIcpE3lfEHKSl5FivQ5kBmq79at
 1svw1OmocrvFuUhxhkj1Yo2R7Q6k97IYeX8v1q7DBkFaXS7Lv6E=
 =CZKd
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "The major change here is the work from Douglas Anderson that reworks
  the way kdb stack traces are handled on SMP systems. The effect is to
  allow all CPUs to issue their stack trace which reduced the need for
  architecture specific code to support stack tracing.

  Also included are general of clean ups from Doug and myself:

   - Remove some unused variables or arguments.

   - Tidy up the kdb escape handling code and fix a couple of odd corner
     cases.

   - Better ignore escape characters that do not form part of an escape
     sequence. This mostly benefits vi users since they are most likely
     to press escape as a nervous habit but it won't harm anyone else"

* tag 'kgdb-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Tweak escape handling for vi users
  kdb: Improve handling of characters from different input sources
  kdb: Remove special case logic from kdb_read()
  kdb: Simplify code to fetch characters from console
  kdb: Tidy up code to handle escape sequences
  kdb: Avoid array subscript warnings on non-SMP builds
  kdb: Fix stack crawling on 'running' CPUs that aren't the master
  kdb: Fix "btc <cpu>" crash if the CPU didn't round up
  kdb: Remove unused "argcount" param from kdb_bt1(); make btaprompt bool
  kgdb: Remove unused DCPU_SSTEP definition
2019-11-30 16:41:55 -08:00
Linus Torvalds
738d5fabff Merge branch 'parisc-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
 "Just trivial small updates: An assembler register optimization in the
  inlined networking checksum functions, a compiler warning fix and
  don't unneccesary print a runtime warning on machines which wouldn't
  be affected anyway"

* 'parisc-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Avoid spurious inequivalent alias kernel error messages
  kexec: Fix pointer-to-int-cast warnings
  parisc: Do not hardcode registers in checksum functions
2019-11-30 14:45:32 -08:00
Linus Torvalds
6a965666b7 Pipework for general notification queue
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3O0OoACgkQ+7dXa6fL
 C2tAwA//VH9Y81azemXFdflDF90sSH3TCASlKHVYHbBNAkH/QP5F00G4BEM4nNqH
 F3x7qcU9vzfGdumF1pc90Yt6XSYlsQEGF+xMyMw/VS2wKs40yv+b/doVbzOWbN9C
 NfrklgHeuuBk+JzU2llDisVqKRTLt4SmDpYu1ZdcchUQFZCCl3BpgdSEC+xXrHay
 +KlRPVNMSd2kXMCDuSWrr71lVNdCTdf3nNC5p1i780+VrgpIBIG/jmiNdCcd7PLH
 1aesPlr8UZY3+bmRtqe587fVRAhT2qA2xibKtyf9R0hrDtUKR4NSnpPmaeIjb26e
 LhVntcChhYxQqzy/T4ScTDNVjpSlwi6QMo5DwAwzNGf2nf/v5/CZ+vGYDVdXRFHj
 tgH1+8eDpHsi7jJp6E4cmZjiolsUx/ePDDTrQ4qbdDMO7fmIV6YQKFAMTLJepLBY
 qnJVqoBq3qn40zv6tVZmKgWiXQ65jEkBItZhEUmcQRBiSbBDPweIdEzx/mwzkX7U
 1gShGdut6YP4GX7BnOhkiQmzucS85mgkUfG43+mBfYXb+4zNTEjhhkqhEduz2SQP
 xnjHxEM+MTGCj3PozIpJxNKzMTEceYY7cAUdNEMDQcHog7OCnIdGBIc7BPnsN8yA
 CPzntwP4mmLfK3weq3PIGC6d9xfc9PpmiR9docxQOvE6sk2Ifeo=
 =FKC7
 -----END PGP SIGNATURE-----

Merge tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull pipe rework from David Howells:
 "This is my set of preparatory patches for building a general
  notification queue on top of pipes. It makes a number of significant
  changes:

   - It removes the nr_exclusive argument from __wake_up_sync_key() as
     this is always 1. This prepares for the next step:

   - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
     woken up from a function that's holding the poll waitqueue
     spinlock.

   - Change the pipe buffer ring to be managed in terms of unbounded
     head and tail indices rather than bounded index and length. This
     means that reading the pipe only needs to modify one index, not
     two.

   - A selection of helper functions are provided to query the state of
     the pipe buffer, plus a couple to apply updates to the pipe
     indices.

   - The pipe ring is allowed to have kernel-reserved slots. This allows
     many notification messages to be spliced in by the kernel without
     allowing userspace to pin too many pages if it writes to the same
     pipe.

   - Advance the head and tail indices inside the pipe waitqueue lock
     and use wake_up_interruptible_sync_poll_locked() to poke poll
     without having to take the lock twice.

   - Rearrange pipe_write() to preallocate the buffer it is going to
     write into and then drop the spinlock. This allows kernel
     notifications to then be added the ring whilst it is filling the
     buffer it allocated. The read side is stalled because the pipe
     mutex is still held.

   - Don't wake up readers on a pipe if there was already data in it
     when we added more.

   - Don't wake up writers on a pipe if the ring wasn't full before we
     removed a buffer"

* tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  pipe: Remove sync on wake_ups
  pipe: Increase the writer-wakeup threshold to reduce context-switch count
  pipe: Check for ring full inside of the spinlock in pipe_write()
  pipe: Remove redundant wakeup from pipe_write()
  pipe: Rearrange sequence in pipe_write() to preallocate slot
  pipe: Conditionalise wakeup in pipe_read()
  pipe: Advance tail pointer inside of wait spinlock in pipe_read()
  pipe: Allow pipes to have kernel-reserved slots
  pipe: Use head and tail pointers for the ring, not cursor and length
  Add wake_up_interruptible_sync_poll_locked()
  Remove the nr_exclusive argument from __wake_up_sync_key()
  pipe: Reduce #inclusion of pipe_fs_i.h
2019-11-30 14:12:13 -08:00
Linus Torvalds
aa32f11691 hmm related patches for 5.5
This is another round of bug fixing and cleanup. This time the focus is on
 the driver pattern to use mmu notifiers to monitor a VA range. This code
 is lifted out of many drivers and hmm_mirror directly into the
 mmu_notifier core and written using the best ideas from all the driver
 implementations.
 
 This removes many bugs from the drivers and has a very pleasing
 diffstat. More drivers can still be converted, but that is for another
 cycle.
 
 - A shared branch with RDMA reworking the RDMA ODP implementation
 
 - New mmu_interval_notifier API. This is focused on the use case of
   monitoring a VA and simplifies the process for drivers
 
 - A common seq-count locking scheme built into the mmu_interval_notifier
   API usable by drivers that call get_user_pages() or hmm_range_fault()
   with the VA range
 
 - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen GntDev
   drivers to the new API. This deletes a lot of wonky driver code.
 
 - Two improvements for hmm_range_fault(), from testing done by Ralph
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl3cCjQACgkQOG33FX4g
 mxpp8xAAiR9iOdT28m/tx1GF31XludrMhRZVIiz0vmCIxIiAkWekWEfAEVm9PDnh
 wdrxTJohSs+B65AK3sfToOM3AIuNCuFVWmbbHI5qmOO76vaSvcZa905Z++pNsawO
 Bn8mgRCprYoFHcxWLvTvnA5U0g1S2BSSOwBSZI43CbEnVvHjYAR6MnvRqfGMk+NF
 bf8fTk/x+fl0DCemhynlBLuJkogzoE2Hgl0yPY5bFna4PktOxdpa1yPaQsiqZ7e6
 2s2NtM3pbMBJk0W42q5BU+aPhiqfxFFszasPSLBduXrD2xDsG76HJdHj5VydKmfL
 nelG4BvqJozXTEZWvTEePYhCqaZ41eJZ7Asw8BXtmacVqE5mDlTXo/Zdgbz7yEOR
 mI5MVyjD5rauZJldUOWXbwrPoWVFRvboauehiSgqvxvT9HvlFp9GKObSuu4gubBQ
 mzxs4t48tPhA7bswLmw0/pETSogFuVDfaB7hsyY0gi8EwxMFMpw2qFypm1PEEF+C
 BuUxCSShzvNKrraNe5PWaNNFd3AzIwAOWJHE+poH4bCoXQVr5nA+rq2gnHkdY5vq
 /xrBCyxkf0U05YoFGYembPVCInMehzp9Xjy8V+SueSvCg2/TYwGDCgGfsbe9dNOP
 Bc40JpS7BDn5w9nyLUJmOx7jfruNV6kx1QslA7NDDrB/rzOlsEc=
 =Hj8a
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull hmm updates from Jason Gunthorpe:
 "This is another round of bug fixing and cleanup. This time the focus
  is on the driver pattern to use mmu notifiers to monitor a VA range.
  This code is lifted out of many drivers and hmm_mirror directly into
  the mmu_notifier core and written using the best ideas from all the
  driver implementations.

  This removes many bugs from the drivers and has a very pleasing
  diffstat. More drivers can still be converted, but that is for another
  cycle.

   - A shared branch with RDMA reworking the RDMA ODP implementation

   - New mmu_interval_notifier API. This is focused on the use case of
     monitoring a VA and simplifies the process for drivers

   - A common seq-count locking scheme built into the
     mmu_interval_notifier API usable by drivers that call
     get_user_pages() or hmm_range_fault() with the VA range

   - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
     GntDev drivers to the new API. This deletes a lot of wonky driver
     code.

   - Two improvements for hmm_range_fault(), from testing done by Ralph"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
  mm/hmm: make full use of walk_page_range()
  xen/gntdev: use mmu_interval_notifier_insert
  mm/hmm: remove hmm_mirror and related
  drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
  drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
  drm/amdgpu: Call find_vma under mmap_sem
  nouveau: use mmu_interval_notifier instead of hmm_mirror
  nouveau: use mmu_notifier directly for invalidate_range_start
  drm/radeon: use mmu_interval_notifier_insert
  RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
  RDMA/odp: Use mmu_interval_notifier_insert()
  mm/hmm: define the pre-processor related parts of hmm.h even if disabled
  mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
  mm/mmu_notifier: add an interval tree notifier
  mm/mmu_notifier: define the header pre-processor parts even if disabled
  mm/hmm: allow snapshot of the special zero page
2019-11-30 10:33:14 -08:00
Leonard Crestez
36a8015f89 PM / QoS: Restore DEV_PM_QOS_MIN/MAX_FREQUENCY
Support for adding per-device frequency limits was removed in
commit 2aac8bdf7a ("PM: QoS: Drop frequency QoS types from device PM QoS")
after cpufreq switched to use a new "freq_constraints" construct.

Restore support for per-device freq limits but base this upon
freq_constraints. This is primarily meant to be used by the devfreq
subsystem.

This removes the "static" marking on freq_qos_apply but does not export
it for modules.

Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Tested-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-29 12:04:50 +01:00
Zhenzhong Duan
c5105d764e sched/clock: Use static_branch_likely() with sched_clock_running
sched_clock_running is enabled early at bootup stage and never
disabled. So hint that to the compiler by using static_branch_likely()
rather than static_branch_unlikely().

The branch probability mis-annotation was introduced in the original
commit that converted the plain sched_clock_running flag to a static key:

  46457ea464 ("sched/clock: Use static key for sched_clock_running")

Steve further notes:

  | Looks like the confusion was the moving of the "!":
  |
  | -       if (unlikely(!sched_clock_running))
  | +       if (!static_branch_unlikely(&sched_clock_running))
  |
  | Where, it was unlikely that !sched_clock_running would be true, but
  | because the "!" was moved outside the "unlikely()" it makes the test
  | "likely()". That is, if we added an intermediate step, it would have
  | been:
  |
  |         if (!likely(sched_clock_running))
  |
  | which would have prevented the mistake that this patch fixes.

  [ mingo: Edited the changelog. ]

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: mgorman@suse.de
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/1574843848-26825-1-git-send-email-zhenzhong.duan@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-29 08:10:54 +01:00
Marco Elver
1a365e8223 locking/spinlock/debug: Fix various data races
This fixes various data races in spinlock_debug. By testing with KCSAN,
it is observable that the console gets spammed with data races reports,
suggesting these are extremely frequent.

Example data race report:

  read to 0xffff8ab24f403c48 of 4 bytes by task 221 on cpu 2:
   debug_spin_lock_before kernel/locking/spinlock_debug.c:85 [inline]
   do_raw_spin_lock+0x9b/0x210 kernel/locking/spinlock_debug.c:112
   __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
   _raw_spin_lock+0x39/0x40 kernel/locking/spinlock.c:151
   spin_lock include/linux/spinlock.h:338 [inline]
   get_partial_node.isra.0.part.0+0x32/0x2f0 mm/slub.c:1873
   get_partial_node mm/slub.c:1870 [inline]
  <snip>

  write to 0xffff8ab24f403c48 of 4 bytes by task 167 on cpu 3:
   debug_spin_unlock kernel/locking/spinlock_debug.c:103 [inline]
   do_raw_spin_unlock+0xc9/0x1a0 kernel/locking/spinlock_debug.c:138
   __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:159 [inline]
   _raw_spin_unlock_irqrestore+0x2d/0x50 kernel/locking/spinlock.c:191
   spin_unlock_irqrestore include/linux/spinlock.h:393 [inline]
   free_debug_processing+0x1b3/0x210 mm/slub.c:1214
   __slab_free+0x292/0x400 mm/slub.c:2864
  <snip>

As a side-effect, with KCSAN, this eventually locks up the console, most
likely due to deadlock, e.g. .. -> printk lock -> spinlock_debug ->
KCSAN detects data race -> kcsan_print_report() -> printk lock ->
deadlock.

This fix will 1) avoid the data races, and 2) allow using lock debugging
together with KCSAN.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Marco Elver <elver@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20191120155715.28089-1-elver@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-29 08:03:27 +01:00
Alexei Starovoitov
ce27709b81 bpf: Fix build in minimal configurations
Some kconfigs can have BPF enabled without a single valid program type.
In such configurations the build will fail with:
./kernel/bpf/btf.c:3466:1: error: empty enum is invalid

Fix it by adding unused value to the enum.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/bpf/20191128043508.2346723-1-ast@kernel.org
2019-11-29 01:03:42 +01:00
Linus Torvalds
81b6b96475 dma-mapping updates for 5.5-rc1
- improve dma-debug scalability (Eric Dumazet)
  - tiny dma-debug cleanup (Dan Carpenter)
  - check for vmap memory in dma_map_single (Kees Cook)
  - check for dma_addr_t overflows in dma-direct when using
    DMA offsets (Nicolas Saenz Julienne)
  - switch the x86 sta2x11 SOC to use more generic DMA code
    (Nicolas Saenz Julienne)
  - fix arm-nommu dma-ranges handling (Vladimir Murzin)
  - use __initdata in CMA (Shyam Saini)
  - replace the bus dma mask with a limit (Nicolas Saenz Julienne)
  - merge the remapping helpers into the main dma-direct flow (me)
  - switch xtensa to the generic dma remap handling (me)
  - various cleanups around dma_capable (me)
  - remove unused dev arguments to various dma-noncoherent helpers (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl3f+eULHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPyPg/+PVHCrhmepudQQFHu6wfurE5U77iNnoUifvG+b5z5
 5mHmTMkQwyox6rKDe8NuFApAhz1VJDSUgSelPmvTSOIEIGXCvX1p+GqRSVS5YQON
 aLzGvbWKE8hCpaPdDHKYDauD1FZGMM8L2P5oOMF9X9fQ94xxRqfqJM6c8iD16Sgg
 +aOgPNzTnxQHJFF/Dbt/mjJrKXWI+XF+bgUbH+l9yKa7Dd7ibmJR8yl9hs1jmp0H
 1CZ+CizwnAs57rCd1a6Ybc6gj59tySc03NMnnbTko+KDxrcbD3Ee2tpqHVkkCjYz
 Yl0m4FIpbotrpokL/FIS727bVvkjbWgoeM+kiVPoYzmZea3pq/tFDr6tp/BxDhFj
 TZXSFfgQljlYMD3ppSoklFlfjGriVWV0tPO3arPXwuuMF5EX/IMQmvxei05jpc8n
 iELNXOP9iZZkY4tLHy2hn2uWrxBRrS1WQwlLg9hahlNRzyfFSyHeP0zWlVDt+RgF
 5CCbEI+HQcUqg1FApB30lQNWTn1+dJftrpKVBlgNBIyIa/z2rFbt8GdSnItxjfQX
 /XX8EZbFvF6AcXkgURkYFIoKM/EbYShOSLcYA3PTUtcuTnF6Kk5eimySiGWZTVCS
 prruSFDZJOvL3SnOIMIiYVmBdB7lEbDyLI/VYuhoECXEDCJpVmRktNkJNg4q6/E+
 fjQ=
 =e5wO
 -----END PGP SIGNATURE-----

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux; tag 'dma-mapping-5.5' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - improve dma-debug scalability (Eric Dumazet)

 - tiny dma-debug cleanup (Dan Carpenter)

 - check for vmap memory in dma_map_single (Kees Cook)

 - check for dma_addr_t overflows in dma-direct when using DMA offsets
   (Nicolas Saenz Julienne)

 - switch the x86 sta2x11 SOC to use more generic DMA code (Nicolas
   Saenz Julienne)

 - fix arm-nommu dma-ranges handling (Vladimir Murzin)

 - use __initdata in CMA (Shyam Saini)

 - replace the bus dma mask with a limit (Nicolas Saenz Julienne)

 - merge the remapping helpers into the main dma-direct flow (me)

 - switch xtensa to the generic dma remap handling (me)

 - various cleanups around dma_capable (me)

 - remove unused dev arguments to various dma-noncoherent helpers (me)

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux:

* tag 'dma-mapping-5.5' of git://git.infradead.org/users/hch/dma-mapping: (22 commits)
  dma-mapping: treat dev->bus_dma_mask as a DMA limit
  dma-direct: exclude dma_direct_map_resource from the min_low_pfn check
  dma-direct: don't check swiotlb=force in dma_direct_map_resource
  dma-debug: clean up put_hash_bucket()
  powerpc: remove support for NULL dev in __phys_to_dma / __dma_to_phys
  dma-direct: avoid a forward declaration for phys_to_dma
  dma-direct: unify the dma_capable definitions
  dma-mapping: drop the dev argument to arch_sync_dma_for_*
  x86/PCI: sta2x11: use default DMA address translation
  dma-direct: check for overflows on 32 bit DMA addresses
  dma-debug: increase HASH_SIZE
  dma-debug: reorder struct dma_debug_entry fields
  xtensa: use the generic uncached segment support
  dma-mapping: merge the generic remapping helpers into dma-direct
  dma-direct: provide mmap and get_sgtable method overrides
  dma-direct: remove the dma_handle argument to __dma_direct_alloc_pages
  dma-direct: remove __dma_direct_free_pages
  usb: core: Remove redundant vmap checks
  kernel: dma-contiguous: mark CMA parameters __initdata/__initconst
  dma-debug: add a schedule point in debug_dma_dump_mappings()
  ...
2019-11-28 11:16:43 -08:00
Linus Torvalds
95f1fa9e34 New tracing features:
- PERAMAENT flag to ftrace_ops when attaching a callback to a function
    As /proc/sys/kernel/ftrace_enabled when set to zero will disable all
    attached callbacks in ftrace, this has a detrimental impact on live
    kernel tracing, as it disables all that it patched. If a ftrace_ops
    is registered to ftrace with the PERMANENT flag set, it will prevent
    ftrace_enabled from being disabled, and if ftrace_enabled is already
    disabled, it will prevent a ftrace_ops with PREMANENT flag set from
    being registered.
 
  - New register_ftrace_direct(). As eBPF would like to register its own
    trampolines to be called by the ftrace nop locations directly,
    without going through the ftrace trampoline, this function has been
    added. This allows for eBPF trampolines to live along side of
    ftrace, perf, kprobe and live patching. It also utilizes the ftrace
    enabled_functions file that keeps track of functions that have been
    modified in the kernel, to allow for security auditing.
 
  - Allow for kernel internal use of ftrace instances. Subsystems in
    the kernel can now create and destroy their own tracing instances
    which allows them to have their own tracing buffer, and be able
    to record events without worrying about other users from writing over
    their data.
 
  - New seq_buf_hex_dump() that lets users use the hex_dump() in their
    seq_buf usage.
 
  - Notifications now added to tracing_max_latency to allow user space
    to know when a new max latency is hit by one of the latency tracers.
 
  - Wider spread use of generic compare operations for use of bsearch and
    friends.
 
  - More synthetic event fields may be defined (32 up from 16)
 
  - Use of xarray for architectures with sparse system calls, for the
    system call trace events.
 
 This along with small clean ups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXdwv4BQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnB5AP91vsdHQjwE1+/UWG/cO+qFtKvn2QJK
 QmBRIJNH/s+1TAD/fAOhgw+ojSK3o/qc+NpvPTEW9AEwcJL1wacJUn+XbQc=
 =ztql
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "New tracing features:

   - New PERMANENT flag to ftrace_ops when attaching a callback to a
     function.

     As /proc/sys/kernel/ftrace_enabled when set to zero will disable
     all attached callbacks in ftrace, this has a detrimental impact on
     live kernel tracing, as it disables all that it patched. If a
     ftrace_ops is registered to ftrace with the PERMANENT flag set, it
     will prevent ftrace_enabled from being disabled, and if
     ftrace_enabled is already disabled, it will prevent a ftrace_ops
     with PREMANENT flag set from being registered.

   - New register_ftrace_direct().

     As eBPF would like to register its own trampolines to be called by
     the ftrace nop locations directly, without going through the ftrace
     trampoline, this function has been added. This allows for eBPF
     trampolines to live along side of ftrace, perf, kprobe and live
     patching. It also utilizes the ftrace enabled_functions file that
     keeps track of functions that have been modified in the kernel, to
     allow for security auditing.

   - Allow for kernel internal use of ftrace instances.

     Subsystems in the kernel can now create and destroy their own
     tracing instances which allows them to have their own tracing
     buffer, and be able to record events without worrying about other
     users from writing over their data.

   - New seq_buf_hex_dump() that lets users use the hex_dump() in their
     seq_buf usage.

   - Notifications now added to tracing_max_latency to allow user space
     to know when a new max latency is hit by one of the latency
     tracers.

   - Wider spread use of generic compare operations for use of bsearch
     and friends.

   - More synthetic event fields may be defined (32 up from 16)

   - Use of xarray for architectures with sparse system calls, for the
     system call trace events.

  This along with small clean ups and fixes"

* tag 'trace-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (51 commits)
  tracing: Enable syscall optimization for MIPS
  tracing: Use xarray for syscall trace events
  tracing: Sample module to demonstrate kernel access to Ftrace instances.
  tracing: Adding new functions for kernel access to Ftrace instances
  tracing: Fix Kconfig indentation
  ring-buffer: Fix typos in function ring_buffer_producer
  ftrace: Use BIT() macro
  ftrace: Return ENOTSUPP when DYNAMIC_FTRACE_WITH_DIRECT_CALLS is not configured
  ftrace: Rename ftrace_graph_stub to ftrace_stub_graph
  ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization
  ftrace: Add helper find_direct_entry() to consolidate code
  ftrace: Add another check for match in register_ftrace_direct()
  ftrace: Fix accounting bug with direct->count in register_ftrace_direct()
  ftrace/selftests: Fix spelling mistake "wakeing" -> "waking"
  tracing: Increase SYNTH_FIELDS_MAX for synthetic_events
  ftrace/samples: Add a sample module that implements modify_ftrace_direct()
  ftrace: Add modify_ftrace_direct()
  tracing: Add missing "inline" in stub function of latency_fsnotify()
  tracing: Remove stray tab in TRACE_EVAL_MAP_FILE's help text
  tracing: Use seq_buf_hex_dump() to dump buffers
  ...
2019-11-27 11:42:01 -08:00
Linus Torvalds
9a3d7fd275 Driver core patches for 5.5-rc1
Here is the "big" set of driver core patches for 5.5-rc1
 
 There's a few minor cleanups and fixes in here, but the majority of the
 patches in here fall into two buckets:
   - debugfs api cleanups and fixes
   - driver core device link support for boot dependancy issues
 
 The debugfs api cleanups are working to slowly refactor the debugfs apis
 so that it is even harder to use incorrectly.  That work has been
 happening for the past few kernel releases and will continue over time,
 it's a long-term project/goal
 
 The driver core device link support missed 5.4 by just a bit, so it's
 been sitting and baking for many months now.  It's from Saravana Kannan
 to help resolve the problems that DT-based systems have at boot time
 with dependancy graphs and kernel modules.  Turns out that no one has
 actually tried to build a generic arm64 kernel with loads of modules and
 have it "just work" for a variety of platforms (like a distro kernel)
 The big problem turned out to be a lack of depandancy information
 between different areas of DT entries, and the work here resolves that
 problem and now allows devices to boot properly, and quicker than a
 monolith kernel.
 
 All of these patches have been in linux-next for a long time with no
 reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXd6m6Q8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yntJQCcCqg6RQ7LTdHuZv1ETeefXlsfk00An1Jtean6
 42bWGx52bGFvAcpjWy8R
 =P7hq
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.5-rc1

  There's a few minor cleanups and fixes in here, but the majority of
  the patches in here fall into two buckets:

   - debugfs api cleanups and fixes

   - driver core device link support for boot dependancy issues

  The debugfs api cleanups are working to slowly refactor the debugfs
  apis so that it is even harder to use incorrectly. That work has been
  happening for the past few kernel releases and will continue over
  time, it's a long-term project/goal

  The driver core device link support missed 5.4 by just a bit, so it's
  been sitting and baking for many months now. It's from Saravana Kannan
  to help resolve the problems that DT-based systems have at boot time
  with dependancy graphs and kernel modules. Turns out that no one has
  actually tried to build a generic arm64 kernel with loads of modules
  and have it "just work" for a variety of platforms (like a distro
  kernel). The big problem turned out to be a lack of dependency
  information between different areas of DT entries, and the work here
  resolves that problem and now allows devices to boot properly, and
  quicker than a monolith kernel.

  All of these patches have been in linux-next for a long time with no
  reported issues"

* tag 'driver-core-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (68 commits)
  tracing: Remove unnecessary DEBUG_FS dependency
  of: property: Add device link support for interrupt-parent, dmas and -gpio(s)
  debugfs: Fix !DEBUG_FS debugfs_create_automount
  of: property: Add device link support for "iommu-map"
  of: property: Fix the semantics of of_is_ancestor_of()
  i2c: of: Populate fwnode in of_i2c_get_board_info()
  drivers: base: Fix Kconfig indentation
  firmware_loader: Fix labels with comma for builtin firmware
  driver core: Allow device link operations inside sync_state()
  driver core: platform: Declare ret variable only once
  cpu-topology: declare parse_acpi_topology in <linux/arch_topology.h>
  crypto: hisilicon: no need to check return value of debugfs_create functions
  driver core: platform: use the correct callback type for bus_find_device
  firmware_class: make firmware caching configurable
  driver core: Clarify documentation for fwnode_operations.add_links()
  mailbox: tegra: Fix superfluous IRQ error message
  net: caif: Fix debugfs on 64-bit platforms
  mac80211: Use debugfs_create_xul() helper
  media: c8sectpfe: no need to check return value of debugfs_create functions
  of: property: Add device link support for iommus, mboxes and io-channels
  ...
2019-11-27 11:06:20 -08:00
Masami Hiramatsu
f66c0447cc kprobes: Set unoptimized flag after unoptimizing code
Set the unoptimized flag after confirming the code is completely
unoptimized. Without this fix, when a kprobe hits the intermediate
modified instruction (the first byte is replaced by an INT3, but
later bytes can still be a jump address operand) while unoptimizing,
it can return to the middle byte of the modified code, which causes
an invalid instruction exception in the kernel.

Usually, this is a rare case, but if we put a probe on the function
call while text patching, it always causes a kernel panic as below:

 # echo p text_poke+5 > kprobe_events
 # echo 1 > events/kprobes/enable
 # echo 0 > events/kprobes/enable

invalid opcode: 0000 [#1] PREEMPT SMP PTI
 RIP: 0010:text_poke+0x9/0x50
 Call Trace:
  arch_unoptimize_kprobe+0x22/0x28
  arch_unoptimize_kprobes+0x39/0x87
  kprobe_optimizer+0x6e/0x290
  process_one_work+0x2a0/0x610
  worker_thread+0x28/0x3d0
  ? process_one_work+0x610/0x610
  kthread+0x10d/0x130
  ? kthread_park+0x80/0x80
  ret_from_fork+0x3a/0x50

text_poke() is used for patching the code in optprobes.

This can happen even if we blacklist text_poke() and other functions,
because there is a small time window during which we show the intermediate
code to other CPUs.

 [ mingo: Edited the changelog. ]

Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Fixes: 6274de4984 ("kprobes: Support delayed unoptimizing")
Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-27 07:44:25 +01:00
Peter Zijlstra
04ae87a520 ftrace: Rework event_create_dir()
Rework event_create_dir() to use an array of static data instead of
function pointers where possible.

The problem is that it would call the function pointer on module load
before parse_args(), possibly even before jump_labels were initialized.
Luckily the generated functions don't use jump_labels but it still seems
fragile. It also gets in the way of changing when we make the module map
executable.

The generated function are basically calling trace_define_field() with a
bunch of static arguments. So instead of a function, capture these
arguments in a static array, avoiding the function call.

Now there are a number of cases where the fields are dynamic (syscall
arguments, kprobes and uprobes), in which case a static array does not
work, for these we preserve the function call. Luckily all these cases
are not related to modules and so we can retain the function call for
them.

Also fix up all broken tracepoint definitions that now generate a
compile error.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132458.342979914@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-27 07:44:25 +01:00
Peter Zijlstra
958de66819 module: Remove set_all_modules_text_*()
Now that there are no users of set_all_modules_text_*() left, remove
it.

While it appears nds32 uses it, it does not have STRICT_MODULE_RWX and
therefore ends up with the NOP stubs.

Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Chen <deanbo422@gmail.com>
Link: https://lkml.kernel.org/r/20191111132458.284298307@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-27 07:44:25 +01:00
Linus Torvalds
9e7a03233e Power management updates for 5.5-rc1
- Use nanoseconds (instead of microseconds) as the unit of time in
    the cpuidle core and simplify checks for disabled idle states in
    the idle loop (Rafael Wysocki).
 
  - Fix and clean up the teo cpuidle governor (Rafael Wysocki).
 
  - Fix the cpuidle registration error code path (Zhenzhong Duan).
 
  - Avoid excessive vmexits in the ACPI cpuidle driver (Yin Fengwei).
 
  - Extend the idle injection infrastructure to be able to measure the
    requested duration in nanoseconds and to allow an exit latency
    limit for idle states to be specified (Daniel Lezcano).
 
  - Fix cpufreq driver registration and clarify a comment in the
    cpufreq core (Viresh Kumar).
 
  - Add NULL checks to the show() and store() methods of sysfs
    attributes exposed by cpufreq (Kai Shen).
 
  - Update cpufreq drivers:
 
    * Fix for a plain int as pointer warning from sparse in
      intel_pstate (Jamal Shareef).
 
    * Fix for a hardcoded number of CPUs and stack bloat in the
      powernv driver (John Hubbard).
 
    * Updates to the ti-cpufreq driver and DT files to support new
      platforms and migrate bindings from opp-v1 to opp-v2 (Adam Ford,
      H. Nikolaus Schaller).
 
    * Merging of the arm_big_little and vexpress-spc drivers and
      related cleanup (Sudeep Holla).
 
    * Fix for imx's default speed grade value (Anson Huang).
 
    * Minor cleanup of the s3c64xx driver (Nathan Chancellor).
 
    * CPU speed bin detection fix for sun50i (Ondrej Jirman).
 
  - Appoint Chanwoo Choi as the new devfreq maintainer.
 
  - Update the devfreq core:
 
    * Check NULL governor in available_governors_show sysfs to prevent
      showing wrong governor information and fix a race condition
      between devfreq_update_status() and trans_stat_show() (Leonard
      Crestez).
 
    * Add new 'interrupt-driven' flag for devfreq governors to allow
      interrupt-driven governors to prevent the devfreq core from
      polling devices for status (Dmitry Osipenko).
 
    * Improve an error message in devfreq_add_device() (Matthias
      Kaehlcke).
 
  - Update devfreq drivers:
 
    * tegra30 driver fixes and cleanups (Dmitry Osipenko).
 
    * Removal of unused property from dt-binding documentation for
      the exynos-bus driver (Kamil Konieczny).
 
    * exynos-ppmu cleanup and DT bindings update (Lukasz Luba, Marek
      Szyprowski).
 
  - Add new CPU IDs for CometLake Mobile and Desktop to the Intel RAPL
    power capping driver (Zhang Rui).
 
  - Allow device initialization in the generic power domains (genpd)
    framework to be more straightforward and clean it up (Ulf Hansson).
 
  - Add support for adjusting OPP voltages at run time to the OPP
    framework (Stephen Boyd).
 
  - Avoid freeing memory that has never been allocated in the
    hibernation core (Andy Whitcroft).
 
  - Clean up function headers in a header file and coding style in the
    wakeup IRQs handling code (Ulf Hansson, Xiaofei Tan).
 
  - Clean up the SmartReflex adaptive voltage scaling (AVS) driver for
    ARM (Ben Dooks, Geert Uytterhoeven).
 
  - Wrap power management documentation to fit in 80 columns (Bjorn
    Helgaas).
 
  - Add pm-graph utility entry to MAINTAINERS (Todd Brandt).
 
  - Update the cpupower utility:
 
    * Fix the handling of set and info subcommands (Abhishek Goel).
 
    * Fix build warnings (Nathan Chancellor).
 
    * Improve mperf_monitor handling (Janakarajan Natarajan).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl3dHGYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxMcgP/1bMSkxlRHFOXYSRwS4YcvkUjlBHrCSi
 3qGRyYwhc+eRLqRc+2tcmQeQEeQRBqUt8etp7/9WxqS3nic/3Vdf6AFuhSpmJzo1
 6JTEutHMU5eP8lwQuKoUCJncCNdIfEOkd5T35E12W/ar5PwyJio0UByZJBnJBjD/
 p7/713ucq6ZH95OGncmCJ1S1UslFCZrSS2RRigDInu8gpEssnwN9zwaJbzUYrZHj
 BmnKpBpT8FdLmkpbOtmmiT7q2ZGpUEHhkaO916Knf/+BFdvydTXoR90FVvXKy8Zr
 QpOxaTdQB2ADifUa5zs8klVP6otmZhEO9vz8hVMUWGziqagObykQngzl8tqrKEBh
 hLI8eEG1IkEBCv5ThQbLcoaRXNpwriXXfvWPTPB8s84HJxNZ09F6pXsv1SLh96qC
 lj8Q5Yy2a3tlpsg4LB58XoJ54gOtlh8bWKkM0FytrFI/IP+HT4TUu/Rxgp1nDbGd
 tKzLvpn4Yo2h10seeDbYk3l79mogUYj50RmwjjPn+9RwS/Df4eIpNb6ibllGZUN/
 zcPZH5xlVfQRl2LKDufVN0nYSnoMZY/fU05p9XbUiJWd80LHYOb4Em1N6h/FNOyl
 alDhVwlxEvc2BQwL/gjYmN6Qxc7SsPTBrSGVwjWYY+FghOYQd/wBDQqQUeM21QKg
 ChOE3z/F/26r
 =GJvT
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These include cpuidle changes to use nanoseconds (instead of
  microseconds) as the unit of time and to simplify checks for disabled
  idle states in the idle loop, some cpuidle fixes and governor updates,
  assorted cpufreq updates (driver updates mostly and a few core fixes
  and cleanups), devfreq updates (dominated by the tegra30 driver
  changes), new CPU IDs for the RAPL power capping driver, relatively
  minor updates of the generic power domains (genpd) and operation
  performance points (OPP) frameworks, and assorted fixes and cleanups.

  There are also two maintainer information updates: Chanwoo Choi will
  be maintaining the devfreq subsystem going forward and Todd Brandt is
  going to maintain the pm-graph utility (created by him).

  Specifics:

   - Use nanoseconds (instead of microseconds) as the unit of time in
     the cpuidle core and simplify checks for disabled idle states in
     the idle loop (Rafael Wysocki)

   - Fix and clean up the teo cpuidle governor (Rafael Wysocki)

   - Fix the cpuidle registration error code path (Zhenzhong Duan)

   - Avoid excessive vmexits in the ACPI cpuidle driver (Yin Fengwei)

   - Extend the idle injection infrastructure to be able to measure the
     requested duration in nanoseconds and to allow an exit latency
     limit for idle states to be specified (Daniel Lezcano)

   - Fix cpufreq driver registration and clarify a comment in the
     cpufreq core (Viresh Kumar)

   - Add NULL checks to the show() and store() methods of sysfs
     attributes exposed by cpufreq (Kai Shen)

   - Update cpufreq drivers:
      * Fix for a plain int as pointer warning from sparse in
        intel_pstate (Jamal Shareef)
      * Fix for a hardcoded number of CPUs and stack bloat in the
        powernv driver (John Hubbard)
      * Updates to the ti-cpufreq driver and DT files to support new
        platforms and migrate bindings from opp-v1 to opp-v2 (Adam Ford,
        H. Nikolaus Schaller)
      * Merging of the arm_big_little and vexpress-spc drivers and
        related cleanup (Sudeep Holla)
      * Fix for imx's default speed grade value (Anson Huang)
      * Minor cleanup of the s3c64xx driver (Nathan Chancellor)
      * CPU speed bin detection fix for sun50i (Ondrej Jirman)

   - Appoint Chanwoo Choi as the new devfreq maintainer.

   - Update the devfreq core:
      * Check NULL governor in available_governors_show sysfs to prevent
        showing wrong governor information and fix a race condition
        between devfreq_update_status() and trans_stat_show() (Leonard
        Crestez)
      * Add new 'interrupt-driven' flag for devfreq governors to allow
        interrupt-driven governors to prevent the devfreq core from
        polling devices for status (Dmitry Osipenko)
      * Improve an error message in devfreq_add_device() (Matthias
        Kaehlcke)

   - Update devfreq drivers:
      * tegra30 driver fixes and cleanups (Dmitry Osipenko)
      * Removal of unused property from dt-binding documentation for the
        exynos-bus driver (Kamil Konieczny)
      * exynos-ppmu cleanup and DT bindings update (Lukasz Luba, Marek
        Szyprowski)

   - Add new CPU IDs for CometLake Mobile and Desktop to the Intel RAPL
     power capping driver (Zhang Rui)

   - Allow device initialization in the generic power domains (genpd)
     framework to be more straightforward and clean it up (Ulf Hansson)

   - Add support for adjusting OPP voltages at run time to the OPP
     framework (Stephen Boyd)

   - Avoid freeing memory that has never been allocated in the
     hibernation core (Andy Whitcroft)

   - Clean up function headers in a header file and coding style in the
     wakeup IRQs handling code (Ulf Hansson, Xiaofei Tan)

   - Clean up the SmartReflex adaptive voltage scaling (AVS) driver for
     ARM (Ben Dooks, Geert Uytterhoeven)

   - Wrap power management documentation to fit in 80 columns (Bjorn
     Helgaas)

   - Add pm-graph utility entry to MAINTAINERS (Todd Brandt)

   - Update the cpupower utility:
      * Fix the handling of set and info subcommands (Abhishek Goel)
      * Fix build warnings (Nathan Chancellor)
      * Improve mperf_monitor handling (Janakarajan Natarajan)"

* tag 'pm-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (83 commits)
  PM: Wrap documentation to fit in 80 columns
  cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()
  cpuidle: Allow idle injection to apply exit latency limit
  cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks
  cpuidle: teo: Avoid code duplication in conditionals
  cpufreq: Register drivers only after CPU devices have been registered
  cpuidle: teo: Avoid using "early hits" incorrectly
  cpuidle: teo: Exclude cpuidle overhead from computations
  PM / Domains: Convert to dev_to_genpd_safe() in genpd_syscore_switch()
  mmc: tmio: Avoid boilerplate code in ->runtime_suspend()
  PM / Domains: Implement the ->start() callback for genpd
  PM / Domains: Introduce dev_pm_domain_start()
  ARM: OMAP2+: SmartReflex: add omap_sr_pdata definition
  PM / wakeirq: remove unnecessary parentheses
  power: avs: smartreflex: Remove superfluous cast in debugfs_create_file() call
  cpuidle: Use nanoseconds as the unit of time
  PM / OPP: Support adjusting OPP voltages at runtime
  PM / core: Clean up some function headers in power.h
  cpufreq: Add NULL checks to show() and store() methods of cpufreq
  cpufreq: intel_pstate: Fix plain int as pointer warning from sparse
  ...
2019-11-26 19:06:44 -08:00
Alexei Starovoitov
d0f0104341 bpf: Fix static checker warning
kernel/bpf/btf.c:4023 btf_distill_func_proto()
        error: potentially dereferencing uninitialized 't'.

kernel/bpf/btf.c
  4012          nargs = btf_type_vlen(func);
  4013          if (nargs >= MAX_BPF_FUNC_ARGS) {
  4014                  bpf_log(log,
  4015                          "The function %s has %d arguments. Too many.\n",
  4016                          tname, nargs);
  4017                  return -EINVAL;
  4018          }
  4019          ret = __get_type_size(btf, func->type, &t);
                                                       ^^
t isn't initialized for the first -EINVAL return

This is unlikely path, since BTF should have been validated at this point.
Fix it by returning 'void' BTF.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191126230106.237179-1-ast@kernel.org
2019-11-27 01:04:47 +01:00
Linus Torvalds
168829ad09 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - A comprehensive rewrite of the robust/PI futex code's exit handling
     to fix various exit races. (Thomas Gleixner et al)

   - Rework the generic REFCOUNT_FULL implementation using
     atomic_fetch_* operations so that the performance impact of the
     cmpxchg() loops is mitigated for common refcount operations.

     With these performance improvements the generic implementation of
     refcount_t should be good enough for everybody - and this got
     confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
     REFCOUNT_FULL entirely, leaving the generic implementation enabled
     unconditionally. (Will Deacon)

   - Other misc changes, fixes, cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  lkdtm: Remove references to CONFIG_REFCOUNT_FULL
  locking/refcount: Remove unused 'refcount_error_report()' function
  locking/refcount: Consolidate implementations of refcount_t
  locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
  locking/refcount: Move saturation warnings out of line
  locking/refcount: Improve performance of generic REFCOUNT_FULL code
  locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
  locking/refcount: Remove unused refcount_*_checked() variants
  locking/refcount: Ensure integer operands are treated as signed
  locking/refcount: Define constants for saturation and max refcount values
  futex: Prevent exit livelock
  futex: Provide distinct return value when owner is exiting
  futex: Add mutex around futex exit
  futex: Provide state handling for exec() as well
  futex: Sanitize exit state handling
  futex: Mark the begin of futex exit explicitly
  futex: Set task::futex_state to DEAD right after handling futex exit
  futex: Split futex_mm_release() for exit/exec
  exit/exec: Seperate mm_release()
  futex: Replace PF_EXITPIDONE with a state
  ...
2019-11-26 16:02:40 -08:00
Linus Torvalds
1ae78780ed Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Dynamic tick (nohz) updates, perhaps most notably changes to force
     the tick on when needed due to lengthy in-kernel execution on CPUs
     on which RCU is waiting.

   - Linux-kernel memory consistency model updates.

   - Replace rcu_swap_protected() with rcu_prepace_pointer().

   - Torture-test updates.

   - Documentation updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
  bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
  fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
  drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
  drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
  x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
  rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
  rcu: Suppress levelspread uninitialized messages
  rcu: Fix uninitialized variable in nocb_gp_wait()
  rcu: Update descriptions for rcu_future_grace_period tracepoint
  rcu: Update descriptions for rcu_nocb_wake tracepoint
  rcu: Remove obsolete descriptions for rcu_barrier tracepoint
  rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
  workqueue: Convert for_each_wq to use built-in list check
  rcu: Several rcu_segcblist functions can be static
  rcu: Remove unused function hlist_bl_del_init_rcu()
  Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
  ...
2019-11-26 15:42:43 -08:00
Linus Torvalds
77a05940ee Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The biggest changes in this cycle were:

   - Make kcpustat vtime aware (Frederic Weisbecker)

   - Rework the CFS load_balance() logic (Vincent Guittot)

   - Misc cleanups, smaller enhancements, fixes.

  The load-balancing rework is the most intrusive change: it replaces
  the old heuristics that have become less meaningful after the
  introduction of the PELT metrics, with a grounds-up load-balancing
  algorithm.

  As such it's not really an iterative series, but replaces the old
  load-balancing logic with the new one. We hope there are no
  performance regressions left - but statistically it's highly probable
  that there *is* going to be some workload that is hurting from these
  chnages. If so then we'd prefer to have a look at that workload and
  fix its scheduling, instead of reverting the changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  rackmeter: Use vtime aware kcpustat accessor
  leds: Use all-in-one vtime aware kcpustat accessor
  cpufreq: Use vtime aware kcpustat accessors for user time
  procfs: Use all-in-one vtime aware kcpustat accessor
  sched/vtime: Bring up complete kcpustat accessor
  sched/cputime: Support other fields on kcpustat_field()
  sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()
  sched/fair: Add comments for group_type and balancing at SD_NUMA level
  sched/fair: Fix rework of find_idlest_group()
  sched/uclamp: Fix overzealous type replacement
  sched/Kconfig: Fix spelling mistake in user-visible help text
  sched/core: Further clarify sched_class::set_next_task()
  sched/fair: Use mul_u32_u32()
  sched/core: Simplify sched_class::pick_next_task()
  sched/core: Optimize pick_next_task()
  sched/core: Make pick_next_task_idle() more consistent
  sched/fair: Better document newidle_balance()
  leds: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
  cpufreq: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
  procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM
  ...
2019-11-26 15:23:14 -08:00
Linus Torvalds
3f59dbcace Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main kernel side changes in this cycle were:

   - Various Intel-PT updates and optimizations (Alexander Shishkin)

   - Prohibit kprobes on Xen/KVM emulate prefixes (Masami Hiramatsu)

   - Add support for LSM and SELinux checks to control access to the
     perf syscall (Joel Fernandes)

   - Misc other changes, optimizations, fixes and cleanups - see the
     shortlog for details.

  There were numerous tooling changes as well - 254 non-merge commits.
  Here are the main changes - too many to list in detail:

   - Enhancements to core tooling infrastructure, perf.data, libperf,
     libtraceevent, event parsing, vendor events, Intel PT, callchains,
     BPF support and instruction decoding.

   - There were updates to the following tools:

        perf annotate
        perf diff
        perf inject
        perf kvm
        perf list
        perf maps
        perf parse
        perf probe
        perf record
        perf report
        perf script
        perf stat
        perf test
        perf trace

   - And a lot of other changes: please see the shortlog and Git log for
     more details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (279 commits)
  perf parse: Fix potential memory leak when handling tracepoint errors
  perf probe: Fix spelling mistake "addrees" -> "address"
  libtraceevent: Fix memory leakage in copy_filter_type
  libtraceevent: Fix header installation
  perf intel-bts: Does not support AUX area sampling
  perf intel-pt: Add support for decoding AUX area samples
  perf intel-pt: Add support for recording AUX area samples
  perf pmu: When using default config, record which bits of config were changed by the user
  perf auxtrace: Add support for queuing AUX area samples
  perf session: Add facility to peek at all events
  perf auxtrace: Add support for dumping AUX area samples
  perf inject: Cut AUX area samples
  perf record: Add aux-sample-size config term
  perf record: Add support for AUX area sampling
  perf auxtrace: Add support for AUX area sample recording
  perf auxtrace: Move perf_evsel__find_pmu()
  perf record: Add a function to test for kernel support for AUX area sampling
  perf tools: Add kernel AUX area sampling definitions
  perf/core: Make the mlock accounting simple again
  perf report: Jump to symbol source view from total cycles view
  ...
2019-11-26 15:04:47 -08:00
Linus Torvalds
3f61281390 Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stacktrace cleanup from Ingo Molnar:
 "A minor cleanup"

* 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stacktrace: Get rid of unneeded '!!' pattern
2019-11-26 14:47:19 -08:00
Eric W. Biederman
61a47c1ad3 sysctl: Remove the sysctl system call
This system call has been deprecated almost since it was introduced, and
in a survey of the linux distributions I can no longer find any of them
that enable CONFIG_SYSCTL_SYSCALL.  The only indication that I can find
that anyone might care is that a few of the defconfigs in the kernel
enable CONFIG_SYSCTL_SYSCALL.  However this appears in only 31 of 414
defconfigs in the kernel, so I suspect this symbols presence is simply
because it is harmless to include rather than because it is necessary.

As there appear to be no users of the sysctl system call, remove the
code.  As this removes one of the few uses of the internal kernel mount
of proc I hope this allows for even more simplifications of the proc
filesystem.

Cc: Alex Smith <alex.smith@imgtec.com>
Cc: Anders Berg <anders.berg@lsi.com>
Cc: Apelete Seketeli <apelete@seketeli.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chee Nouk Phoon <cnphoon@altera.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Christian Ruppert <christian.ruppert@abilis.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Harvey Hunt <harvey.hunt@imgtec.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hongliang Tao <taohl@lemote.com>
Cc: Hua Yan <yanh@lemote.com>
Cc: Huacai Chen <chenhc@lemote.com>
Cc: John Crispin <blogic@openwrt.org>
Cc: Jonas Jensen <jonas.jensen@gmail.com>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Jun Nie <jun.nie@linaro.org>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Kevin Wells <kevin.wells@nxp.com>
Cc: Kumar Gala <galak@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Noam Camus <noamc@ezchip.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Phil Edworthy <phil.edworthy@renesas.com>
Cc: Pierrick Hascoet <pierrick.hascoet@abilis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Roland Stigge <stigge@antcom.de>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Scott Telford <stelford@cadence.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Steven J. Hill <Steven.Hill@imgtec.com>
Cc: Tanmay Inamdar <tinamdar@apm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Wolfram Sang <w.sang@pengutronix.de>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-11-26 13:03:56 -06:00
Rafael J. Wysocki
5a97aa5bbc Merge branches 'pm-sleep', 'pm-domains', 'pm-opp' and 'powercap'
* pm-sleep:
  PM / wakeirq: remove unnecessary parentheses
  PM / core: Clean up some function headers in power.h
  PM / hibernate: memory_bm_find_bit(): Tighten node optimisation

* pm-domains:
  PM / Domains: Convert to dev_to_genpd_safe() in genpd_syscore_switch()
  mmc: tmio: Avoid boilerplate code in ->runtime_suspend()
  PM / Domains: Implement the ->start() callback for genpd
  PM / Domains: Introduce dev_pm_domain_start()

* pm-opp:
  PM / OPP: Support adjusting OPP voltages at runtime

* powercap:
  powercap/intel_rapl: add support for Cometlake desktop
  powercap/intel_rapl: add support for CometLake Mobile
2019-11-26 10:27:49 +01:00
Rafael J. Wysocki
6221403952 Merge branch 'pm-cpuidle'
* pm-cpuidle:
  cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()
  cpuidle: Allow idle injection to apply exit latency limit
  cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks
  cpuidle: teo: Avoid code duplication in conditionals
  cpuidle: teo: Avoid using "early hits" incorrectly
  cpuidle: teo: Exclude cpuidle overhead from computations
  cpuidle: Use nanoseconds as the unit of time
  cpuidle: Consolidate disabled state checks
  ACPI: processor_idle: Skip dummy wait if kernel is in guest
  cpuidle: Do not unset the driver if it is there already
  cpuidle: teo: Fix "early hits" handling for disabled idle states
  cpuidle: teo: Consider hits and misses metrics of disabled states
  cpuidle: teo: Rename local variable in teo_select()
  cpuidle: teo: Ignore disabled idle states that are too deep
2019-11-26 10:26:26 +01:00
Linus Torvalds
386403a115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Another merge window, another pull full of stuff:

   1) Support alternative names for network devices, from Jiri Pirko.

   2) Introduce per-netns netdev notifiers, also from Jiri Pirko.

   3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
      Larsen.

   4) Allow compiling out the TLS TOE code, from Jakub Kicinski.

   5) Add several new tracepoints to the kTLS code, also from Jakub.

   6) Support set channels ethtool callback in ena driver, from Sameeh
      Jubran.

   7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
      SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.

   8) Add XDP support to mvneta driver, from Lorenzo Bianconi.

   9) Lots of netfilter hw offload fixes, cleanups and enhancements,
      from Pablo Neira Ayuso.

  10) PTP support for aquantia chips, from Egor Pomozov.

  11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
      Josh Hunt.

  12) Add smart nagle to tipc, from Jon Maloy.

  13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
      Duvvuru.

  14) Add a flow mask cache to OVS, from Tonghao Zhang.

  15) Add XDP support to ice driver, from Maciej Fijalkowski.

  16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.

  17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.

  18) Support it in stmmac driver too, from Jose Abreu.

  19) Support TIPC encryption and auth, from Tuong Lien.

  20) Introduce BPF trampolines, from Alexei Starovoitov.

  21) Make page_pool API more numa friendly, from Saeed Mahameed.

  22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.

  23) Add UDP segmentation offload to cxgb4, Rahul Lakkireddy"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
  libbpf: Fix usage of u32 in userspace code
  mm: Implement no-MMU variant of vmalloc_user_node_flags
  slip: Fix use-after-free Read in slip_open
  net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
  macvlan: schedule bc_work even if error
  enetc: add support Credit Based Shaper(CBS) for hardware offload
  net: phy: add helpers phy_(un)lock_mdio_bus
  mdio_bus: don't use managed reset-controller
  ax88179_178a: add ethtool_op_get_ts_info()
  mlxsw: spectrum_router: Fix use of uninitialized adjacency index
  mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
  bpf: Simplify __bpf_arch_text_poke poke type handling
  bpf: Introduce BPF_TRACE_x helper for the tracing tests
  bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
  bpf, testing: Add various tail call test cases
  bpf, x86: Emit patchable direct jump as tail call
  bpf: Constant map key tracking for prog array pokes
  bpf: Add poke dependency tracking for prog array maps
  bpf: Add initial poke descriptor table for jit images
  bpf: Move owner type, jited info into array auxiliary data
  ...
2019-11-25 20:02:57 -08:00
Linus Torvalds
f838767555 Livepatching changes for 5.5
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl3bz1gACgkQUqAMR0iA
 lPI5fw//db5dOqAvBu/fz4k38Mc30okgCjtRh0+vhFXCCUXauQv2IhI19J2IiPpy
 4t/CjaUk2QSB06NDNUxt7XsuR0yAF4E0nJHUmkDKkN8UFsi7jAjxJ/92zH3x/LE0
 YWtVoWjGduO+QfLVlP22VVYh1pX5kOxXG2WTEiJtnkWYdkqtkEy7Cw2Rlzzrrym6
 6kIVi3nEPtn/hOnlF/Ii5SJWh+jJrSf+XwXiuIiBupT49Ujoa4KscmhkiHnAccXb
 ICJpsxBIdZLxHLe/c0YO3b8r4Hvms124vlIC19Z0l9VEJ++sphDOWRrL3Zq9tFw8
 FwIKq8Ex9UFfldOnpD5q/PRhM9Xgfsw8UYYZZpXQeW6z7qzv1iVpM6BQlt3dYQaL
 pb21UXIfzaTdZIsUgnetypbqWIFNZxovrhORpjqxo6WV4lSaVx4MPE2/Le/G8xPR
 DT+a6yyzTyM0XbZ0MCVDfuZ+ZRaA1zfKEcsYEu883b7yK4z+TZbT61vnEKqq8+40
 EgOZnNjBENZLRQY0RofQsO5zGwcaanVOtBOmYDXtP/fup8/1SRZ25zmmIVhvChJG
 iQwCDw6IMqnae/FsMb+kBTDCRJXpN028iYGY7UAoiCuvzc0qm0gJXsGdZLqMvjEh
 nEdKKN2ze+03s6I9AcISKdnbUVphhb/xeDKRBkMgcooWLrfWj5E=
 =i37E
 -----END PGP SIGNATURE-----

Merge tag 'livepatching-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching

Pull livepatching updates from Petr Mladek:

 - New API to track system state changes done be livepatch callbacks. It
   helps to maintain compatibility between livepatches.

 - Update Kconfig help text. ORC is another reliable unwinder.

 - Disable generic selftest timeout. Livepatch selftests have their own
   per-operation fine-grained timeouts.

* tag 'livepatching-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  x86/stacktrace: update kconfig help text for reliable unwinders
  livepatch: Selftests of the API for tracking system state changes
  livepatch: Documentation of the new API for tracking system state changes
  livepatch: Allow to distinguish different version of system state changes
  livepatch: Basic API to track system state changes
  livepatch: Keep replaced patches until post_patch callback is called
  selftests/livepatch: Disable the timeout
2019-11-25 19:43:48 -08:00
Linus Torvalds
436b2a8039 Printk changes for 5.5
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl3bpjoACgkQUqAMR0iA
 lPJJDA/+IJT4YCRp2TwV2jvIs0QzvXZrzEsxgCLibLE85mYTJgoQBD3W1bH2eyjp
 T/9U0Zh5PGr/84cHd4qiMxzo+5Olz930weG59NcO4RJBSr671aRYs5tJqwaQAZDR
 wlwaob5S28vUmjPxKulvxv6V3FdI79ZE9xrCOCSTQvz4iCLsGOu+Dn/qtF64pImX
 M/EXzPMBrByiQ8RTM4Ege8JoBqiCZPDG9GR3KPXIXQwEeQgIoeYxwRYakxSmSzz8
 W8NduFCbWavg/yHhghHikMiyOZeQzAt+V9k9WjOBTle3TGJegRhvjgI7508q3tXe
 jQTMGATBOPkIgFaZz7eEn/iBa3jZUIIOzDY93RYBmd26aBvwKLOma/Vkg5oGYl0u
 ZK+CMe+/xXl7brQxQ6JNsQhbSTjT+746LvLJlCvPbbPK9R0HeKNhsdKpGY3ugnmz
 VAnOFIAvWUHO7qx+J+EnOo5iiPpcwXZj4AjrwVrs/x5zVhzwQ+4DSU6rbNn0O1Ak
 ELrBqCQkQzh5kqK93jgMHeWQ9EOUp1Lj6PJhTeVnOx2x8tCOi6iTQFFrfdUPlZ6K
 2DajgrFhti4LvwVsohZlzZuKRm5EuwReLRSOn7PU5qoSm5rcouqMkdlYG/viwyhf
 mTVzEfrfemrIQOqWmzPrWEXlMj2mq8oJm4JkC+jJ/+HsfK4UU8I=
 =QCEy
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow to print symbolic error names via new %pe modifier.

 - Use pr_warn() instead of the remaining pr_warning() calls. Fix
   formatting of the related lines.

 - Add VSPRINTF entry to MAINTAINERS.

* tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (32 commits)
  checkpatch: don't warn about new vsprintf pointer extension '%pe'
  MAINTAINERS: Add VSPRINTF
  tools lib api: Renaming pr_warning to pr_warn
  ASoC: samsung: Use pr_warn instead of pr_warning
  lib: cpu_rmap: Use pr_warn instead of pr_warning
  trace: Use pr_warn instead of pr_warning
  dma-debug: Use pr_warn instead of pr_warning
  vgacon: Use pr_warn instead of pr_warning
  fs: afs: Use pr_warn instead of pr_warning
  sh/intc: Use pr_warn instead of pr_warning
  scsi: Use pr_warn instead of pr_warning
  platform/x86: intel_oaktrail: Use pr_warn instead of pr_warning
  platform/x86: asus-laptop: Use pr_warn instead of pr_warning
  platform/x86: eeepc-laptop: Use pr_warn instead of pr_warning
  oprofile: Use pr_warn instead of pr_warning
  of: Use pr_warn instead of pr_warning
  macintosh: Use pr_warn instead of pr_warning
  idsn: Use pr_warn instead of pr_warning
  ide: Use pr_warn instead of pr_warning
  crypto: n2: Use pr_warn instead of pr_warning
  ...
2019-11-25 19:40:40 -08:00
Linus Torvalds
1b96a41b42 Merge branch 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "There are several notable changes here:

   - Single thread migrating itself has been optimized so that it
     doesn't need threadgroup rwsem anymore.

   - Freezer optimization to avoid unnecessary frozen state changes.

   - cgroup ID unification so that cgroup fs ino is the only unique ID
     used for the cgroup and can be used to directly look up live
     cgroups through filehandle interface on 64bit ino archs. On 32bit
     archs, cgroup fs ino is still the only ID in use but it is only
     unique when combined with gen.

   - selftest and other changes"

* 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
  writeback: fix -Wformat compilation warnings
  docs: cgroup: mm: Fix spelling of "list"
  cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
  cgroup: use cgrp->kn->id as the cgroup ID
  kernfs: use 64bit inos if ino_t is 64bit
  kernfs: implement custom exportfs ops and fid type
  kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
  kernfs: convert kernfs_node->id from union kernfs_node_id to u64
  kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
  kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
  netprio: use css ID instead of cgroup ID
  writeback: use ino_t for inodes in tracepoints
  kernfs: fix ino wrap-around detection
  kselftests: cgroup: Avoid the reuse of fd after it is deallocated
  cgroup: freezer: don't change task and cgroups status unnecessarily
  cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
  cgroup: remove cgroup_enable_task_cg_lists() optimization
  cgroup: pids: use atomic64_t for pids->limit
  selftests: cgroup: Run test_core under interfering stress
  selftests: cgroup: Add task migration tests
  ...
2019-11-25 19:23:46 -08:00
Linus Torvalds
9391edee86 Merge branch 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "There have been sporadic reports of sanity checks in
  destroy_workqueue() failing spuriously over the years. This contains
  the fix and its follow-up changes / fixes.

  There's also a RCU annotation improvement"

* 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Add RCU annotation for pwq list walk
  workqueue: Fix pwq ref leak in rescuer_thread()
  workqueue: more destroy_workqueue() fixes
  workqueue: Minor follow-ups to the rescuer destruction change
  workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
  workqueue: Fix spurious sanity check failures in destroy_workqueue()
2019-11-25 18:57:12 -08:00
Linus Torvalds
0acefef584 threads-v5.5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXdfjBwAKCRCRxhvAZXjc
 onCBAP47WZ/ie7yjoDWhOI1QB7II3NGSzToakxpgJaWoB+NjTwEA7PGrSYVEbPrf
 pUhiEaEJ29t+cWUxX3+yDO+k7SA6BAY=
 =Ra58
 -----END PGP SIGNATURE-----

Merge tag 'threads-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread management updates from Christian Brauner:

 - A pidfd's fdinfo file currently contains the field "Pid:\t<pid>"
   where <pid> is the pid of the process in the pid namespace of the
   procfs instance the fdinfo file for the pidfd was opened in.

   The fdinfo file has now gained a new "NSpid:\t<ns-pid1>[\t<ns-pid2>[...]]"
   field which lists the pids of the process in all child pid namespaces
   provided the pid namespace of the procfs instance it is looked up
   under has an ancestoral relationship with the pid namespace of the
   process. If it does not 0 will be shown and no further pid namespaces
   will be listed. Tests included. (Christian Kellner)

 - If the process the pidfd references has already exited, print -1 for
   the Pid and NSpid fields in the pidfd's fdinfo file. Tests included.
   (me)

 - Add CLONE_CLEAR_SIGHAND. This lets callers clear all signal handler
   that are not SIG_DFL or SIG_IGN at process creation time. This
   originated as a feature request from glibc to improve performance and
   elimate races in their posix_spawn() implementation. Tests included.
   (me)

 - Add support for choosing a specific pid for a process with clone3().
   This is the feature which was part of the thread update for v5.4 but
   after a discussion at LPC in Lisbon we decided to delay it for one
   more cycle in order to make the interface more generic. This has now
   done. It is now possible to choose a specific pid in a whole pid
   namespaces (sub)hierarchy instead of just one pid namespace. In order
   to choose a specific pid the caller must have CAP_SYS_ADMIN in all
   owning user namespaces of the target pid namespaces. Tests included.
   (Adrian Reber)

 - Test improvements and extensions. (Andrei Vagin, me)

* tag 'threads-v5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests/clone3: skip if clone3() is ENOSYS
  selftests/clone3: check that all pids are released on error paths
  selftests/clone3: report a correct number of fails
  selftests/clone3: flush stdout and stderr before clone3() and _exit()
  selftests: add tests for clone3() with *set_tid
  fork: extend clone3() to support setting a PID
  selftests: add tests for clone3()
  tests: test CLONE_CLEAR_SIGHAND
  clone3: add CLONE_CLEAR_SIGHAND
  pid: use pid_has_task() in pidfd_open()
  exit: use pid_has_task() in do_wait()
  pid: use pid_has_task() in __change_pid()
  test: verify fdinfo for pidfd of reaped process
  pidfd: check pid has attached task in fdinfo
  pidfd: add tests for NSpid info in fdinfo
  pidfd: add NSpid entries to fdinfo
2019-11-25 18:36:49 -08:00
Linus Torvalds
752272f16d ARM:
- Data abort report and injection
 - Steal time support
 - GICv4 performance improvements
 - vgic ITS emulation fixes
 - Simplify FWB handling
 - Enable halt polling counters
 - Make the emulated timer PREEMPT_RT compliant
 
 s390:
 - Small fixes and cleanups
 - selftest improvements
 - yield improvements
 
 PPC:
 - Add capability to tell userspace whether we can single-step the guest.
 - Improve the allocation of XIVE virtual processor IDs
 - Rewrite interrupt synthesis code to deliver interrupts in virtual
   mode when appropriate.
 - Minor cleanups and improvements.
 
 x86:
 - XSAVES support for AMD
 - more accurate report of nested guest TSC to the nested hypervisor
 - retpoline optimizations
 - support for nested 5-level page tables
 - PMU virtualization optimizations, and improved support for nested
   PMU virtualization
 - correct latching of INITs for nested virtualization
 - IOAPIC optimization
 - TSX_CTRL virtualization for more TAA happiness
 - improved allocation and flushing of SEV ASIDs
 - many bugfixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJd27PMAAoJEL/70l94x66DspsH+gPc6YWtKJFJH58Zj8NrNh6y
 t0FwDFcvUa51+m4jaY4L5Y8+zqu1dZFnPPhFGqNWpxrjCEvE/glQJv3BiUX06Seh
 aYUHNymGoYCTJOHaaGhV+NlgQaDuZOCOkIsOLAPehyFd1KojwB+FRC0xmO6aROPw
 9yQgYrKuK1UUn5HwxBNrMS4+Xv+2iKv/9sTnq1G4W2qX2NZQg84LVPg1zIdkCh3D
 3GOvoCBEk3ivQqjmdE7rP/InPr0XvW0b6TFhchIk8J6jEIQFHsmOUefiTvTxsIHV
 OKAZwvyeYPrYHA/aDZpaBmY2aR0ydfKDUQcviNIJoF1vOktGs0hvl3VbsmG8QCg=
 =OSI1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - data abort report and injection
   - steal time support
   - GICv4 performance improvements
   - vgic ITS emulation fixes
   - simplify FWB handling
   - enable halt polling counters
   - make the emulated timer PREEMPT_RT compliant

  s390:
   - small fixes and cleanups
   - selftest improvements
   - yield improvements

  PPC:
   - add capability to tell userspace whether we can single-step the
     guest
   - improve the allocation of XIVE virtual processor IDs
   - rewrite interrupt synthesis code to deliver interrupts in virtual
     mode when appropriate.
   - minor cleanups and improvements.

  x86:
   - XSAVES support for AMD
   - more accurate report of nested guest TSC to the nested hypervisor
   - retpoline optimizations
   - support for nested 5-level page tables
   - PMU virtualization optimizations, and improved support for nested
     PMU virtualization
   - correct latching of INITs for nested virtualization
   - IOAPIC optimization
   - TSX_CTRL virtualization for more TAA happiness
   - improved allocation and flushing of SEV ASIDs
   - many bugfixes and cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  kvm: nVMX: Relax guest IA32_FEATURE_CONTROL constraints
  KVM: x86: Grab KVM's srcu lock when setting nested state
  KVM: x86: Open code shared_msr_update() in its only caller
  KVM: Fix jump label out_free_* in kvm_init()
  KVM: x86: Remove a spurious export of a static function
  KVM: x86: create mmu/ subdirectory
  KVM: nVMX: Remove unnecessary TLB flushes on L1<->L2 switches when L1 use apic-access-page
  KVM: x86: remove set but not used variable 'called'
  KVM: nVMX: Do not mark vmcs02->apic_access_page as dirty when unpinning
  KVM: vmx: use MSR_IA32_TSX_CTRL to hard-disable TSX on guest that lack it
  KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality
  KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID
  KVM: x86: do not modify masked bits of shared MSRs
  KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
  KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path
  KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one
  KVM: nVMX: Assume TLB entries of L1 and L2 are tagged differently if L0 use EPT
  KVM: x86: Unexport kvm_vcpu_reload_apic_access_page()
  KVM: nVMX: add CR4_LA57 bit to nested CR4_FIXED1
  KVM: nVMX: Use semi-colon instead of comma for exit-handlers initialization
  ...
2019-11-25 18:02:36 -08:00
Linus Torvalds
4ba380f616 arm64 updates for 5.5:
- On ARMv8 CPUs without hardware updates of the access flag, avoid
   failing cow_user_page() on PFN mappings if the pte is old. The patches
   introduce an arch_faults_on_old_pte() macro, defined as false on x86.
   When true, cow_user_page() makes the pte young before attempting
   __copy_from_user_inatomic().
 
 - Covert the synchronous exception handling paths in
   arch/arm64/kernel/entry.S to C.
 
 - FTRACE_WITH_REGS support for arm64.
 
 - ZONE_DMA re-introduced on arm64 to support Raspberry Pi 4
 
 - Several kselftest cases specific to arm64, together with a MAINTAINERS
   update for these files (moved to the ARM64 PORT entry).
 
 - Workaround for a Neoverse-N1 erratum where the CPU may fetch stale
   instructions under certain conditions.
 
 - Workaround for Cortex-A57 and A72 errata where the CPU may
   speculatively execute an AT instruction and associate a VMID with the
   wrong guest page tables (corrupting the TLB).
 
 - Perf updates for arm64: additional PMU topologies on HiSilicon
   platforms, support for CCN-512 interconnect, AXI ID filtering in the
   IMX8 DDR PMU, support for the CCPI2 uncore PMU in ThunderX2.
 
 - GICv3 optimisation to avoid a heavy barrier when accessing the
   ICC_PMR_EL1 register.
 
 - ELF HWCAP documentation updates and clean-up.
 
 - SMC calling convention conduit code clean-up.
 
 - KASLR diagnostics printed during boot
 
 - NVIDIA Carmel CPU added to the KPTI whitelist
 
 - Some arm64 mm clean-ups: use generic free_initrd_mem(), remove stale
   macro, simplify calculation in __create_pgd_mapping(), typos.
 
 - Kconfig clean-ups: CMDLINE_FORCE to depend on CMDLINE, choice for
   endinanness to help with allmodconfig.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl3YJswACgkQa9axLQDI
 XvFwYg//aTGhNLew3ADgW2TYal7LyqetRROixPBrzqHLu2A8No1+QxHMaKxpZVyf
 pt25tABuLtPHql3qBzE0ltmfbLVsPj/3hULo404EJb9HLRfUnVGn7gcPkc+p4YAr
 IYkYPXJbk6OlJ84vI+4vXmDEF12bWCqamC9qZ+h99qTpMjFXFO17DSJ7xQ8Xic3A
 HHgCh4uA7gpTVOhLxaS6KIw+AZNYwvQxLXch2+wj6agbGX79uw9BeMhqVXdkPq8B
 RTDJpOdS970WOT4cHWOkmXwsqqGRqgsgyu+bRUJ0U72+0y6MX0qSHIUnVYGmNc5q
 Dtox4rryYLvkv/hbpkvjgVhv98q3J1mXt/CalChWB5dG4YwhJKN2jMiYuoAvB3WS
 6dR7Dfupgai9gq1uoKgBayS2O6iFLSa4g58vt3EqUBqmM7W7viGFPdLbuVio4ycn
 CNF2xZ8MZR6Wrh1JfggO7Hc11EJdSqESYfHO6V/pYB4pdpnqJLDoriYHXU7RsZrc
 HvnrIvQWKMwNbqBvpNbWvK5mpBMMX2pEienA3wOqKNH7MbepVsG+npOZTVTtl9tN
 FL0ePb/mKJu/2+gW8ntiqYn7EzjKprRmknOiT2FjWWo0PxgJ8lumefuhGZZbaOWt
 /aTAeD7qKd/UXLKGHF/9v3q4GEYUdCFOXP94szWVPyLv+D9h8L8=
 =TPL9
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "Apart from the arm64-specific bits (core arch and perf, new arm64
  selftests), it touches the generic cow_user_page() (reviewed by
  Kirill) together with a macro for x86 to preserve the existing
  behaviour on this architecture.

  Summary:

   - On ARMv8 CPUs without hardware updates of the access flag, avoid
     failing cow_user_page() on PFN mappings if the pte is old. The
     patches introduce an arch_faults_on_old_pte() macro, defined as
     false on x86. When true, cow_user_page() makes the pte young before
     attempting __copy_from_user_inatomic().

   - Covert the synchronous exception handling paths in
     arch/arm64/kernel/entry.S to C.

   - FTRACE_WITH_REGS support for arm64.

   - ZONE_DMA re-introduced on arm64 to support Raspberry Pi 4

   - Several kselftest cases specific to arm64, together with a
     MAINTAINERS update for these files (moved to the ARM64 PORT entry).

   - Workaround for a Neoverse-N1 erratum where the CPU may fetch stale
     instructions under certain conditions.

   - Workaround for Cortex-A57 and A72 errata where the CPU may
     speculatively execute an AT instruction and associate a VMID with
     the wrong guest page tables (corrupting the TLB).

   - Perf updates for arm64: additional PMU topologies on HiSilicon
     platforms, support for CCN-512 interconnect, AXI ID filtering in
     the IMX8 DDR PMU, support for the CCPI2 uncore PMU in ThunderX2.

   - GICv3 optimisation to avoid a heavy barrier when accessing the
     ICC_PMR_EL1 register.

   - ELF HWCAP documentation updates and clean-up.

   - SMC calling convention conduit code clean-up.

   - KASLR diagnostics printed during boot

   - NVIDIA Carmel CPU added to the KPTI whitelist

   - Some arm64 mm clean-ups: use generic free_initrd_mem(), remove
     stale macro, simplify calculation in __create_pgd_mapping(), typos.

   - Kconfig clean-ups: CMDLINE_FORCE to depend on CMDLINE, choice for
     endinanness to help with allmodconfig"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
  arm64: Kconfig: add a choice for endianness
  kselftest: arm64: fix spelling mistake "contiguos" -> "contiguous"
  arm64: Kconfig: make CMDLINE_FORCE depend on CMDLINE
  MAINTAINERS: Add arm64 selftests to the ARM64 PORT entry
  arm64: kaslr: Check command line before looking for a seed
  arm64: kaslr: Announce KASLR status on boot
  kselftest: arm64: fake_sigreturn_misaligned_sp
  kselftest: arm64: fake_sigreturn_bad_size
  kselftest: arm64: fake_sigreturn_duplicated_fpsimd
  kselftest: arm64: fake_sigreturn_missing_fpsimd
  kselftest: arm64: fake_sigreturn_bad_size_for_magic0
  kselftest: arm64: fake_sigreturn_bad_magic
  kselftest: arm64: add helper get_current_context
  kselftest: arm64: extend test_init functionalities
  kselftest: arm64: mangle_pstate_invalid_mode_el[123][ht]
  kselftest: arm64: mangle_pstate_invalid_daif_bits
  kselftest: arm64: mangle_pstate_invalid_compat_toggle and common utils
  kselftest: arm64: extend toplevel skeleton Makefile
  drivers/perf: hisi: update the sccl_id/ccl_id for certain HiSilicon platform
  arm64: mm: reserve CMA and crashkernel in ZONE_DMA32
  ...
2019-11-25 15:39:19 -08:00
Linus Torvalds
e25645b181 linux-kselftest-5.5-rc1-kunit
This kselftest update for Linux 5.5-rc1 adds KUnit, a lightweight unit
 testing and mocking framework for the Linux kernel from Brendan Higgins.
 
 KUnit is not an end-to-end testing framework. It is currently supported
 on UML and sub-systems can write unit tests and run them in UML env.
 KUnit documentation is included in this update.
 
 In addition, this Kunit update adds 3 new kunit tests:
 
 - kunit test for proc sysctl from Iurii Zaikin
 - kunit test for the 'list' doubly linked list from David Gow
 - ext4 kunit test for decoding extended timestamps from Iurii Zaikin
 
 In the future KUnit will be linked to Kselftest framework to provide
 a way to trigger KUnit tests from user-space.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl3YfBQACgkQCwJExA0N
 QxzTjxAAiFhaDMhlpLhn1DpIUNvfKrIgDjJgajQAyMMs6TJK3OrD6J4WbpVD7wGo
 aqF9l6o64sY18JAo3s00j6AcAmVwNH7qzEEuzIQPjJvQ8C4sCWL3esEP4JHgFb2F
 snlSn5KjSsdC1D9N7uQIhgW76xPSyDrTwWQpglvmB9TwmJVBIl9zhu+unp73ufFJ
 N+ieDg8A6W/wDGYSq5JBSkJbuI0gL+daNwUYzxEEZIskndhpovOc82WAldECRm6x
 TfI0u39zTbrEO0DHgmYpyGbTN8TB2mXjH5HMjwg+KbHfKVTKKGvTK7XFs8mWGQpO
 n2meypZuwuIsRPOPcAVs+Gt2dc0jFODJVIV1EzA0WSv6TEdPqyhM/d13tHdCqjm9
 ic5wQ/hQQNEB1Dvg5ereXBaGGaoqP95y61ZpCS9vCXFXH+28E/B63Ebfs2IBIuqS
 Jv2KcoxabyZq3uGdjnn+mD7IM8rkvscRP4Ba31nXRgJIYDHAzqe7APN7y3on4NGx
 1q7lBlA3XZZ8qgo0zpLST20ck/qaL3tk4k8E1f8emh6CuyrCWtazgrWkMIlyEX0O
 8nre3uEAF9xUzB4+gZK4YmelN9Bld3Uv7Ippt1zTCiQ0FkEABQIMUrTZygy7Wfg6
 6qi4dk8frWW4Kt63gOXsMxr9FWTqDk+Ys4GPAVDVm1d0dzERn8k=
 =0zEe
 -----END PGP SIGNATURE-----

Merge tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest KUnit support gtom Shuah Khan:
 "This adds KUnit, a lightweight unit testing and mocking framework for
  the Linux kernel from Brendan Higgins.

  KUnit is not an end-to-end testing framework. It is currently
  supported on UML and sub-systems can write unit tests and run them in
  UML env. KUnit documentation is included in this update.

  In addition, this Kunit update adds 3 new kunit tests:

   - proc sysctl test from Iurii Zaikin

   - the 'list' doubly linked list test from David Gow

   - ext4 tests for decoding extended timestamps from Iurii Zaikin

  In the future KUnit will be linked to Kselftest framework to provide a
  way to trigger KUnit tests from user-space"

* tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (23 commits)
  lib/list-test: add a test for the 'list' doubly linked list
  ext4: add kunit test for decoding extended timestamps
  Documentation: kunit: Fix verification command
  kunit: Fix '--build_dir' option
  kunit: fix failure to build without printk
  MAINTAINERS: add proc sysctl KUnit test to PROC SYSCTL section
  kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
  MAINTAINERS: add entry for KUnit the unit testing framework
  Documentation: kunit: add documentation for KUnit
  kunit: defconfig: add defconfigs for building KUnit tests
  kunit: tool: add Python wrappers for running KUnit tests
  kunit: test: add tests for KUnit managed resources
  kunit: test: add the concept of assertions
  kunit: test: add tests for kunit test abort
  kunit: test: add support for test abort
  objtool: add kunit_try_catch_throw to the noreturn list
  kunit: test: add initial tests
  lib: enable building KUnit in lib/
  kunit: test: add the concept of expectations
  kunit: test: add assertion printing library
  ...
2019-11-25 15:01:30 -08:00
Arnd Bergmann
b111df8447 y2038: alarm: fix half-second cut-off
Changing alarm_itimer accidentally broke the logic for arithmetic
rounding of half seconds in the return code.

Change it to a constant based on NSEC_PER_SEC, as suggested by
Ben Hutchings.

Fixes: bd40a17576 ("y2038: itimer: change implementation to timespec64")
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-25 21:52:35 +01:00
Linus Torvalds
fb4b3d3fd0 for-5.5/io_uring-20191121
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxNwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps4kD/9SIDXhYhhE8fNqeAF7Uouu8fxgwnkY3hSI
 43vJwCziiDxWWJH5mYW7/83VNOMZKHIbiYMnU6iEUsRQ/sG/wI0wEfAQZDHLzCKt
 cko2q7zAC1/4rtoslwJ3q04hE2Ap/nb93ELZBVr7fOAuODBNFUp/vifAojvsMPKz
 hNMNPq/vYg7c/iYMZKSBdtjE3tqceFNBjAVNMB9dHKQLeexEy4ve7AjBeawWsSi7
 GesnQ5w5u5LqkMYwLslpv/oVjHiiFWgGnDAvBNvykQvVy+DfB54KSqMV11W1aqdU
 l6L+ENfZasEvlk1yMAth2Foq4vlscm5MKEb6VdJhXWHHXtXkcBmz7RBqPmjSvXCY
 wS5GZRw8oYtTcid0aQf+t/wgRNTDJsGsnsT32qto41No3Z7vlIDHUDxHZGTA+gEL
 E8j9rDx6EXMTo3EFbC8XZcfsorhPJ1HKAyw1YFczHtYzJEQUR9jJe3f/Q9u6K2Vy
 s/EhkVeHa/lEd7kb6mI+6lQjGe1FXl7AHauDuaaEfIOZA/xJB3Bad5Wjq1va1cUO
 TX+37zjzFzJghhSIBGYq7G7iT4AMecPQgxHzCdCyYfW5S4Uur9tMmIElwVPI/Pjl
 kDZ9gdg9lm6JifZ9Ab8QcGhuQQTF3frwX9VfgrVgcqyvm38AiYzVgL9ZJnxRS/Cy
 ZfLNkACXqQ==
 =YZ9s
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "A lot of stuff has been going on this cycle, with improving the
  support for networked IO (and hence unbounded request completion
  times) being one of the major themes. There's been a set of fixes done
  this week, I'll send those out as well once we're certain we're fully
  happy with them.

  This contains:

   - Unification of the "normal" submit path and the SQPOLL path (Pavel)

   - Support for sparse (and bigger) file sets, and updating of those
     file sets without needing to unregister/register again.

   - Independently sized CQ ring, instead of just making it always 2x
     the SQ ring size. This makes it more flexible for networked
     applications.

   - Support for overflowed CQ ring, never dropping events but providing
     backpressure on submits.

   - Add support for absolute timeouts, not just relative ones.

   - Support for generic cancellations. This divorces io_uring from
     workqueues as well, which additionally gets us one step closer to
     generic async system call support.

   - With cancellations, we can support grabbing the process file table
     as well, just like we do mm context. This allows support for system
     calls that create file descriptors, like accept4() support that's
     built on top of that.

   - Support for io_uring tracing (Dmitrii)

   - Support for linked timeouts. These abort an operation if it isn't
     completed by the time noted in the linke timeout.

   - Speedup tracking of poll requests

   - Various cleanups making the coder easier to follow (Jackie, Pavel,
     Bob, YueHaibing, me)

   - Update MAINTAINERS with new io_uring list"

* tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
  io_uring: make POLL_ADD/POLL_REMOVE scale better
  io-wq: remove now redundant struct io_wq_nulls_list
  io_uring: Fix getting file for non-fd opcodes
  io_uring: introduce req_need_defer()
  io_uring: clean up io_uring_cancel_files()
  io-wq: ensure free/busy list browsing see all items
  io-wq: ensure we have a stable view of ->cur_work for cancellations
  io_wq: add get/put_work handlers to io_wq_create()
  io_uring: check for validity of ->rings in teardown
  io_uring: fix potential deadlock in io_poll_wake()
  io_uring: use correct "is IO worker" helper
  io_uring: fix -ENOENT issue with linked timer with short timeout
  io_uring: don't do flush cancel under inflight_lock
  io_uring: flag SQPOLL busy condition to userspace
  io_uring: make ASYNC_CANCEL work with poll and timeout
  io_uring: provide fallback request for OOM situations
  io_uring: convert accept4() -ERESTARTSYS into -EINTR
  io_uring: fix error clear of ->file_table in io_sqe_files_register()
  io_uring: separate the io_free_req and io_free_req_find_next interface
  io_uring: keep io_put_req only responsible for release and put req
  ...
2019-11-25 10:40:27 -08:00
Ingo Molnar
83bae01182 Merge branch 'timers/urgent' into timers/core, to pick up fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25 15:43:15 +01:00
Ingo Molnar
de881a341c Merge branch 'sched/rt' into sched/core, to pick up commit
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25 13:48:11 +01:00
Will Deacon
2f30b36943 locking/refcount: Remove unused 'refcount_error_report()' function
'refcount_error_report()' has no callers. Remove it.

Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191121115902.2551-10-will@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25 09:15:42 +01:00
Ingo Molnar
c494cd6469 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-25 09:08:29 +01:00
Daniel Borkmann
b553a6ec57 bpf: Simplify __bpf_arch_text_poke poke type handling
Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
old_addr as well as new_addr, it's a bit redundant and unnecessarily
complicates __bpf_arch_text_poke() itself since we can derive the same from
the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
as types which also allows to clean up call-sites.

In addition to that, __bpf_arch_text_poke() currently verifies that text
matches expected old_insn before we invoke text_poke_bp(). Also add a check
on new_insn and skip rewrite if it already matches. Reason why this is rather
useful is that it avoids making any special casing in prog_array_map_poke_run()
when old and new prog were NULL and has the benefit that also for this case
we perform a check on text whether it really matches our expectations.

Suggested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
2019-11-24 17:12:11 -08:00
Daniel Borkmann
d2e4c1e6c2 bpf: Constant map key tracking for prog array pokes
Add tracking of constant keys into tail call maps. The signature of
bpf_tail_call_proto is that arg1 is ctx, arg2 map pointer and arg3
is a index key. The direct call approach for tail calls can be enabled
if the verifier asserted that for all branches leading to the tail call
helper invocation, the map pointer and index key were both constant
and the same.

Tracking of map pointers we already do from prior work via c93552c443
("bpf: properly enforce index mask to prevent out-of-bounds speculation")
and 09772d92cd ("bpf: avoid retpoline for lookup/update/ delete calls
on maps").

Given the tail call map index key is not on stack but directly in the
register, we can add similar tracking approach and later in fixup_bpf_calls()
add a poke descriptor to the progs poke_tab with the relevant information
for the JITing phase.

We internally reuse insn->imm for the rewritten BPF_JMP | BPF_TAIL_CALL
instruction in order to point into the prog's poke_tab, and keep insn->imm
as 0 as indicator that current indirect tail call emission must be used.
Note that publishing to the tracker must happen at the end of fixup_bpf_calls()
since adding elements to the poke_tab reallocates its memory, so we need
to wait until its in final state.

Future work can generalize and add similar approach to optimize plain
array map lookups. Difference there is that we need to look into the key
value that sits on stack. For clarity in bpf_insn_aux_data, map_state
has been renamed into map_ptr_state, so we get map_{ptr,key}_state as
trackers.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net
2019-11-24 17:04:11 -08:00
Daniel Borkmann
da765a2f59 bpf: Add poke dependency tracking for prog array maps
This work adds program tracking to prog array maps. This is needed such
that upon prog array updates/deletions we can fix up all programs which
make use of this tail call map. We add ops->map_poke_{un,}track()
helpers to maps to maintain the list of programs and ops->map_poke_run()
for triggering the actual update.

bpf_array_aux is extended to contain the list head and poke_mutex in
order to serialize program patching during updates/deletions.
bpf_free_used_maps() will untrack the program shortly before dropping
the reference to the map. For clearing out the prog array once all urefs
are dropped we need to use schedule_work() to have a sleepable context.

The prog_array_map_poke_run() is triggered during updates/deletions and
walks the maintained prog list. It checks in their poke_tabs whether the
map and key is matching and runs the actual bpf_arch_text_poke() for
patching in the nop or new jmp location. Depending on the type of update,
we use one of BPF_MOD_{NOP_TO_JUMP,JUMP_TO_NOP,JUMP_TO_JUMP}.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1fb364bb3c565b3e415d5ea348f036ff379e779d.1574452833.git.daniel@iogearbox.net
2019-11-24 17:04:11 -08:00
Daniel Borkmann
a66886fe6c bpf: Add initial poke descriptor table for jit images
Add initial poke table data structures and management to the BPF
prog that can later be used by JITs. Also add an instance of poke
specific data for tail call maps; plan for later work is to extend
this also for BPF static keys.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1db285ec2ea4207ee0455b3f8e191a4fc58b9ade.1574452833.git.daniel@iogearbox.net
2019-11-24 17:04:11 -08:00
Daniel Borkmann
2beee5f574 bpf: Move owner type, jited info into array auxiliary data
We're going to extend this with further information which is only
relevant for prog array at this point. Given this info is not used
in critical path, move it into its own structure such that the main
array map structure can be kept on diet.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/b9ddccdb0f6f7026489ee955f16c96381e1e7238.1574452833.git.daniel@iogearbox.net
2019-11-24 17:04:11 -08:00
Daniel Borkmann
6332be04c0 bpf: Move bpf_free_used_maps into sleepable section
We later on are going to need a sleepable context as opposed to plain
RCU callback in order to untrack programs we need to poke at runtime
and tracking as well as image update is performed under mutex.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/09823b1d5262876e9b83a8e75df04cf0467357a4.1574452833.git.daniel@iogearbox.net
2019-11-24 17:03:44 -08:00
Yonghong Song
581738a681 bpf: Provide better register bounds after jmp32 instructions
With latest llvm (trunk https://github.com/llvm/llvm-project),
test_progs, which has +alu32 enabled, failed for strobemeta.o.
The verifier output looks like below with edit to replace large
decimal numbers with hex ones.
 193: (85) call bpf_probe_read_user_str#114
   R0=inv(id=0)
 194: (26) if w0 > 0x1 goto pc+4
   R0_w=inv(id=0,umax_value=0xffffffff00000001)
 195: (6b) *(u16 *)(r7 +80) = r0
 196: (bc) w6 = w0
   R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
 197: (67) r6 <<= 32
   R6_w=inv(id=0,smax_value=0x7fffffff00000000,umax_value=0xffffffff00000000,
            var_off=(0x0; 0xffffffff00000000))
 198: (77) r6 >>= 32
   R6=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
 ...
 201: (79) r8 = *(u64 *)(r10 -416)
   R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
 202: (0f) r8 += r6
   R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
 203: (07) r8 += 9696
   R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
 ...
 255: (bf) r1 = r8
   R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
 ...
 257: (85) call bpf_probe_read_user_str#114
 R1 unbounded memory access, make sure to bounds check any array access into a map

The value range for register r6 at insn 198 should be really just 0/1.
The umax_value=0xffffffff caused later verification failure.

After jmp instructions, the current verifier already tried to use just
obtained information to get better register range. The current mechanism is
for 64bit register only. This patch implemented to tighten the range
for 32bit sub-registers after jmp32 instructions.
With the patch, we have the below range ranges for the
above code sequence:
 193: (85) call bpf_probe_read_user_str#114
   R0=inv(id=0)
 194: (26) if w0 > 0x1 goto pc+4
   R0_w=inv(id=0,smax_value=0x7fffffff00000001,umax_value=0xffffffff00000001,
            var_off=(0x0; 0xffffffff00000001))
 195: (6b) *(u16 *)(r7 +80) = r0
 196: (bc) w6 = w0
   R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0x1))
 197: (67) r6 <<= 32
   R6_w=inv(id=0,umax_value=0x100000000,var_off=(0x0; 0x100000000))
 198: (77) r6 >>= 32
   R6=inv(id=0,umax_value=1,var_off=(0x0; 0x1))
 ...
 201: (79) r8 = *(u64 *)(r10 -416)
   R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
 202: (0f) r8 += r6
   R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
 203: (07) r8 += 9696
   R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
 ...
 255: (bf) r1 = r8
   R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
 ...
 257: (85) call bpf_probe_read_user_str#114
 ...

At insn 194, the register R0 has better var_off.mask and smax_value.
Especially, the var_off.mask ensures later lshift and rshift
maintains proper value range.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191121170650.449030-1-yhs@fb.com
2019-11-24 16:58:46 -08:00
Toke Høiland-Jørgensen
071cdecec5 xdp: Fix cleanup on map free for devmap_hash map type
Tetsuo pointed out that it was not only the device unregister hook that was
broken for devmap_hash types, it was also cleanup on map free. So better
fix this as well.

While we're at it, there's no reason to allocate the netdev_map array for
DEVMAP_HASH, so skip that and adjust the cost accordingly.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20191121133612.430414-1-toke@redhat.com
2019-11-24 16:58:46 -08:00
Jason Gunthorpe
107e899874 mm/hmm: define the pre-processor related parts of hmm.h even if disabled
Only the function calls are stubbed out with static inlines that always
fail. This is the standard way to write a header for an optional component
and makes it easier for drivers that only optionally need HMM_MIRROR.

Link: https://lore.kernel.org/r/20191112202231.3856-5-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-11-23 19:56:44 -04:00
Jakub Kicinski
84bb46cd62 Revert "bpf: Emit audit messages upon successful prog load and unload"
This commit reverts commit 91e6015b08 ("bpf: Emit audit messages
upon successful prog load and unload") and its follow up commit
7599a896f2 ("audit: Move audit_log_task declaration under
CONFIG_AUDITSYSCALL") as requested by Paul Moore. The change needs
close review on linux-audit, tests etc.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-23 09:56:02 -08:00
Hassan Naveed
0e24220821 tracing: Use xarray for syscall trace events
Currently, a lot of memory is wasted for architectures like MIPS when
init_ftrace_syscalls() allocates the array for syscalls using kcalloc.
This is because syscalls numbers start from 4000, 5000 or 6000 and
array elements up to that point are unused.
Fix this by using a data structure more suited to storing sparsely
populated arrays. The XARRAY data structure, implemented using radix
trees, is much more memory efficient for storing the syscalls in
question.

Link: http://lkml.kernel.org/r/20191115234314.21599-1-hnaveed@wavecomp.com

Signed-off-by: Hassan Naveed <hnaveed@wavecomp.com>
Reviewed-by: Paul Burton <paulburton@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22 19:47:41 -05:00
Divya Indi
2887978714 tracing: Adding new functions for kernel access to Ftrace instances
Adding 2 new functions -
1) struct trace_array *trace_array_get_by_name(const char *name);

Return pointer to a trace array with given name. If it does not exist,
create and return pointer to the new trace array.

2) int trace_array_set_clr_event(struct trace_array *tr,
const char *system ,const char *event, bool enable);

Enable/Disable events to this trace array.

Additionally,
- To handle reference counters, export trace_array_put()
- Due to introduction of the above 2 new functions, we no longer need to
  export - ftrace_set_clr_event & trace_array_create APIs.

Link: http://lkml.kernel.org/r/1574276919-11119-2-git-send-email-divya.indi@oracle.com

Signed-off-by: Divya Indi <divya.indi@oracle.com>
Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22 19:41:08 -05:00
Krzysztof Kozlowski
fc809bc5ce tracing: Fix Kconfig indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
	$ sed -e 's/^        /\t/' -i */Kconfig

Link: http://lkml.kernel.org/r/20191120133807.12741-1-krzk@kernel.org

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22 19:41:08 -05:00
Xianting Tian
a82a4804b4 ring-buffer: Fix typos in function ring_buffer_producer
Fix spelling and other typos

Link: http://lkml.kernel.org/r/1573916755-32478-1-git-send-email-xianting_tian@126.com

Signed-off-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22 19:41:08 -05:00
Jakub Kicinski
a9f852e92e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f5489 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b0 ("s390/qeth: consolidate some duplicated HW cmd code").

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-22 16:27:24 -08:00
Linus Torvalds
34c36f4564 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Validate tunnel options length in act_tunnel_key, from Xin Long.

 2) Fix DMA sync bug in gve driver, from Adi Suresh.

 3) TSO kills performance on some r8169 chips due to HW issues, disable
    by default in that case, from Corinna Vinschen.

 4) Fix clock disable mismatch in fec driver, from Chubong Yuan.

 5) Fix interrupt status bits define in hns3 driver, from Huazhong Tan.

 6) Fix workqueue deadlocks in qeth driver, from Julian Wiedmann.

 7) Don't napi_disable() twice in r8152 driver, from Hayes Wang.

 8) Fix SKB extension memory leak, from Florian Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
  r8152: avoid to call napi_disable twice
  MAINTAINERS: Add myself as maintainer of virtio-vsock
  udp: drop skb extensions before marking skb stateless
  net: rtnetlink: prevent underflows in do_setvfinfo()
  can: m_can_platform: remove unnecessary m_can_class_resume() call
  can: m_can_platform: set net_device structure as driver data
  hv_netvsc: Fix send_table offset in case of a host bug
  hv_netvsc: Fix offset usage in netvsc_send_table()
  net-ipv6: IPV6_TRANSPARENT - check NET_RAW prior to NET_ADMIN
  sfc: Only cancel the PPS workqueue if it exists
  nfc: port100: handle command failure cleanly
  net-sysfs: fix netdev_queue_add_kobject() breakage
  r8152: Re-order napi_disable in rtl8152_close
  net: qca_spi: Move reset_count to struct qcaspi
  net: qca_spi: fix receive buffer size check
  net/ibmvnic: Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode
  Revert "net/ibmvnic: Fix EOI when running in XIVE mode"
  net/mlxfw: Verify FSM error code translation doesn't exceed array size
  net/mlx5: Update the list of the PCI supported devices
  net/mlx5: Fix auto group size calculation
  ...
2019-11-22 14:28:14 -08:00
Linus Torvalds
a6b0373ffc Power management regression fix for final 5.4
Fix problems with switching cpufreq drivers on some x86 systems with
 ACPI (and with changing the operation modes of the intel_pstate driver
 on those systems) introduced by recent changes related to the
 management of frequency limits in cpufreq.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl3X1LQSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxAgMQAKFu4e3IZ7vVMo3gu61r9amqVG03prMh
 Ca8lvMd0pxp1wtL6gSkAB0Kces1hwooz4bpA/Khm7igBfdilcNu9gv/d9EvGRCU1
 /iOWHrsYS4gQnWwjJ001x+PokQQEFv+uGw9a6vwlEVU4a6TImlJKF+99Beogzr7U
 tuTzEpim2ruMHFaP5EqjC/XcU+uzyW2rMnc7fH0t55W0saIT2trI6Nj+DJgZWPth
 eRbgS4aOy3Bt9foBCVOFtkXz5VvHS3EUMPI7DxOx3OTt8D469NBDYx/ui8y7frIF
 t3vME/UuWCJmEUJ7o0CqhVcB4TW5gZaIE7P2TO9pSCxPxKtPbROdfrR3snGPLjNl
 syD7yqwuYpDbfi+0v3Mi3fnUqnnsiXPg2mFCl/w0tgIKz1iAv20pOWnCqja4QreG
 Z+Ax1owzrEYlDaV5lDXYZd48dqItj4dMuUBSwXu8RBk+LE/Bjro0jnnK4zUD/M1m
 sukrvgFrIXpTgqSnm8sX4fMyZnEvmNtG2kAYl6F84k2EICbOLkXD9ypBANb/9C3h
 KcSzJFr76de18s5bnMBgoQRvfElbOotr63TyzeRo/6mF0G4+BpyAEsP7FzRSq+6u
 cd+lrCvItzwIgD/5XU1WpkoOoYM//c0Y9JUHKwtnd+FSp1psGJwlhgbbckazQ699
 A5DC7maAkkHb
 =S23p
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management regression fix from Rafael Wysocki:
 "Fix problems with switching cpufreq drivers on some x86 systems with
  ACPI (and with changing the operation modes of the intel_pstate driver
  on those systems) introduced by recent changes related to the
  management of frequency limits in cpufreq"

* tag 'pm-5.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: QoS: Invalidate frequency QoS requests after removal
2019-11-22 09:18:16 -08:00
Kusanagi Kouichi
0e4a459f56 tracing: Remove unnecessary DEBUG_FS dependency
Tracing replaced debugfs with tracefs.

Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20191120104350753.EWCT.12796.ppp.dion.ne.jp@dmta0009.auone-net.jp
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-22 16:19:13 +01:00
Nicolas Saenz Julienne
a7ba70f178 dma-mapping: treat dev->bus_dma_mask as a DMA limit
Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit should
contain the higher accessible DMA address.

Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-21 18:14:35 +01:00
Christoph Hellwig
d7293f79ca Merge branch 'for-next/zone-dma' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into dma-mapping-for-next
Pull in a stable branch from the arm64 tree that adds the zone_dma_bits
variable to avoid creating hard to resolve conflicts with that addition.
2019-11-21 18:13:03 +01:00
Paolo Bonzini
46f4f0aabc Merge branch 'kvm-tsx-ctrl' into HEAD
Conflicts:
	arch/x86/kvm/vmx/vmx.c
2019-11-21 12:03:40 +01:00
Alexander Shishkin
c4b7547974 perf/core: Make the mlock accounting simple again
Commit:

  d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")

does a lot of things to the mlock accounting arithmetics, while the only
thing that actually needed to happen is subtracting the part that is
charged to the mm from the part that is charged to the user, so that the
former isn't charged twice.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: songliubraving@fb.com
Link: https://lkml.kernel.org/r/20191120170640.54123-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21 07:37:50 +01:00
Frederic Weisbecker
74722bb223 sched/vtime: Bring up complete kcpustat accessor
Many callsites want to fetch the values of system, user, user_nice, guest
or guest_nice kcpustat fields altogether or at least a pair of these.

In that case calling kcpustat_field() for each requested field brings
unecessary overhead when we could fetch all of them in a row.

So provide kcpustat_cpu_fetch() that fetches the whole kcpustat array
in a vtime safe way under the same RCU and seqcount block.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191121024430.19938-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21 07:33:24 +01:00
Frederic Weisbecker
5a1c95580f sched/cputime: Support other fields on kcpustat_field()
Provide support for user, nice, guest and guest_nice fields through
kcpustat_field().

Whether we account the delta to a nice or not nice field is decided on
top of the nice value snapshot taken at the time we call kcpustat_field().
If the nice value of the task has been changed since the last vtime
update, we may have inacurrate distribution of the nice VS unnice
cputime.

However this is considered as a minor issue compared to the proper fix
that would involve interrupting the target on nice updates, which is
undesired on nohz_full CPUs.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191121024430.19938-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21 07:33:23 +01:00
David S. Miller
ee5a489fd9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-11-20

The following pull-request contains BPF updates for your *net-next* tree.

We've added 81 non-merge commits during the last 17 day(s) which contain
a total of 120 files changed, 4958 insertions(+), 1081 deletions(-).

There are 3 trivial conflicts, resolve it by always taking the chunk from
196e8ca748:

<<<<<<< HEAD
=======
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
>>>>>>> 196e8ca748

<<<<<<< HEAD
void *bpf_map_area_alloc(u64 size, int numa_node)
=======
static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
>>>>>>> 196e8ca748

<<<<<<< HEAD
        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
=======
        /* kmalloc()'ed memory can't be mmap()'ed */
        if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>>>>> 196e8ca748

The main changes are:

1) Addition of BPF trampoline which works as a bridge between kernel functions,
   BPF programs and other BPF programs along with two new use cases: i) fentry/fexit
   BPF programs for tracing with practically zero overhead to call into BPF (as
   opposed to k[ret]probes) and ii) attachment of the former to networking related
   programs to see input/output of networking programs (covering xdpdump use case),
   from Alexei Starovoitov.

2) BPF array map mmap support and use in libbpf for global data maps; also a big
   batch of libbpf improvements, among others, support for reading bitfields in a
   relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko.

3) Extend s390x JIT with usage of relative long jumps and loads in order to lift
   the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich.

4) Add BPF audit support and emit messages upon successful prog load and unload in
   order to have a timeline of events, from Daniel Borkmann and Jiri Olsa.

5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode
   (XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson.

6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API
   call named bpf_get_link_xdp_info() for retrieving the full set of prog
   IDs attached to XDP, from Toke Høiland-Jørgensen.

7) Add BTF support for array of int, array of struct and multidimensional arrays
   and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau.

8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo.

9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid
   xdping to be run as standalone, from Jiri Benc.

10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song.

11) Fix a memory leak in BPF fentry test run data, from Colin Ian King.

12) Various smaller misc cleanups and improvements mostly all over BPF selftests and
    samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 18:11:23 -08:00
Dmitry Safonov
7b8474466e time: Zero the upper 32-bits in __kernel_timespec on 32-bit
On compat interfaces, the high order bits of nanoseconds should be zeroed
out. This is because the application code or the libc do not guarantee
zeroing of these. If used without zeroing, kernel might be at risk of using
timespec values incorrectly.

Originally it was handled correctly, but lost during is_compat_syscall()
cleanup. Revert the condition back to check CONFIG_64BIT.

Fixes: 98f76206b3 ("compat: Cleanup in_compat_syscall() callers")
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20191121000303.126523-1-dima@arista.com
2019-11-21 01:17:58 +01:00
Daniel Borkmann
196e8ca748 bpf: Switch bpf_map_{area_alloc,area_mmapable_alloc}() to u64 size
Given we recently extended the original bpf_map_area_alloc() helper in
commit fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"),
we need to apply the same logic as in ff1c08e1f7 ("bpf: Change size
to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts,
extend it for bpf-next.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-11-20 23:18:58 +01:00
Daniel Borkmann
91e6015b08 bpf: Emit audit messages upon successful prog load and unload
Allow for audit messages to be emitted upon BPF program load and
unload for having a timeline of events. The load itself is in
syscall context, so additional info about the process initiating
the BPF prog creation can be logged and later directly correlated
to the unload event.

The only info really needed from BPF side is the globally unique
prog ID where then audit user space tooling can query / dump all
info needed about the specific BPF program right upon load event
and enrich the record, thus these changes needed here can be kept
small and non-intrusive to the core.

Raw example output:

  # auditctl -D
  # auditctl -a always,exit -F arch=x86_64 -S bpf
  # ausearch --start recent -m 1334
  [...]
  ----
  time->Wed Nov 20 12:45:51 2019
  type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier"
  type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
  type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD
  ----
  time->Wed Nov 20 12:45:51 2019
type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD
  ----
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org
2019-11-20 13:44:51 -08:00
Christoph Hellwig
68a33b1794 dma-direct: exclude dma_direct_map_resource from the min_low_pfn check
The valid memory address check in dma_capable only makes sense when mapping
normal memory, not when using dma_map_resource to map a device resource.
Add a new boolean argument to dma_capable to exclude that check for the
dma_map_resource case.

Fixes: b12d66278d ("dma-direct: check for overflows on 32 bit DMA addresses")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2019-11-20 20:31:41 +01:00
Christoph Hellwig
4268ac6ae5 dma-direct: don't check swiotlb=force in dma_direct_map_resource
When mapping resources we can't just use swiotlb ram for bounce
buffering.  Switch to a direct dma_capable check instead.

Fixes: cfced78696 ("dma-mapping: remove the default map_resource implementation")
Reported-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2019-11-20 20:31:41 +01:00
Dan Carpenter
50f579a239 dma-debug: clean up put_hash_bucket()
The put_hash_bucket() is a bit cleaner if it takes an unsigned long
directly instead of a pointer to unsigned long.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-20 20:31:41 +01:00
Christoph Hellwig
56e35f9c5b dma-mapping: drop the dev argument to arch_sync_dma_for_*
These are pure cache maintainance routines, so drop the unused
struct device argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2019-11-20 20:31:38 +01:00
Thomas Gleixner
407e62f52a irqchip updates for Linux 5.5
- Qualcomm PDC wakeup interrupt support
 - Layerscape external IRQ support
 - Broadcom bcm7038 PM and wakeup support
 - Ingenic driver cleanup and modernization
 - GICv3 ITS preparation for GICv4.1 updates
 - GICv4 fixes
 - Various cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl3VJasPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDMLEP/2U55GLmPuNqW/na4YFmcOYAkoP0WpEKDzn9
 u8lBi8CukKl1Z2JLAyP0E1e5iOVS0exvQ5V4OwxGKmeR9oWmB3Lym8UWRw7vcKEF
 HMKgtPCd3U3J5jW0P4Hr8hn6Q+B55fdMrrAaJcgsfBVB7bRB0lC0LZYGtN4VC4d8
 rTQzup5CK8Mu9k4NztLCxxoBUKFoqM+ZKsrRB2eOXB9amcPQtFvwC+5ZL2tDr3wS
 7d+pd6G4A+hsloIDUxoH9BrO/jd1jlfHyBRDFJIgpo/IQWVT6ciQECZomRR1pW30
 bGFYBf/HPFqfyH+ZOrWprSAd0Yx33WtYaMokYaJ6vGu4wedyxh/1LTRLzL0tWuyZ
 tPFvEmiiP/Hoeq1JHFRFQUO/75ckqALLeAxCjACCN8+F2Z0armk1W/iwehZNQHHV
 JdDXegRNUlMipG2kk5D3L6AK28bi+3+axc1ERMN1RO40eLm8NLogWL2TJlxLbyUe
 lMZMe43ceC0McGnQpAY8qyuC7IycQtngKNBvzG+6ADucGpFez3gYxh39RR43XMVo
 37Hsj+Ur7CFBJj6WTCzV2teC/WaXXQkJYxn6fsHNmUgdwPgGD3LppxhlWG49Ao9w
 x8ZnfyrrYmcFOJrKbT45ExMihioaGf8dyksKZNA/Z4dI0g/kf0LyYi5ujZaDDilI
 eDkMI/xI
 =uENR
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

 - Qualcomm PDC wakeup interrupt support
 - Layerscape external IRQ support
 - Broadcom bcm7038 PM and wakeup support
 - Ingenic driver cleanup and modernization
 - GICv3 ITS preparation for GICv4.1 updates
 - GICv4 fixes
 - Various cleanups
2019-11-20 14:16:34 +01:00
Luc Van Oostenryck
9e77716a75 fork: fix pidfd_poll()'s return type
pidfd_poll() is defined as returning 'unsigned int' but the
.poll method is declared as returning '__poll_t', a bitwise type.

Fix this by using the proper return type and using the EPOLL
constants instead of the POLL ones, as required for __poll_t.

Fixes: b53b0b9d9a ("pidfd: add polling support")
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: stable@vger.kernel.org # 5.3
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20191120003320.31138-1-luc.vanoostenryck@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-20 11:48:50 +01:00
Daniel Lezcano
5aa9ba6312 cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()
Modify cpuidle_use_deepest_state() to take an additional exit latency
limit argument to be passed to find_deepest_idle_state() and make
cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for
forced idle.

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Rebase and rearrange code, subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 11:46:18 +01:00
Daniel Lezcano
c55b51a06b cpuidle: Allow idle injection to apply exit latency limit
In some cases it may be useful to specify an exit latency limit for
the idle state to be used during CPU idle time injection.

Instead of duplicating the information in struct cpuidle_device
or propagating the latency limit in the call stack, replace the
use_deepest_state field with forced_latency_limit_ns to represent
that limit, so that the deepest idle state with exit latency within
that limit is forced (i.e. no governors) when it is set.

A zero exit latency limit for forced idle means to use governors in
the usual way (analogous to use_deepest_state equal to "false" before
this change).

Additionally, add play_idle_precise() taking two arguments, the
duration of forced idle and the idle state exit latency limit, both
in nanoseconds, and redefine play_idle() as a wrapper around that
new function.

This change is preparatory, no functional impact is expected.

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 11:32:55 +01:00
Ingo Molnar
5cbaefe974 kcsan: Improve various small stylistic details
Tidy up a few bits:

  - Fix typos and grammar, improve wording.

  - Remove spurious newlines that are col80 warning artifacts where the
    resulting line-break is worse than the disease it's curing.

  - Use core kernel coding style to improve readability and reduce
    spurious code pattern variations.

  - Use better vertical alignment for structure definitions and initialization
    sequences.

  - Misc other small details.

No change in functionality intended.

Cc: linux-kernel@vger.kernel.org
Cc: Marco Elver <elver@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-20 10:47:23 +01:00
Rafael J. Wysocki
05ff1ba412 PM: QoS: Invalidate frequency QoS requests after removal
Switching cpufreq drivers (or switching operation modes of the
intel_pstate driver from "active" to "passive" and vice versa)
does not work on some x86 systems with ACPI after commit
3000ce3c52 ("cpufreq: Use per-policy frequency QoS"), because
the ACPI _PPC and thermal code uses the same frequency QoS request
object for a given CPU every time a cpufreq driver is registered
and freq_qos_remove_request() does not invalidate the request after
removing it from its QoS list, so freq_qos_add_request() complains
and fails when that request is passed to it again.

Fix the issue by modifying freq_qos_remove_request() to clear the qos
and type fields of the frequency request pointed to by its argument
after removing it from its QoS list so as to invalidate it.

Fixes: 3000ce3c52 ("cpufreq: Use per-policy frequency QoS")
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-11-20 10:46:42 +01:00
Thomas Gleixner
3ef240eaff futex: Prevent exit livelock
Oleg provided the following test case:

int main(void)
{
	struct sched_param sp = {};

	sp.sched_priority = 2;
	assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

	int lock = vfork();
	if (!lock) {
		sp.sched_priority = 1;
		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
		_exit(0);
	}

	syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
	return 0;
}

This creates an unkillable RT process spinning in futex_lock_pi() on a UP
machine or if the process is affine to a single CPU. The reason is:

 parent	    	    			child

  set FIFO prio 2

  vfork()			->	set FIFO prio 1
   implies wait_for_child()	 	sched_setscheduler(...)
 			   		exit()
					do_exit()
 					....
					mm_release()
					  tsk->futex_state = FUTEX_STATE_EXITING;
					  exit_futex(); (NOOP in this case)
					  complete() --> wakes parent
  sys_futex()
    loop infinite because
    tsk->futex_state == FUTEX_STATE_EXITING

The same problem can happen just by regular preemption as well:

  task holds futex
  ...
  do_exit()
    tsk->futex_state = FUTEX_STATE_EXITING;

  --> preemption (unrelated wakeup of some other higher prio task, e.g. timer)

  switch_to(other_task)

  return to user
  sys_futex()
	loop infinite as above

Just for the fun of it the futex exit cleanup could trigger the wakeup
itself before the task sets its futex state to DEAD.

To cure this, the handling of the exiting owner is changed so:

   - A refcount is held on the task

   - The task pointer is stored in a caller visible location

   - The caller drops all locks (hash bucket, mmap_sem) and blocks
     on task::futex_exit_mutex. When the mutex is acquired then
     the exiting task has completed the cleanup and the state
     is consistent and can be reevaluated.

This is not a pretty solution, but there is no choice other than returning
an error code to user space, which would break the state consistency
guarantee and open another can of problems including regressions.

For stable backports the preparatory commits ac31c7ff86 .. ba31c1a485
are required as well, but for anything older than 5.3.y the backports are
going to be provided when this hits mainline as the other dependencies for
those kernels are definitely not stable material.

Fixes: 778e9a9c3e ("pi-futex: fix exit races and locking problems")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stable Team <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
2019-11-20 09:40:38 +01:00
Thomas Gleixner
ac31c7ff86 futex: Provide distinct return value when owner is exiting
attach_to_pi_owner() returns -EAGAIN for various cases:

 - Owner task is exiting
 - Futex value has changed

The caller drops the held locks (hash bucket, mmap_sem) and retries the
operation. In case of the owner task exiting this can result in a live
lock.

As a preparatory step for seperating those cases, provide a distinct return
value (EBUSY) for the owner exiting case.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de
2019-11-20 09:40:10 +01:00
Thomas Gleixner
3f186d9748 futex: Add mutex around futex exit
The mutex will be used in subsequent changes to replace the busy looping of
a waiter when the futex owner is currently executing the exit cleanup to
prevent a potential live lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de
2019-11-20 09:40:10 +01:00
Thomas Gleixner
af8cbda2cf futex: Provide state handling for exec() as well
exec() attempts to handle potentially held futexes gracefully by running
the futex exit handling code like exit() does.

The current implementation has no protection against concurrent incoming
waiters. The reason is that the futex state cannot be set to
FUTEX_STATE_DEAD after the cleanup because the task struct is still active
and just about to execute the new binary.

While its arguably buggy when a task holds a futex over exec(), for
consistency sake the state handling can at least cover the actual futex
exit cleanup section. This provides state consistency protection accross
the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the
cleanup has been finished, this cannot prevent subsequent attempts to
attach to the task in case that the cleanup was not successfull in mopping
up all leftovers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de
2019-11-20 09:40:09 +01:00
Thomas Gleixner
4a8e991b91 futex: Sanitize exit state handling
Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock move
the state setting into to the lock section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de
2019-11-20 09:40:09 +01:00
Thomas Gleixner
18f694385c futex: Mark the begin of futex exit explicitly
Instead of relying on PF_EXITING use an explicit state for the futex exit
and set it in the futex exit function. This moves the smp barrier and the
lock/unlock serialization into the futex code.

As with the DEAD state this is restricted to the exit path as exec
continues to use the same task struct.

This allows to simplify that logic in a next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
2019-11-20 09:40:09 +01:00
Thomas Gleixner
f24f22435d futex: Set task::futex_state to DEAD right after handling futex exit
Setting task::futex_state in do_exit() is rather arbitrarily placed for no
reason. Move it into the futex code.

Note, this is only done for the exit cleanup as the exec cleanup cannot set
the state to FUTEX_STATE_DEAD because the task struct is still in active
use.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de
2019-11-20 09:40:08 +01:00
Thomas Gleixner
150d71584b futex: Split futex_mm_release() for exit/exec
To allow separate handling of the futex exit state in the futex exit code
for exit and exec, split futex_mm_release() into two functions and invoke
them from the corresponding exit/exec_mm_release() callsites.

Preparatory only, no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de
2019-11-20 09:40:08 +01:00
Thomas Gleixner
4610ba7ad8 exit/exec: Seperate mm_release()
mm_release() contains the futex exit handling. mm_release() is called from
do_exit()->exit_mm() and from exec()->exec_mm().

In the exit_mm() case PF_EXITING and the futex state is updated. In the
exec_mm() case these states are not touched.

As the futex exit code needs further protections against exit races, this
needs to be split into two functions.

Preparatory only, no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
2019-11-20 09:40:08 +01:00
Thomas Gleixner
3d4775df0a futex: Replace PF_EXITPIDONE with a state
The futex exit handling relies on PF_ flags. That's suboptimal as it
requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in
the middle of do_exit() to enforce the observability of PF_EXITING in the
futex code.

Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
over to the new state. The PF_EXITING dependency will be cleaned up in a
later step.

This prepares for handling various futex exit issues later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
2019-11-20 09:40:07 +01:00
Thomas Gleixner
ba31c1a485 futex: Move futex exit handling into futex code
The futex exit handling is #ifdeffed into mm_release() which is not pretty
to begin with. But upcoming changes to address futex exit races need to add
more functionality to this exit code.

Split it out into a function, move it into futex code and make the various
futex exit functions static.

Preparatory only and no functional change.

Folded build fix from Borislav.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de
2019-11-20 09:40:07 +01:00
YueHaibing
b2e2f0e6a6 bpf: Make array_map_mmap static
Fix sparse warning:

kernel/bpf/arraymap.c:481:5: warning:
 symbol 'array_map_mmap' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191119142113.15388-1-yuehaibing@huawei.com
2019-11-19 16:57:32 -08:00
Ingo Molnar
8e1d58ae0c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/kcsan
Pull the KCSAN subsystem from Paul E. McKenney:

   "This pull request contains base kernel concurrency sanitizer
    (KCSAN) enablement for x86, courtesy of Marco Elver.  KCSAN is a
    sampling watchpoint-based data-race detector, and is documented in
    Documentation/dev-tools/kcsan.rst.  KCSAN was announced in September,
    and much feedback has since been incorporated:

      http://lkml.kernel.org/r/CANpmjNPJ_bHjfLZCAPV23AXFfiPiyXXqqu72n6TgWzb2Gnu1eA@mail.gmail.com

    The data races located thus far have resulted in a number of fixes:

      https://github.com/google/ktsan/wiki/KCSAN#upstream-fixes-of-data-races-found-by-kcsan

    Additional information may be found here:

      https://lore.kernel.org/lkml/20191114180303.66955-1-elver@google.com/
   "

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-19 19:56:28 +01:00
Steven Rostedt (VMware)
46f9469247 ftrace: Rename ftrace_graph_stub to ftrace_stub_graph
The ftrace_graph_stub was created and points to ftrace_stub as a way to
assign the functon graph tracer function pointer to a stub function with a
different prototype than what ftrace_stub has and not trigger the C
verifier. The ftrace_graph_stub was created via the linker script
vmlinux.lds.h. Unfortunately, powerpc already uses the name
ftrace_graph_stub for its internal implementation of the function graph
tracer, and even though powerpc would still build, the change via the linker
script broke function tracer on powerpc from working.

By using the name ftrace_stub_graph, which does not exist anywhere else in
the kernel, this should not be a problem.

Link: https://lore.kernel.org/r/1573849732.5937.136.camel@lca.pw

Fixes: b83b43ffc6 ("fgraph: Fix function type mismatches of ftrace_graph_return using ftrace_stub")
Reorted-by: Qian Cai <cai@lca.pw>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-18 11:42:16 -05:00
Steven Rostedt (VMware)
ea806eb3ea ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization
If a direct ftrace callback is at a location that does not have any other
ftrace helpers attached to it, it is possible to simply just change the
text to call the new caller (if the architecture supports it). But this
requires special architecture code. Currently, modify_ftrace_direct() uses a
trick to add a stub ftrace callback to the location forcing it to call the
ftrace iterator. Then it can change the direct helper to call the new
function in C, and then remove the stub. Removing the stub will have the
location now call the new location that the direct helper is using.

The new helper function does the registering the stub trick, but is a weak
function, allowing an architecture to override it to do something a bit more
direct.

Link: https://lore.kernel.org/r/20191115215125.mbqv7taqnx376yed@ast-mbp.dhcp.thefacebook.com

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-18 11:42:09 -05:00
Alexander Shishkin
36b3db03b4 perf/core: Fix the mlock accounting, again
Commit:

  5e6c3c7b1e ("perf/aux: Fix tracking of auxiliary trace buffer allocation")

tried to guess the correct combination of arithmetic operations that would
undo the AUX buffer's mlock accounting, and failed, leaking the bottom part
when an allocation needs to be charged partially to both user->locked_vm
and mm->pinned_vm, eventually leaving the user with no locked bonus:

  $ perf record -e intel_pt//u -m1,128 uname
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.061 MB perf.data ]

  $ perf record -e intel_pt//u -m1,128 uname
  Permission error mapping pages.
  Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
  or try again with a smaller value of -m/--mmap_pages.
  (current value: 1,128)

Fix this by subtracting both locked and pinned counts when AUX buffer is
unmapped.

Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 16:27:37 +01:00
Vincent Guittot
bef69dd878 sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has rate limitation.

When a task is attached on a CPU, we have this call path:

update_load_avg()
  update_cfs_rq_load_avg()
    cfs_rq_util_change -- > trig frequency update
  attach_entity_load_avg()
    cfs_rq_util_change -- > trig frequency update

The 1st frequency update will not take into account the utilization of the
newly attached task and the 2nd one might be discarded because of rate
limitation of the cpufreq driver.

update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg() so we can move the call to
cfs_rq_util_change/cpufreq_update_util() into these two functions.

It's also interesting to note that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.

This change will also ensure that cpufreq_update_util() is called even
when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
IRQ, RT or DL PELT signals.

[ mingo: Minor updates. ]

Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@redhat.com
Cc: linux-pm@vger.kernel.org
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Cc: sargun@sargun.me
Cc: srinivas.pandruvada@linux.intel.com
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Fixes: 039ae8bcf7 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:42:26 +01:00
Ingo Molnar
b21feab0b8 Linux 5.4-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3RzgkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN18H/0JZbfIpy8/4Irol
 0va7Aj2fBi1a5oxfqYsMKN0u3GKbN3OV9tQ+7w1eBNGvL72TGadgVTzTY+Im7A9U
 UjboAc7jDPCG+YhIwXFufMiIAq5jDIj6h0LDas7ALsMfsnI/RhTwgNtLTAkyI3dH
 YV/6ljFULwueJHCxzmrYbd1x39PScj3kCNL2pOe6On7rXMKOemY/nbbYYISxY30E
 GMgKApSS+li7VuSqgrKoq5Qaox26LyR2wrXB1ij4pqEJ9xgbnKRLdHuvXZnE+/5p
 46EMirt+yeSkltW3d2/9MoCHaA76ESzWMMDijLx7tPgoTc3RB3/3ZLsm3rYVH+cR
 cRlNNSk=
 =0+Cg
 -----END PGP SIGNATURE-----

Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependencies

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:41:02 +01:00
Vincent Guittot
a9723389cc sched/fair: Add comments for group_type and balancing at SD_NUMA level
Add comments to describe each state of goup_type and to add some details
about the load balance at NUMA level.

[ Valentin Schneider: Updates to the comments. ]
[ mingo: Other updates to the comments. ]

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1573570243-1903-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:33:12 +01:00
Vincent Guittot
3318544b72 sched/fair: Fix rework of find_idlest_group()
The task, for which the scheduler looks for the idlest group of CPUs, must
be discounted from all statistics in order to get a fair comparison
between groups. This includes utilization, load, nr_running and idle_cpus.

Such unfairness can be easily highlighted with the unixbench execl 1 task.
This test continuously call execve() and the scheduler looks for the idlest
group/CPU on which it should place the task. Because the task runs on the
local group/CPU, the latter seems already busy even if there is nothing
else running on it. As a result, the scheduler will always select another
group/CPU than the local one.

This recovers most of the performance regression on my system from the
recent load-balancer rewrite.

[ mingo: Minor cleanups. ]

Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Link: https://lkml.kernel.org/r/1571762798-25900-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:11:56 +01:00
Andrii Nakryiko
fc9702273e bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY
Add ability to memory-map contents of BPF array map. This is extremely useful
for working with BPF global data from userspace programs. It allows to avoid
typical bpf_map_{lookup,update}_elem operations, improving both performance
and usability.

There had to be special considerations for map freezing, to avoid having
writable memory view into a frozen map. To solve this issue, map freezing and
mmap-ing is happening under mutex now:
  - if map is already frozen, no writable mapping is allowed;
  - if map has writable memory mappings active (accounted in map->writecnt),
    map freezing will keep failing with -EBUSY;
  - once number of writable memory mappings drops to zero, map freezing can be
    performed again.

Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
can't be memory mapped either.

For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
to be mmap()'able. We also need to make sure that array data memory is
page-sized and page-aligned, so we over-allocate memory in such a way that
struct bpf_array is at the end of a single page of memory with array->value
being aligned with the start of the second page. On deallocation we need to
accomodate this memory arrangement to free vmalloc()'ed memory correctly.

One important consideration regarding how memory-mapping subsystem functions.
Memory-mapping subsystem provides few optional callbacks, among them open()
and close().  close() is called for each memory region that is unmapped, so
that users can decrease their reference counters and free up resources, if
necessary. open() is *almost* symmetrical: it's called for each memory region
that is being mapped, **except** the very first one. So bpf_map_mmap does
initial refcnt bump, while open() will do any extra ones after that. Thus
number of close() calls is equal to number of open() calls plus one more.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
2019-11-18 11:41:59 +01:00
Andrii Nakryiko
85192dbf4d bpf: Convert bpf_prog refcnt to atomic64_t
Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
and remove artificial 32k limit. This allows to make bpf_prog's refcounting
non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.

Validated compilation by running allyesconfig kernel build.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
2019-11-18 11:41:59 +01:00
Andrii Nakryiko
1e0bd5a091 bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails
92117d8443 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
(32k). Due to using 32-bit counter, it's possible in practice to overflow
refcounter and make it wrap around to 0, causing erroneous map free, while
there are still references to it, causing use-after-free problems.

But having a failing refcounting operations are problematic in some cases. One
example is mmap() interface. After establishing initial memory-mapping, user
is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
splitting it into multiple non-contiguous regions. All this happening without
any control from the users of mmap subsystem. Rather mmap subsystem sends
notifications to original creator of memory mapping through open/close
callbacks, which are optionally specified during initial memory mapping
creation. These callbacks are used to maintain accurate refcount for bpf_map
(see next patch in this series). The problem is that open() callback is not
supposed to fail, because memory-mapped resource is set up and properly
referenced. This is posing a problem for using memory-mapping with BPF maps.

One solution to this is to maintain separate refcount for just memory-mappings
and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
There are similar use cases in current work on tcp-bpf, necessitating extra
counter as well. This seems like a rather unfortunate and ugly solution that
doesn't scale well to various new use cases.

Another approach to solve this is to use non-failing refcount_t type, which
uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
stays there. This utlimately causes memory leak, but prevents use after free.

But given refcounting is not the most performance-critical operation with BPF
maps (it's not used from running BPF program code), we can also just switch to
64-bit counter that can't overflow in practice, potentially disadvantaging
32-bit platforms a tiny bit. This simplifies semantics and allows above
described scenarios to not worry about failing refcount increment operation.

In terms of struct bpf_map size, we are still good and use the same amount of
space:

BEFORE (3 cache lines, 8 bytes of padding at the end):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
	atomic_t                   usercnt;              /*   132     4 */
	struct work_struct work;                         /*   136    32 */
	char                       name[16];             /*   168    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 146, holes: 1, sum holes: 38 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

AFTER (same 3 cache lines, no extra padding now):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
	atomic64_t                 usercnt;              /*   136     8 */
	struct work_struct work;                         /*   144    32 */
	char                       name[16];             /*   176    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 154, holes: 1, sum holes: 38 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

This patch, while modifying all users of bpf_map_inc, also cleans up its
interface to match bpf_map_put with separate operations for bpf_map_inc and
bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
respectively). Also, given there are no users of bpf_map_inc_not_zero
specifying uref=true, remove uref flag and default to uref=false internally.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
2019-11-18 11:41:59 +01:00
David S. Miller
949610ddd0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-11-15

The following pull-request contains BPF updates for your *net* tree.

We've added 1 non-merge commits during the last 9 day(s) which contain
a total of 1 file changed, 3 insertions(+), 1 deletion(-).

The main changes are:

1) Fix a missing unlock of bpf_devs_lock in bpf_offload_dev_create()'s
    error path, from Dan.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-17 10:23:49 -08:00
Linus Torvalds
cbb104f91d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc scheduler fixes from Ingo Molnar:

 - Fix potential deadlock under CONFIG_DEBUG_OBJECTS=y

 - PELT metrics update ordering fix

 - uclamp logic fix

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/uclamp: Fix incorrect condition
  sched/pelt: Fix update of blocked PELT ordering
  sched/core: Avoid spurious lock dependencies
2019-11-17 08:30:38 -08:00
Valentin Schneider
7763baace1 sched/uclamp: Fix overzealous type replacement
Some uclamp helpers had their return type changed from 'unsigned int' to
'enum uclamp_id' by commit

  0413d7f33e ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")

but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
range, which should really be unsigned int. The affected helpers are
uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.

Note that this doesn't lead to any obj diff using a relatively recent
aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
anything funny. I'm still marking this as fixing the above commit to be on
the safe side.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: patrick.bellasi@matbug.net
Cc: qperret@google.com
Cc: surenb@google.com
Cc: tj@kernel.org
Fixes: 0413d7f33e ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-17 10:46:05 +01:00
David S. Miller
19b7e21c55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Lots of overlapping changes and parallel additions, stuff
like that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16 21:51:42 -08:00
Linus Torvalds
3278b3b678 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix integer truncation bug in __do_adjtimex()"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ntp/y2038: Remove incorrect time_t truncation
2019-11-16 16:08:46 -08:00
Linus Torvalds
5ffaf037e7 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes: a handful of AUX event handling related fixes, a Sparse
  fix and two ABI fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix missing static inline on perf_cgroup_switch()
  perf/core: Consistently fail fork on allocation failures
  perf/aux: Disallow aux_output for kernel events
  perf/core: Reattach a misplaced comment
  perf/aux: Fix the aux_output group inheritance fix
  perf/core: Disallow uncore-cgroup events
2019-11-16 15:56:01 -08:00
Marco Elver
0ebba7141e build, kcsan: Add KCSAN build exceptions
This blacklists several compilation units from KCSAN. See the respective
inline comments for the reasoning.

Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-11-16 07:23:14 -08:00
Marco Elver
dfd402a4c4 kcsan: Add Kernel Concurrency Sanitizer infrastructure
Kernel Concurrency Sanitizer (KCSAN) is a dynamic data-race detector for
kernel space. KCSAN is a sampling watchpoint-based data-race detector.
See the included Documentation/dev-tools/kcsan.rst for more details.

This patch adds basic infrastructure, but does not yet enable KCSAN for
any architecture.

Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-11-16 07:23:13 -08:00
Maulik Shah
4a169a95d8 genirq: Introduce irq_chip_get/set_parent_state calls
On certain QTI chipsets some GPIOs are direct-connect interrupts to the
GIC to be used as regular interrupt lines. When the GPIOs are not used
for interrupt generation the interrupt line is disabled. But disabling
the interrupt at GIC does not prevent the interrupt to be reported as
pending at GIC_ISPEND. Later, when drivers call enable_irq() on the
interrupt, an unwanted interrupt occurs.

Introduce get and set methods for irqchip's parent to clear it's pending
irq state. This then can be invoked by the GPIO interrupt controller on
the parents in it hierarchy to clear the interrupt before enabling the
interrupt.

Signed-off-by: Maulik Shah <mkshah@codeaurora.org>
Signed-off-by: Lina Iyer <ilina@codeaurora.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/1573855915-9841-7-git-send-email-ilina@codeaurora.org

[updated commit text and minor code fixes]
2019-11-16 10:20:02 +00:00
Adrian Reber
49cb2fc42c fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-15 23:49:22 +01:00
Alexei Starovoitov
5b92a28aae bpf: Support attaching tracing BPF program to other BPF programs
Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
including their subprograms. This feature allows snooping on input and output
packets in XDP, TC programs including their return values. In order to do that
the verifier needs to track types not only of vmlinux, but types of other BPF
programs as well. The verifier also needs to translate uapi/linux/bpf.h types
used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
BPF programs. In some cases LLVM optimizations can remove arguments from BPF
subprograms without adjusting BTF info that LLVM backend knows. When BTF info
disagrees with actual types that the verifiers sees the BPF trampoline has to
fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT
program can still attach to such subprograms, but it won't be able to recognize
pointer types like 'struct sk_buff *' and it won't be able to pass them to
bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
would need to use bpf_probe_read_kernel() instead.

The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set
to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd
points to previously loaded BPF program the attach_btf_id is BTF type id of
main function or one of its subprograms.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
2019-11-15 23:45:24 +01:00
Alexei Starovoitov
8c1b6e69dc bpf: Compare BTF types of functions arguments with actual types
Make the verifier check that BTF types of function arguments match actual types
passed into top-level BPF program and into BPF-to-BPF calls. If types match
such BPF programs and sub-programs will have full support of BPF trampoline. If
types mismatch the trampoline has to be conservative. It has to save/restore
five program arguments and assume 64-bit scalars.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org
2019-11-15 23:45:02 +01:00
Alexei Starovoitov
91cc1a9974 bpf: Annotate context types
Annotate BPF program context types with program-side type and kernel-side type.
This type information is used by the verifier. btf_get_prog_ctx_type() is
used in the later patches to verify that BTF type of ctx in BPF program matches to
kernel expected ctx type. For example, the XDP program type is:
BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff)
That means that XDP program should be written as:
int xdp_prog(struct xdp_md *ctx) { ... }

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org
2019-11-15 23:44:48 +01:00
Alexei Starovoitov
9cc31b3a09 bpf: Fix race in btf_resolve_helper_id()
btf_resolve_helper_id() caching logic is a bit racy, since under root the
verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE.
Fix the type as well, since error is also recorded.

Fixes: a7658e1a41 ("bpf: Check types of arguments passed into helpers")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org
2019-11-15 23:44:20 +01:00
Alexei Starovoitov
fec56f5890 bpf: Introduce BPF trampoline
Introduce BPF trampoline concept to allow kernel code to call into BPF programs
with practically zero overhead.  The trampoline generation logic is
architecture dependent.  It's converting native calling convention into BPF
calling convention.  BPF ISA is 64-bit (even on 32-bit architectures). The
registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
program accepts only single argument "ctx" in R1. Whereas CPU native calling
convention is different. x86-64 is passing first 6 arguments in registers
and the rest on the stack. x86-32 is passing first 3 arguments in registers.
sparc64 is passing first 6 in registers. And so on.

The trampolines between BPF and kernel already exist.  BPF_CALL_x macros in
include/linux/filter.h statically compile trampolines from BPF into kernel
helpers. They convert up to five u64 arguments into kernel C pointers and
integers. On 64-bit architectures this BPF_to_kernel trampolines are nops. On
32-bit architecture they're meaningful.

The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and
__bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
kernel function arguments into array of u64s that BPF program consumes via
R1=ctx pointer.

This patch set is doing the same job as __bpf_trace_##call() static
trampolines, but dynamically for any kernel function. There are ~22k global
kernel functions that are attachable via nop at function entry. The function
arguments and types are described in BTF.  The job of btf_distill_func_proto()
function is to extract useful information from BTF into "function model" that
architecture dependent trampoline generators will use to generate assembly code
to cast kernel function arguments into array of u64s.  For example the kernel
function eth_type_trans has two pointers. They will be casted to u64 and stored
into stack of generated trampoline. The pointer to that stack space will be
passed into BPF program in R1. On x86-64 such generated trampoline will consume
16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will
make sure that only two u64 are accessed read-only by BPF program. The verifier
will also recognize the precise type of the pointers being accessed and will
not allow typecasting of the pointer to a different type within BPF program.

The tracing use case in the datacenter demonstrated that certain key kernel
functions have (like tcp_retransmit_skb) have 2 or more kprobes that are always
active.  Other functions have both kprobe and kretprobe.  So it is essential to
keep both kernel code and BPF programs executing at maximum speed. Hence
generated BPF trampoline is re-generated every time new program is attached or
detached to maintain maximum performance.

To avoid the high cost of retpoline the attached BPF programs are called
directly. __bpf_prog_enter/exit() are used to support per-program execution
stats.  In the future this logic will be optimized further by adding support
for bpf_stats_enabled_key inside generated assembly code. Introduction of
preemptible and sleepable BPF programs will completely remove the need to call
to __bpf_prog_enter/exit().

Detach of a BPF program from the trampoline should not fail. To avoid memory
allocation in detach path the half of the page is used as a reserve and flipped
after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
which is enough for BPF tracing use cases. This limit can be increased in the
future.

BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
kprobe BPF program remembers function arguments in a map while kretprobe
fetches arguments from a map and analyzes them together with return value.
BPF_TRACE_FEXIT accelerates this typical use case.

Recursion prevention for kprobe BPF programs is done via per-cpu
bpf_prog_active counter. In practice that turned out to be a mistake. It
caused programs to randomly skip execution. The tracing tools missed results
they were looking for. Hence BPF trampoline doesn't provide builtin recursion
prevention. It's a job of BPF program itself and will be addressed in the
follow up patches.

BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
in the future. For example to remove retpoline cost from XDP programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
2019-11-15 23:41:51 +01:00
Alexei Starovoitov
5964b2000f bpf: Add bpf_arch_text_poke() helper
Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch
nops/calls in kernel text into calls into BPF trampoline and to patch
calls/nops inside BPF programs too.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org
2019-11-15 23:41:28 +01:00
Ilya Leoshkevich
b7b3fc8dd9 bpf: Support doubleword alignment in bpf_jit_binary_alloc
Currently passing alignment greater than 4 to bpf_jit_binary_alloc does
not work: in such cases it silently aligns only to 4 bytes.

On s390, in order to load a constant from memory in a large (>512k) BPF
program, one must use lgrl instruction, whose memory operand must be
aligned on an 8-byte boundary.

This patch makes it possible to request 8-byte alignment from
bpf_jit_binary_alloc, and also makes it issue a warning when an
unsupported alignment is requested.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191115123722.58462-1-iii@linux.ibm.com
2019-11-15 22:25:00 +01:00
Sebastian Andrzej Siewior
49e9d1a9fa workqueue: Add RCU annotation for pwq list walk
An additional check has been recently added to ensure that a RCU related lock
is held while the RCU list is iterated.
The `pwqs' are sometimes iterated without a RCU lock but with the &wq->mutex
acquired leading to a warning.

Teach list_for_each_entry_rcu() that the RCU usage is okay if &wq->mutex
is acquired during the list traversal.

Fixes: 28875945ba ("rcu: Add support for consolidated-RCU reader checking")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-15 11:53:35 -08:00
Steven Rostedt (VMware)
128161f47b ftrace: Add helper find_direct_entry() to consolidate code
Both unregister_ftrace_direct() and modify_ftrace_direct() needs to
normalize the ip passed in to match the rec->ip, as it is acceptable to have
the ip on the ftrace call site but not the start. There are also common
validity checks with the record found by the ip, these should be done for
both unregister_ftrace_direct() and modify_ftrace_direct().

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:25:08 -05:00
Steven Rostedt (VMware)
406acdd32d ftrace: Add another check for match in register_ftrace_direct()
As an instruction pointer passed into register_ftrace_direct() may just
exist on the ftrace call site, but may not be the start of the call site
itself, register_ftrace_direct() still needs to update test if a direct call
exists on the normalized site, as only one direct call is allowed at any one
time.

Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:24:46 -05:00
Steven Rostedt (VMware)
1c7f9b673d ftrace: Fix accounting bug with direct->count in register_ftrace_direct()
The direct->count wasn't being updated properly, where it only was updated
when the first entry was added, but should be updated every time.

Fixes: 013bf0da04 ("ftrace: Add ftrace_find_direct_func()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:23:55 -05:00
Yang Tao
ca16d5bee5 futex: Prevent robust futex exit race
Robust futexes utilize the robust_list mechanism to allow the kernel to
release futexes which are held when a task exits. The exit can be voluntary
or caused by a signal or fault. This prevents that waiters block forever.

The futex operations in user space store a pointer to the futex they are
either locking or unlocking in the op_pending member of the per task robust
list.

After a lock operation has succeeded the futex is queued in the robust list
linked list and the op_pending pointer is cleared.

After an unlock operation has succeeded the futex is removed from the
robust list linked list and the op_pending pointer is cleared.

The robust list exit code checks for the pending operation and any futex
which is queued in the linked list. It carefully checks whether the futex
value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
tries to wake up a potential waiter.

This is race free for the lock operation but unlock has two race scenarios
where waiters might not be woken up. These issues can be observed with
regular robust pthread mutexes. PI aware pthread mutexes are not affected.

(1) Unlocking task is killed after unlocking the futex value in user space
    before being able to wake a waiter.

        pthread_mutex_unlock()
                |
                V
        atomic_exchange_rel (&mutex->__data.__lock, 0)
                        <------------------------killed
            lll_futex_wake ()                   |
                                                |
                                                |(__lock = 0)
                                                |(enter kernel)
                                                |
                                                V
                                            do_exit()
                                            exit_mm()
                                          mm_release()
                                        exit_robust_list()
                                        handle_futex_death()
                                                |
                                                |(__lock = 0)
                                                |(uval = 0)
                                                |
                                                V
        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                return 0;

    The sanity check which ensures that the user space futex is owned by
    the exiting task prevents the wakeup of waiters which in consequence
    block infinitely.

(2) Waiting task is killed after a wakeup and before it can acquire the
    futex in user space.

        OWNER                         WAITER
				futex_wait()      		
   pthread_mutex_unlock()               |
                |                       |
                |(__lock = 0)           |
                |                       |
                V                       |
         futex_wake() ------------>  wakeup()
                                        |
                                        |(return to userspace)
                                        |(__lock = 0)
                                        |
                                        V
                        oldval = mutex->__data.__lock
                                          <-----------------killed
    atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,  |
                        id | assume_other_futex_waiters, 0)      |
                                                                 |
                                                                 |
                                                   (enter kernel)|
                                                                 |
                                                                 V
                                                         do_exit()
                                                        |
                                                        |
                                                        V
                                        handle_futex_death()
                                        |
                                        |(__lock = 0)
                                        |(uval = 0)
                                        |
                                        V
        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                return 0;

    The sanity check which ensures that the user space futex is owned
    by the exiting task prevents the wakeup of waiters, which seems to
    be correct as the exiting task does not own the futex value, but
    the consequence is that other waiters wont be woken up and block
    infinitely.

In both scenarios the following conditions are true:

   - task->robust_list->list_op_pending != NULL
   - user space futex value == 0
   - Regular futex (not PI)

If these conditions are met then it is reasonably safe to wake up a
potential waiter in order to prevent the above problems.

As this might be a false positive it can cause spurious wakeups, but the
waiter side has to handle other types of unrelated wakeups, e.g. signals
gracefully anyway. So such a spurious wakeup will not affect the
correctness of these operations.

This workaround must not touch the user space futex value and cannot set
the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
OWNER_DIED in this case would result in inconsistent state and subsequently
in malfunction of the owner died handling in user space.

The rest of the user space state is still consistent as no other task can
observe the list_op_pending entry in the exiting tasks robust list.

The eventually woken up waiter will observe the uncontended lock value and
take it over.

[ tglx: Massaged changelog and comment. Made the return explicit and not
  	depend on the subsequent check and added constants to hand into
  	handle_futex_death() instead of plain numbers. Fixed a few coding
	style issues. ]

Fixes: 0771dfefc9 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Yang Tao <yang.tao172@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
2019-11-15 19:10:49 +01:00
Linus Torvalds
b4c0800e42 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs fixes from Al Viro:
 "Assorted fixes all over the place; some of that is -stable fodder,
  some regressions from the last window"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
  ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
  ecryptfs: fix unlink and rmdir in face of underlying fs modifications
  audit_get_nd(): don't unlock parent too early
  exportfs_decode_fh(): negative pinned may become positive without the parent locked
  cgroup: don't put ERR_PTR() into fc->root
  autofs: fix a leak in autofs_expire_indirect()
  aio: Fix io_pgetevents() struct __compat_aio_sigset layout
  fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15 08:44:08 -08:00
Artem Bityutskiy
58a74a2925 tracing: Increase SYNTH_FIELDS_MAX for synthetic_events
Increase the maximum allowed count of synthetic event fields from 16 to 32
in order to allow for larger-than-usual events.

Link: http://lkml.kernel.org/r/20191115091730.9192-1-dedekind1@gmail.com

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 11:30:29 -05:00
Arnd Bergmann
942437c97f y2038: allow disabling time32 system calls
At the moment, the compilation of the old time32 system calls depends
purely on the architecture. As systems with new libc based on 64-bit
time_t are getting deployed, even architectures that previously supported
these (notably x86-32 and arm32 but also many others) no longer depend on
them, and removing them from a kernel image results in a smaller kernel
binary, the same way we can leave out many other optional system calls.

More importantly, on an embedded system that needs to keep working
beyond year 2038, any user space program calling these system calls
is likely a bug, so removing them from the kernel image does provide
an extra debugging help for finding broken applications.

I've gone back and forth on hiding this option unless CONFIG_EXPERT
is set. This version leaves it visible based on the logic that
eventually it will be turned off indefinitely.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
bd40a17576 y2038: itimer: change implementation to timespec64
There is no 64-bit version of getitimer/setitimer since that is not
actually needed. However, the implementation is built around the
deprecated 'struct timeval' type.

Change the code to use timespec64 internally to reduce the dependencies
on timeval and associated helper functions.

Minor adjustments in the code are needed to make the native and compat
version work the same way, and to keep the range check working after
the conversion.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
ddbc7d0657 y2038: move itimer reset into itimer.c
Preparing for a change to the itimer internals, stop using the
do_setitimer() symbol and instead use a new higher-level interface.

The do_getitimer()/do_setitimer functions can now be made static,
allowing the compiler to potentially produce better object code.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
4c22ea2b91 y2038: use compat_{get,set}_itimer on alpha
The itimer handling for the old alpha osf_setitimer/osf_getitimer
system calls is identical to the compat version of getitimer/setitimer,
so just use those directly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
c1745f84be y2038: itimer: compat handling to itimer.c
The structure is only used in one place, moving it there simplifies the
interface and helps with later changes to this code.

Rename it to match the other time32 structures in the process.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
5e0fb1b57b y2038: time: avoid timespec usage in settimeofday()
The compat_get_timeval() and timeval_valid() interfaces are deprecated
and getting removed along with the definition of struct timeval itself.

Change the two implementations of the settimeofday() system call to
open-code these helpers and completely avoid references to timeval.

The timeval_valid() call is not needed any more here, only a check to
avoid overflowing tv_nsec during the multiplication, as there is another
range check in do_sys_settimeofday64().

Tested-by: syzbot+dccce9b26ba09ca49966@syzkaller.appspotmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
75d319c06e y2038: syscalls: change remaining timeval to __kernel_old_timeval
All of the remaining syscalls that pass a timeval (gettimeofday, utime,
futimesat) can trivially be changed to pass a __kernel_old_timeval
instead, which has a compatible layout, but avoids ambiguity with
the timeval type in user space.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
bdd565f817 y2038: rusage: use __kernel_old_timeval
There are two 'struct timeval' fields in 'struct rusage'.

Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.

While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.

In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.

Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
2a785996cc y2038: uapi: change __kernel_time_t to __kernel_old_time_t
This is mainly a patch for clarification, and to let us remove
the time_t definition from the kernel to prevent new users from
creeping in that might not be y2038-safe.

All remaining uses of 'time_t' or '__kernel_time_t' are part of
the user API that cannot be changed by that either have a
replacement or that do not suffer from the y2038 overflow.

Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
3ca47e958a y2038: remove CONFIG_64BIT_TIME
The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:27 +01:00
Like Xu
52ba4b0b99 perf/core: Provide a kernel-internal interface to pause perf_event
Exporting perf_event_pause() as an external accessor for kernel users (such
as KVM) who may do both disable perf_event and read count with just one
time to hold perf_event_ctx_lock. Also the value could be reset optionally.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 11:44:07 +01:00
Like Xu
3ca270fc9e perf/core: Provide a kernel-internal interface to recalibrate event period
Currently, perf_event_period() is used by user tools via ioctl. Based on
naming convention, exporting perf_event_period() for kernel users (such
as KVM) who may recalibrate the event period for their assigned counter
according to their requirements.

The perf_event_period() is an external accessor, just like the
perf_event_{en,dis}able() and should thus use perf_event_ctx_lock().

Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 11:44:06 +01:00
Konstantin Khorenko
5d60331161 kernel/module.c: wakeup processes in module_wq on module unload
Fix the race between load and unload a kernel module.

sys_delete_module()
 try_stop_module()
  mod->state = _GOING
					add_unformed_module()
					 old = find_module_all()
					 (old->state == _GOING =>
					  wait_event_interruptible())

					 During pre-condition
					 finished_loading() rets 0
					 schedule()
					 (never gets waken up later)
 free_module()
  mod->state = _UNFORMED
   list_del_rcu(&mod->list)
   (dels mod from "modules" list)

return

The race above leads to modprobe hanging forever on loading
a module.

Error paths on loading module call wake_up_all(&module_wq) after
freeing module, so let's do the same on straight module unload.

Fixes: 6e6de3dee5 ("kernel/module.c: Only return -EEXIST for modules that have finished loading")
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-11-15 11:23:12 +01:00
Qais Yousef
6e1ff0773f sched/uclamp: Fix incorrect condition
uclamp_update_active() should perform the update when
p->uclamp[clamp_id].active is true. But when the logic was inverted in
[1], the if condition wasn't inverted correctly too.

[1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/

Reported-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: babbe170e0 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-15 11:02:18 +01:00
luanshi
20a15ee040 genirq: Fix function documentation of __irq_alloc_descs()
The function got renamed at some point, but the kernel-doc was not updated.

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1573656093-8643-1-git-send-email-zhangliguang@linux.alibaba.com
2019-11-15 10:48:38 +01:00
Frederic Weisbecker
e9838bd511 irq_work: Fix IRQ_WORK_BUSY bit clearing
While attempting to clear the busy bit at the end of a work execution,
atomic_cmpxchg() expects the value of the flags with the pending bit
cleared as the old value. However by mistake the value of the flags is
passed without clearing the pending bit first.

As a result, clearing the busy bit fails and irq_work_sync() may stall:

 watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [blktrace:4948]
 CPU: 0 PID: 4948 Comm: blktrace Not tainted 5.4.0-rc7-00003-gfeb4a51323bab #1
 RIP: 0010:irq_work_sync+0x4/0x10
 Call Trace:
  relay_close_buf+0x19/0x50
  relay_close+0x64/0x100
  blk_trace_free+0x1f/0x50
  __blk_trace_remove+0x1e/0x30
  blk_trace_ioctl+0x11b/0x140
  blkdev_ioctl+0x6c1/0xa40
  block_ioctl+0x39/0x40
  do_vfs_ioctl+0xa5/0x700
  ksys_ioctl+0x70/0x80
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x5b/0x1d0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

So clear the appropriate bit before passing the old flags to cmpxchg().

Fixes: feb4a51323 ("irq_work: Slightly simplify IRQ_WORK_PENDING clearing")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Reported-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Leonard Crestez <leonard.crestez@nxp.com>
Link: https://lkml.kernel.org/r/20191113171201.14032-1-frederic@kernel.org
2019-11-15 10:48:37 +01:00
Steven Rostedt (VMware)
0567d68091 ftrace: Add modify_ftrace_direct()
Add a new function modify_ftrace_direct() that will allow a user to update
an existing direct caller to a new trampoline, without missing hits due to
unregistering one and then adding another.

Link: https://lore.kernel.org/r/20191109022907.6zzo6orhxpt5n2sv@ast-mbp.dhcp.thefacebook.com

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 22:45:47 -05:00
Steven Rostedt (VMware)
36b3615dc3 tracing: Add missing "inline" in stub function of latency_fsnotify()
The latency_fsnotify() stub when the function is not defined, was missing
the "inline".

Link: https://lore.kernel.org/r/20191115140213.74c5efe7@canb.auug.org.au

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 22:45:47 -05:00
Tejun Heo
d749534322 cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
743210386c ("cgroup: use cgrp->kn->id as the cgroup ID") added WARN
which triggers if cgroup_id(root_cgrp) is not 1.  This is fine on
64bit ino archs but on 32bit archs cgroup ID is ((gen << 32) | ino)
and gen starts at 1, so the root id is 0x1_0000_0001 instead of 1
always triggering the WARN.

What we wanna make sure is that the ino part is 1.  Fix it.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: 743210386c ("cgroup: use cgrp->kn->id as the cgroup ID")
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-14 14:46:51 -08:00
Borislav Petkov
9b4712044d tracing: Remove stray tab in TRACE_EVAL_MAP_FILE's help text
There was a stray tab in the help text of the aforementioned config
option which showed like this:

  The "print fmt" of the trace events will show the enum/sizeof names
  instead        of their values. This can cause problems for user space tools
  ...

in menuconfig. Remove it and end a sentence with a fullstop.

No functional changes.

Link: http://lkml.kernel.org/r/20191112174219.10933-1-bp@alien8.de

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:12 -05:00
Piotr Maziarz
ef56e047b2 tracing: Use seq_buf_hex_dump() to dump buffers
Without this, buffers can be printed with __print_array macro that has
no formatting options and can be hard to read. The other way is to
mimic formatting capability with multiple calls of trace event with one
call per row which gives performance impact and different timestamp in
each row.

Link: http://lkml.kernel.org/r/1573130738-29390-2-git-send-email-piotrx.maziarz@linux.intel.com

Signed-off-by: Piotr Maziarz <piotrx.maziarz@linux.intel.com>
Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:12 -05:00
Masami Hiramatsu
c7411a1a12 tracing/kprobe: Check whether the non-suffixed symbol is notrace
Check whether the non-suffixed symbol is notrace, since suffixed
symbols are generated by the compilers for optimization. Based on
these suffixed symbols, notrace check might not work because
some of them are just a partial code of the original function.
(e.g. cold-cache (unlikely) code is separated from original
 function as FUNCTION.cold.XX)

For example, without this fix,
  # echo p device_add.cold.67 > /sys/kernel/debug/tracing/kprobe_events
  sh: write error: Invalid argument

  # cat /sys/kernel/debug/tracing/error_log
  [  135.491035] trace_kprobe: error: Failed to register probe event
    Command: p device_add.cold.67
               ^
  # dmesg | tail -n 1
  [  135.488599] trace_kprobe: Could not probe notrace function device_add.cold.67

With this,
  # echo p device_add.cold.66 > /sys/kernel/debug/tracing/kprobe_events
  # cat /sys/kernel/debug/kprobes/list
  ffffffff81599de9  k  device_add.cold.66+0x0    [DISABLED]

Actually, kprobe blacklist already did similar thing,
see within_kprobe_blacklist().

Link: http://lkml.kernel.org/r/157233790394.6706.18243942030937189679.stgit@devnote2

Fixes: 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:12 -05:00
Yuming Han
6ee40511cb tracing: use kvcalloc for tgid_map array allocation
Fail to allocate memory for tgid_map, because it requires order-6 page.
detail as:

c3 sh: page allocation failure: order:6,
   mode:0x140c0c0(GFP_KERNEL), nodemask=(null)
c3 sh cpuset=/ mems_allowed=0
c3 CPU: 3 PID: 5632 Comm: sh Tainted: G        W  O    4.14.133+ #10
c3 Hardware name: Generic DT based system
c3 Backtrace:
c3 [<c010bdbc>] (dump_backtrace) from [<c010c08c>](show_stack+0x18/0x1c)
c3 [<c010c074>] (show_stack) from [<c0993c54>](dump_stack+0x84/0xa4)
c3 [<c0993bd0>] (dump_stack) from [<c0229858>](warn_alloc+0xc4/0x19c)
c3 [<c0229798>] (warn_alloc) from [<c022a6e4>](__alloc_pages_nodemask+0xd18/0xf28)
c3 [<c02299cc>] (__alloc_pages_nodemask) from [<c0248344>](kmalloc_order+0x20/0x38)
c3 [<c0248324>] (kmalloc_order) from [<c0248380>](kmalloc_order_trace+0x24/0x108)
c3 [<c024835c>] (kmalloc_order_trace) from [<c01e6078>](set_tracer_flag+0xb0/0x158)
c3 [<c01e5fc8>] (set_tracer_flag) from [<c01e6404>](trace_options_core_write+0x7c/0xcc)
c3 [<c01e6388>] (trace_options_core_write) from [<c0278b1c>](__vfs_write+0x40/0x14c)
c3 [<c0278adc>] (__vfs_write) from [<c0278e10>](vfs_write+0xc4/0x198)
c3 [<c0278d4c>] (vfs_write) from [<c027906c>](SyS_write+0x6c/0xd0)
c3 [<c0279000>] (SyS_write) from [<c01079a0>](ret_fast_syscall+0x0/0x54)

Switch to use kvcalloc to avoid unexpected allocation failures.

Link: http://lkml.kernel.org/r/1571888070-24425-1-git-send-email-chunyan.zhang@unisoc.com

Signed-off-by: Yuming Han <yuming.han@unisoc.com>
Signed-off-by: Chunyan Zhang <chunyan.zhang@unisoc.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:11 -05:00
Srivatsa S. Bhat (VMware)
0c3c86bdc6 tracing/hwlat: Fix a few trivial nits
Update the source file name in the comments, and fix a grammatical
error.

Link: http://lkml.kernel.org/r/157073346821.17189.8946944856026592247.stgit@srivatsa-ubuntu

Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:11 -05:00
Andy Shevchenko
80042c8f06 tracing: Use generic type for comparator function
Comparator function type, cmp_func_t, is defined in the types.h,
use it in the code.

Link: http://lkml.kernel.org/r/20191007135656.37734-3-andriy.shevchenko@linux.intel.com

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:11 -05:00
Steven Rostedt (VMware)
b83b43ffc6 fgraph: Fix function type mismatches of ftrace_graph_return using ftrace_stub
The C compiler is allowing more checks to make sure that function pointers
are assigned to the correct prototype function. Unfortunately, the function
graph tracer uses a special name with its assigned ftrace_graph_return
function pointer that maps to a stub function used by the function tracer
(ftrace_stub). The ftrace_graph_return variable is compared to the
ftrace_stub in some archs to know if the function graph tracer is enabled or
not. This means we can not just simply create a new function stub that
compares it without modifying all the archs.

Instead, have the linker script create a function_graph_stub that maps to
ftrace_stub, and this way we can define the prototype for it to match the
prototype of ftrace_graph_return, and make the compiler checks all happy!

Link: http://lkml.kernel.org/r/20191015090055.789a0aed@gandalf.local.home

Cc: linux-sh@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc:  Rich Felker <dalias@libc.org>
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:13:43 -05:00
Divya Indi
953ae45a0c tracing: Adding NULL checks for trace_array descriptor pointer
As part of commit f45d1225ad ("tracing: Kernel access to Ftrace
instances") we exported certain functions. Here, we are adding some additional
NULL checks to ensure safe usage by users of these APIs.

Link: http://lkml.kernel.org/r/1565805327-579-4-git-send-email-divya.indi@oracle.com

Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Divya Indi
e585e6469d tracing: Verify if trace array exists before destroying it.
A trace array can be destroyed from userspace or kernel. Verify if the
trace array exists before proceeding to destroy/remove it.

Link: http://lkml.kernel.org/r/1565805327-579-3-git-send-email-divya.indi@oracle.com

Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Divya Indi <divya.indi@oracle.com>
[ Removed unneeded braces ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Divya Indi
2d6425af61 tracing: Declare newly exported APIs in include/linux/trace.h
Declare the newly introduced and exported APIs in the header file -
include/linux/trace.h. Moving previous declarations from
kernel/trace/trace.h to include/linux/trace.h.

Link: http://lkml.kernel.org/r/1565805327-579-2-git-send-email-divya.indi@oracle.com

Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Ben Dooks
6dff4d7dd3 tracing: Make internal ftrace events static
The event_class_ftrace_##call and event_##call do not seem
to be used outside of trace_export.c so make them both static
to avoid a number of sparse warnings:

kernel/trace/trace_entries.h:59:1: warning: symbol 'event_class_ftrace_function' was not declared. Should it be static?
kernel/trace/trace_entries.h:59:1: warning: symbol '__event_function' was not declared. Should it be static?
kernel/trace/trace_entries.h:77:1: warning: symbol 'event_class_ftrace_funcgraph_entry' was not declared. Should it be static?
kernel/trace/trace_entries.h:77:1: warning: symbol '__event_funcgraph_entry' was not declared. Should it be static?
kernel/trace/trace_entries.h:93:1: warning: symbol 'event_class_ftrace_funcgraph_exit' was not declared. Should it be static?
kernel/trace/trace_entries.h:93:1: warning: symbol '__event_funcgraph_exit' was not declared. Should it be static?
kernel/trace/trace_entries.h:129:1: warning: symbol 'event_class_ftrace_context_switch' was not declared. Should it be static?
kernel/trace/trace_entries.h:129:1: warning: symbol '__event_context_switch' was not declared. Should it be static?
kernel/trace/trace_entries.h:149:1: warning: symbol 'event_class_ftrace_wakeup' was not declared. Should it be static?
kernel/trace/trace_entries.h:149:1: warning: symbol '__event_wakeup' was not declared. Should it be static?
kernel/trace/trace_entries.h:171:1: warning: symbol 'event_class_ftrace_kernel_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:171:1: warning: symbol '__event_kernel_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:191:1: warning: symbol 'event_class_ftrace_user_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:191:1: warning: symbol '__event_user_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:214:1: warning: symbol 'event_class_ftrace_bprint' was not declared. Should it be static?
kernel/trace/trace_entries.h:214:1: warning: symbol '__event_bprint' was not declared. Should it be static?
kernel/trace/trace_entries.h:230:1: warning: symbol 'event_class_ftrace_print' was not declared. Should it be static?
kernel/trace/trace_entries.h:230:1: warning: symbol '__event_print' was not declared. Should it be static?
kernel/trace/trace_entries.h:247:1: warning: symbol 'event_class_ftrace_raw_data' was not declared. Should it be static?
kernel/trace/trace_entries.h:247:1: warning: symbol '__event_raw_data' was not declared. Should it be static?
kernel/trace/trace_entries.h:262:1: warning: symbol 'event_class_ftrace_bputs' was not declared. Should it be static?
kernel/trace/trace_entries.h:262:1: warning: symbol '__event_bputs' was not declared. Should it be static?
kernel/trace/trace_entries.h:277:1: warning: symbol 'event_class_ftrace_mmiotrace_rw' was not declared. Should it be static?
kernel/trace/trace_entries.h:277:1: warning: symbol '__event_mmiotrace_rw' was not declared. Should it be static?
kernel/trace/trace_entries.h:298:1: warning: symbol 'event_class_ftrace_mmiotrace_map' was not declared. Should it be static?
kernel/trace/trace_entries.h:298:1: warning: symbol '__event_mmiotrace_map' was not declared. Should it be static?
kernel/trace/trace_entries.h:322:1: warning: symbol 'event_class_ftrace_branch' was not declared. Should it be static?
kernel/trace/trace_entries.h:322:1: warning: symbol '__event_branch' was not declared. Should it be static?
kernel/trace/trace_entries.h:343:1: warning: symbol 'event_class_ftrace_hwlat' was not declared. Should it be static?
kernel/trace/trace_entries.h:343:1: warning: symbol '__event_hwlat' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191015121012.18824-1-ben.dooks@codethink.co.uk

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Sebastian Andrzej Siewior
9c34fc4b7e tracing: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Add additional header output for PREEMPT_RT.
Link: http://lkml.kernel.org/r/20191015191821.11479-34-bigeasy@linutronix.de

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Viktor Rosendahl (BMW)
793937236d preemptirq_delay_test: Add the burst feature and a sysfs trigger
This burst feature enables the user to generate a burst of
preempt/irqsoff latencies. This makes it possible to test whether we
are able to detect latencies that systematically occur very close to
each other.

The maximum burst size is 10. We also create 10 identical test
functions, so that we get 10 different backtraces; this is useful
when we want to test whether we can detect all the latencies in a
burst. Otherwise, there would be no easy way of differentiating
between which latency in a burst was captured by the tracer.

In addition, there is a sysfs trigger, so that it's not necessary to
reload the module to repeat the test. The trigger will appear as
/sys/kernel/preemptirq_delay_test/trigger in sysfs.

Link: http://lkml.kernel.org/r/20191008220824.7911-3-viktor.rosendahl@gmail.com

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Viktor Rosendahl (BMW) <viktor.rosendahl@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Viktor Rosendahl (BMW)
91edde2e6a ftrace: Implement fs notification for tracing_max_latency
This patch implements the feature that the tracing_max_latency file,
e.g. /sys/kernel/debug/tracing/tracing_max_latency will receive
notifications through the fsnotify framework when a new latency is
available.

One particularly interesting use of this facility is when enabling
threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
together with the preempt/irqsoff tracers. This makes it possible to
implement a user space program that can, with equal probability,
obtain traces of latencies that occur immediately after each other in
spite of the fact that the preempt/irqsoff tracers operate in overwrite
mode.

This facility works with the hwlat, preempt/irqsoff, and wakeup
tracers.

The tracers may call the latency_fsnotify() from places such as
__schedule() or do_idle(); this makes it impossible to call
queue_work() directly without risking a deadlock. The same would
happen with a softirq,  kernel thread or tasklet. For this reason we
use the irq_work mechanism to call queue_work().

This patch creates a new workqueue. The reason for doing this is that
I wanted to use the WQ_UNBOUND and WQ_HIGHPRI flags.  My thinking was
that WQ_UNBOUND might help with the latency in some important cases.

If we use:

queue_work(system_highpri_wq, &tr->fsnotify_work);

then the work will (almost) always execute on the same CPU but if we are
unlucky that CPU could be too busy while there could be another CPU in
the system that would be able to process the work soon enough.

queue_work_on() could be used to queue the work on another CPU but it
seems difficult to select the right CPU.

Link: http://lkml.kernel.org/r/20191008220824.7911-2-viktor.rosendahl@gmail.com

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Viktor Rosendahl (BMW) <viktor.rosendahl@gmail.com>
[ Added max() to have one compare for max latency ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Steven Rostedt (VMware)
da537f0aef ftrace: Add information on number of page groups allocated
Looking for ways to shrink the size of the dyn_ftrace structure, knowing the
information about how many pages and the number of groups of those pages, is
useful in working out the best ways to save on memory.

This adds one info print on how many groups of pages were used to allocate
the ftrace dyn_ftrace structures, and also shows the number of pages and
groups in the dyn_ftrace_total_info (which is used for debugging).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Steven Rostedt (VMware)
a3ad1a7e39 ftrace/x86: Add a counter to test function_graph with direct
As testing for direct calls from the function graph tracer adds a little
overhead (which is a lot when tracing every function), add a counter that
can be used to test if function_graph tracer needs to test for a direct
caller or not.

It would have been nicer if we could use a static branch, but the static
branch logic fails when used within the function graph tracer trampoline.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:36:49 -05:00
Steven Rostedt (VMware)
013bf0da04 ftrace: Add ftrace_find_direct_func()
As function_graph tracer modifies the return address to insert a trampoline
to trace the return of a function, it must be aware of a direct caller, as
when it gets called, the function's return address may not be at on the
stack where it expects. It may have to see if that return address points to
the a direct caller and adjust if it is.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:36:48 -05:00
Steven Rostedt (VMware)
763e34e74b ftrace: Add register_ftrace_direct()
Add the start of the functionality to allow other trampolines to use the
ftrace mcount/fentry/nop location. This adds two new functions:

 register_ftrace_direct() and unregister_ftrace_direct()

Both take two parameters: the first is the instruction address of where the
mcount/fentry/nop exists, and the second is the trampoline to have that
location called.

This will handle cases where ftrace is already used on that same location,
and will make it still work, where the registered direct called trampoline
will get called after all the registered ftrace callers are handled.

Currently, it will not allow for IP_MODIFY functions to be called at the
same locations, which include some kprobes and live kernel patching.

At this point, no architecture supports this. This is only the start of
implementing the framework.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:36:41 -05:00
Peter Zijlstra
cf25e24db6 time: Rename tsk->real_start_time to ->start_boottime
Since it stores CLOCK_BOOTTIME, not, as the name suggests,
CLOCK_REALTIME, let's rename ->real_start_time to ->start_bootime.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:09:49 +01:00
Dan Carpenter
c759bc47db locking/lockdep: Update the comment for __lock_release()
This changes "to the list" to "from the list" and also deletes the
obsolete comment about the "@nested" argument.

The "nested" argument was removed in this commit, earlier this year:

  5facae4f35 ("locking/lockdep: Remove unused @nested argument from lock_release()").

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20191104091252.GA31509@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:07:48 +01:00
Alexander Shishkin
a4faf00d99 perf/aux: Allow using AUX data in perf samples
AUX data can be used to annotate perf events such as performance counters
or tracepoints/breakpoints by including it in sample records when
PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
and profiling by providing, for example, a history of instruction flow
leading up to the event's overflow.

The implementation makes use of grouping an AUX event with all the events
that wish to take samples of the AUX data, such that the former is the
group leader. The samplees should also specify the desired size of the AUX
sample via attr.aux_sample_size.

AUX capable PMUs need to explicitly add support for sampling, because it
relies on a new callback to take a snapshot of the buffer without touching
the event states.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:06:14 +01:00
Qian Cai
deb0c3c29d perf/core: Fix unlock balance in perf_init_event()
Commit:

  66d258c5b0 ("perf/core: Optimize perf_init_event()")

introduced an unlock imbalance in perf_init_event() where it calls
"goto again" and then only repeat rcu_read_unlock().

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 66d258c5b0 ("perf/core: Optimize perf_init_event()")
Link: https://lkml.kernel.org/r/20191106052935.8352-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:06:13 +01:00
Ingo Molnar
fed4c9c681 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 11:04:43 +01:00
Ben Dooks (Codethink)
d00dbd2981 perf/core: Fix missing static inline on perf_cgroup_switch()
It looks like a "static inline" has been missed in front
of the empty definition of perf_cgroup_switch() under
certain configurations.

Fixes the following sparse warning:

  kernel/events/core.c:1035:1: warning: symbol 'perf_cgroup_switch' was not declared. Should it be static?

Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191106132527.19977-1-ben.dooks@codethink.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:44 +01:00
Alexander Shishkin
697d877849 perf/core: Consistently fail fork on allocation failures
Commit:

  313ccb9615 ("perf: Allocate context task_ctx_data for child event")

makes the inherit path skip over the current event in case of task_ctx_data
allocation failure. This, however, is inconsistent with allocation failures
in perf_event_alloc(), which would abort the fork.

Correct this by returning an error code on task_ctx_data allocation
failure and failing the fork in that case.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191105075702.60319-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:43 +01:00
Alexander Shishkin
dce5affb94 perf/aux: Disallow aux_output for kernel events
Commit

  ab43762ef0 ("perf: Allow normal events to output AUX data")

added 'aux_output' bit to the attribute structure, which relies on AUX
events and grouping, neither of which is supported for the kernel events.
This notwithstanding, attempts have been made to use it in the kernel
code, suggesting the necessity of an explicit hard -EINVAL.

Fix this by rejecting attributes with aux_output set for kernel events.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191030134731.5437-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:42 +01:00
Alexander Shishkin
f25d8ba9e1 perf/core: Reattach a misplaced comment
A comment is in a wrong place in perf_event_create_kernel_counter().
Fix that.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191030134731.5437-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:41 +01:00
Alexander Shishkin
00496fe5e0 perf/aux: Fix the aux_output group inheritance fix
Commit

  f733c6b508 ("perf/core: Fix inheritance of aux_output groups")

adds a NULL pointer dereference in case inherit_group() races with
perf_release(), which causes the below crash:

 > BUG: kernel NULL pointer dereference, address: 000000000000010b
 > #PF: supervisor read access in kernel mode
 > #PF: error_code(0x0000) - not-present page
 > PGD 3b203b067 P4D 3b203b067 PUD 3b2040067 PMD 0
 > Oops: 0000 [#1] SMP KASAN
 > CPU: 0 PID: 315 Comm: exclusive-group Tainted: G B 5.4.0-rc3-00181-g72e1839403cb-dirty #878
 > RIP: 0010:perf_get_aux_event+0x86/0x270
 > Call Trace:
 >  ? __perf_read_group_add+0x3b0/0x3b0
 >  ? __kasan_check_write+0x14/0x20
 >  ? __perf_event_init_context+0x154/0x170
 >  inherit_task_group.isra.0.part.0+0x14b/0x170
 >  perf_event_init_task+0x296/0x4b0

Fix this by skipping over events that are getting closed, in the
inheritance path.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f733c6b508 ("perf/core: Fix inheritance of aux_output groups")
Link: https://lkml.kernel.org/r/20191101151248.47327-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:40 +01:00
Peter Zijlstra
09f4e8f05d perf/core: Disallow uncore-cgroup events
While discussing uncore event scheduling, I noticed we do not in fact
seem to dis-allow making uncore-cgroup events. Such events make no
sense what so ever because the cgroup is a CPU local state where
uncore counts across a number of CPUs.

Disallow them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:16:39 +01:00
Vincent Guittot
b90f7c9d21 sched/pelt: Fix update of blocked PELT ordering
update_cfs_rq_load_avg() can call cpufreq_update_util() to trigger an
update of the frequency. Make sure that RT, DL and IRQ PELT signals have
been updated before calling cpufreq.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: dsmythies@telus.net
Cc: juri.lelli@redhat.com
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Fixes: 371bf42732 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e163 ("sched/dl: Add dl_rq utilization tracking")
Fixes: 91c27493e7 ("sched/irq: Add IRQ utilization tracking")
Link: https://lkml.kernel.org/r/1572434309-32512-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:01:31 +01:00
Peter Zijlstra
ff51ff84d8 sched/core: Avoid spurious lock dependencies
While seemingly harmless, __sched_fork() does hrtimer_init(), which,
when DEBUG_OBJETS, can end up doing allocations.

This then results in the following lock order:

  rq->lock
    zone->lock.rlock
      batched_entropy_u64.lock

Which in turn causes deadlocks when we do wakeups while holding that
batched_entropy lock -- as the random code does.

Solve this by moving __sched_fork() out from under rq->lock. This is
safe because nothing there relies on rq->lock, as also evident from the
other __sched_fork() callsite.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: bigeasy@linutronix.de
Cc: cl@linux.com
Cc: keescook@chromium.org
Cc: penberg@kernel.org
Cc: rientjes@google.com
Cc: thgarnie@google.com
Cc: tytso@mit.edu
Cc: will@kernel.org
Fixes: b7d5dc2107 ("random: add a spinlock_t to struct batched_entropy")
Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13 08:01:30 +01:00
Linus Torvalds
eb094f0696 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TSX Async Abort and iTLB Multihit mitigations from Thomas Gleixner:
 "The performance deterioration departement is not proud at all of
  presenting the seventh installment of speculation mitigations and
  hardware misfeature workarounds:

   1) TSX Async Abort (TAA) - 'The Annoying Affair'

      TAA is a hardware vulnerability that allows unprivileged
      speculative access to data which is available in various CPU
      internal buffers by using asynchronous aborts within an Intel TSX
      transactional region.

      The mitigation depends on a microcode update providing a new MSR
      which allows to disable TSX in the CPU. CPUs which have no
      microcode update can be mitigated by disabling TSX in the BIOS if
      the BIOS provides a tunable.

      Newer CPUs will have a bit set which indicates that the CPU is not
      vulnerable, but the MSR to disable TSX will be available
      nevertheless as it is an architected MSR. That means the kernel
      provides the ability to disable TSX on the kernel command line,
      which is useful as TSX is a truly useful mechanism to accelerate
      side channel attacks of all sorts.

   2) iITLB Multihit (NX) - 'No eXcuses'

      iTLB Multihit is an erratum where some Intel processors may incur
      a machine check error, possibly resulting in an unrecoverable CPU
      lockup, when an instruction fetch hits multiple entries in the
      instruction TLB. This can occur when the page size is changed
      along with either the physical address or cache type. A malicious
      guest running on a virtualized system can exploit this erratum to
      perform a denial of service attack.

      The workaround is that KVM marks huge pages in the extended page
      tables as not executable (NX). If the guest attempts to execute in
      such a page, the page is broken down into 4k pages which are
      marked executable. The workaround comes with a mechanism to
      recover these shattered huge pages over time.

  Both issues come with full documentation in the hardware
  vulnerabilities section of the Linux kernel user's and administrator's
  guide.

  Thanks to all patch authors and reviewers who had the extraordinary
  priviledge to be exposed to this nuisance.

  Special thanks to Borislav Petkov for polishing the final TAA patch
  set and to Paolo Bonzini for shepherding the KVM iTLB workarounds and
  providing also the backports to stable kernels for those!"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation/taa: Fix printing of TAA_MSG_SMT on IBRS_ALL CPUs
  Documentation: Add ITLB_MULTIHIT documentation
  kvm: x86: mmu: Recovery of shattered NX large pages
  kvm: Add helper function for creating VM worker threads
  kvm: mmu: ITLB_MULTIHIT mitigation
  cpu/speculation: Uninline and export CPU mitigations helpers
  x86/cpu: Add Tremont to the cpu vulnerability whitelist
  x86/bugs: Add ITLB_MULTIHIT bug infrastructure
  x86/tsx: Add config options to set tsx=on|off|auto
  x86/speculation/taa: Add documentation for TSX Async Abort
  x86/tsx: Add "auto" option to the tsx= cmdline parameter
  kvm/x86: Export MDS_NO=0 to guests when TSX is enabled
  x86/speculation/taa: Add sysfs reporting for TSX Async Abort
  x86/speculation/taa: Add mitigation for TSX Async Abort
  x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default
  x86/cpu: Add a helper function x86_read_arch_cap_msr()
  x86/msr: Add the IA32_TSX_CTRL MSR
2019-11-12 10:53:24 -08:00
Tejun Heo
743210386c cgroup: use cgrp->kn->id as the cgroup ID
cgroup ID is currently allocated using a dedicated per-hierarchy idr
and used internally and exposed through tracepoints and bpf.  This is
confusing because there are tracepoints and other interfaces which use
the cgroupfs ino as IDs.

The preceding changes made kn->id exposed as ino as 64bit ino on
supported archs or ino+gen (low 32bits as ino, high gen).  There's no
reason for cgroup to use different IDs.  The kernfs IDs are unique and
userland can easily discover them and map them back to paths using
standard file operations.

This patch replaces cgroup IDs with kernfs IDs.

* cgroup_id() is added and all cgroup ID users are converted to use it.

* kernfs_node creation is moved to earlier during cgroup init so that
  cgroup_id() is available during init.

* While at it, s/cgroup/cgrp/ in psi helpers for consistency.

* Fallback ID value is changed to 1 to be consistent with root cgroup
  ID.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
40430452fd kernfs: use 64bit inos if ino_t is 64bit
Each kernfs_node is identified with a 64bit ID.  The low 32bit is
exposed as ino and the high gen.  While this already allows using inos
as keys by looking up with wildcard generation number of 0, it's
adding unnecessary complications for 64bit ino archs which can
directly use kernfs_node IDs as inos to uniquely identify each cgroup
instance.

This patch exposes IDs directly as inos on 64bit ino archs.  The
conversion is mostly straight-forward.

* 32bit ino archs behave the same as before.  64bit ino archs now use
  the whole 64bit ID as ino and the generation number is fixed at 1.

* 64bit inos still use the same idr allocator which gurantees that the
  lower 32bits identify the current live instance uniquely and the
  high 32bits are incremented whenever the low bits wrap.  As the
  upper 32bits are no longer used as gen and we don't wanna start ino
  allocation with 33rd bit set, the initial value for highbits
  allocation is changed to 0 on 64bit ino archs.

* blktrace exposes two 32bit numbers - (INO,GEN) pair - to identify
  the issuing cgroup.  Userland builds FILEID_INO32_GEN fids from
  these numbers to look up the cgroups.  To remain compatible with the
  behavior, always output (LOW32,HIGH32) which will be constructed
  back to the original 64bit ID by __kernfs_fh_to_dentry().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
fe0f726c9f kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
kernfs_find_and_get_node_by_ino() looks the kernfs_node matching the
specified ino.  On top of that, kernfs_get_node_by_id() and
kernfs_fh_get_inode() implement full ID matching by testing the rest
of ID.

On surface, confusingly, the two are slightly different in that the
latter uses 0 gen as wildcard while the former doesn't - does it mean
that the latter can't uniquely identify inodes w/ 0 gen?  In practice,
this is a distinction without a difference because generation number
starts at 1.  There are no actual IDs with 0 gen, so it can always
safely used as wildcard.

Let's simplify the code by renaming kernfs_find_and_get_node_by_ino()
to kernfs_find_and_get_node_by_id(), moving all lookup logics into it,
and removing now unnecessary kernfs_get_node_by_id().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-12 08:18:04 -08:00
Tejun Heo
67c0496e87 kernfs: convert kernfs_node->id from union kernfs_node_id to u64
kernfs_node->id is currently a union kernfs_node_id which represents
either a 32bit (ino, gen) pair or u64 value.  I can't see much value
in the usage of the union - all that's needed is a 64bit ID which the
current code is already limited to.  Using a union makes the code
unnecessarily complicated and prevents using 64bit ino without adding
practical benefits.

This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
ino is stored in the lower 32bits and gen upper.  Accessors -
kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
ino and gen.  This simplifies ID handling less cumbersome and will
allow using 64bit inos on supported archs.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexei Starovoitov <ast@kernel.org>
2019-11-12 08:18:03 -08:00
Srivatsa S. Bhat (VMware)
d61ca3c25e sched/Kconfig: Fix spelling mistake in user-visible help text
Fix a spelling mistake in the help text for PREEMPT_RT.

Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/157204450499.10518.4542293884417101528.stgit@srivatsa-ubuntu
2019-11-12 11:35:32 +01:00
Mukesh Ojha
1d6acc18fe time: Fix spelling mistake in comment
witin => within

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1571124819-9639-1-git-send-email-mojha@codeaurora.org
2019-11-12 11:30:46 +01:00
Arnd Bergmann
20d087368d time: Optimize ns_to_timespec64()
ns_to_timespec64() calls div_s64_rem(), which is a rather slow function on
32-bit architectures, as it cannot take advantage of the do_div()
optimizations for constant arguments.

Open-code the div_s64_rem() function in ns_to_timespec64(), so a constant
divider can be passed into the optimized div_u64_rem() function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191108203435.112759-3-arnd@arndb.de
2019-11-12 08:15:15 +01:00
Arnd Bergmann
2f5841349d ntp/y2038: Remove incorrect time_t truncation
A cast to 'time_t' was accidentally left in place during the
conversion of __do_adjtimex() to 64-bit timestamps, so the
resulting value is incorrectly truncated.

Remove the cast so the 64-bit time gets propagated correctly.

Fixes: ead25417f8 ("timex: use __kernel_timex internally")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20191108203435.112759-2-arnd@arndb.de
2019-11-12 08:13:44 +01:00
Rafael J. Wysocki
c1d51f684c cpuidle: Use nanoseconds as the unit of time
Currently, the cpuidle subsystem uses microseconds as the unit of
time which (among other things) causes the idle loop to incur some
integer division overhead for no clear benefit.

In order to allow cpuidle to measure time in nanoseconds, add two
new fields, exit_latency_ns and target_residency_ns, to represent the
exit latency and target residency of an idle state in nanoseconds,
respectively, to struct cpuidle_state and initialize them with the
help of the corresponding values in microseconds provided by drivers.
Additionally, change cpuidle_governor_latency_req() to return the
idle state exit latency constraint in nanoseconds.

Also meeasure idle state residency (last_residency_ns in struct
cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds
and update the cpuidle core and governors accordingly.

However, the menu governor still computes typical intervals in
microseconds to avoid integer overflows.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
2019-11-11 21:56:07 +01:00
Linus Torvalds
de620fb99e Merge branch 'for-5.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "There's an inadvertent preemption point in ptrace_stop() which was
  reliably triggering for a test scenario significantly slowing it down.

  This contains Oleg's fix to remove the unwanted preemption point"

* 'for-5.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ptrace_stop()
2019-11-11 12:41:14 -08:00
Masahiro Yamada
f276031b4e kheaders: explain why include/config/autoconf.h is excluded from md5sum
This comment block explains why include/generated/compile.h is omitted,
but nothing about include/generated/autoconf.h, which might be more
difficult to understand. Add more comments.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11 20:10:01 +09:00
Masahiro Yamada
1463f74f49 kheaders: remove the last bashism to allow sh to run it
'pushd' ... 'popd' is the last bash-specific code in this script.
One way to avoid it is to run the code in a sub-shell.

With that addressed, you can run this script with sh.

I replaced $(BASH) with $(CONFIG_SHELL), and I changed the hashbang
to #!/bin/sh.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11 20:10:01 +09:00
Masahiro Yamada
ea79e5168b kheaders: optimize header copy for in-tree builds
This script copies headers by the cpio command twice; first from
srctree, and then from objtree. However, when we building in-tree,
we know the srctree and the objtree are the same. That is, all the
headers copied by the first cpio are overwritten by the second one.

Skip the first cpio when we are building in-tree.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11 20:10:01 +09:00
Masahiro Yamada
0e11773e76 kheaders: optimize md5sum calculation for in-tree builds
This script computes md5sum of headers in srctree and in objtree.
However, when we are building in-tree, we know the srctree and the
objtree are the same. That is, we end up with the same computation
twice. In fact, the first two lines of kernel/kheaders.md5 are always
the same for in-tree builds.

Unify the two md5sum calculations.

For in-tree builds ($building_out_of_srctree is empty), we check
only two directories, "include", and "arch/$SRCARCH/include".

For out-of-tree builds ($building_out_of_srctree is 1), we check
4 directories, "$srctree/include", "$srctree/arch/$SRCARCH/include",
"include", and "arch/$SRCARCH/include" since we know they are all
different.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11 20:10:01 +09:00
Masahiro Yamada
9a06635718 kheaders: remove unneeded 'cat' command piped to 'head' / 'tail'
The 'head' and 'tail' commands can take a file path directly.
So, you do not need to run 'cat'.

  cat kernel/kheaders.md5 | head -1

... is equivalent to:

  head -1 kernel/kheaders.md5

and the latter saves forking one process.

While I was here, I replaced 'head -1' with 'head -n 1'.

I also replaced '==' with '=' since we do not have a good reason to
use the bashism.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-11-11 20:10:01 +09:00
Eric Dumazet
5e76f56457 dma-debug: increase HASH_SIZE
With modern NIC, it is not unusual having about ~256,000 active dma
mappings and a hash size of 1024 buckets is too small.

Forcing full cache line per bucket does not seem useful, especially now
that we have contention on free_entries_lock for allocations and freeing
of entries.  Better use the space to fit more buckets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-11 10:52:18 +01:00
Eric Dumazet
d3694f3073 dma-debug: reorder struct dma_debug_entry fields
Move all fields used during exact match lookups to the first cache line.
This makes debug_dma_mapping_error() and friends about 50% faster.

Since it removes two 32bit holes, force a cacheline alignment on struct
dma_debug_entry.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-11 10:52:18 +01:00
Christoph Hellwig
3acac06550 dma-mapping: merge the generic remapping helpers into dma-direct
Integrate the generic dma remapping implementation into the main flow.
This prepares for architectures like xtensa that use an uncached
segment for pages in the kernel mapping, but can also remap highmem
from CMA.  To simplify that implementation we now always deduct the
page from the physical address via the DMA address instead of the
virtual address.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
2019-11-11 10:52:18 +01:00
Christoph Hellwig
34dc0ea6bc dma-direct: provide mmap and get_sgtable method overrides
For dma-direct we know that the DMA address is an encoding of the
physical address that we can trivially decode.  Use that fact to
provide implementations that do not need the arch_dma_coherent_to_pfn
architecture hook.  Note that we still can only support mmap of
non-coherent memory only if the architecture provides a way to set an
uncached bit in the page tables.  This must be true for architectures
that use the generic remap helpers, but other architectures can also
manually select it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
2019-11-11 10:52:15 +01:00
Jiri Slaby
4b48512c2e stacktrace: Get rid of unneeded '!!' pattern
My commit b0c51f1584 ("stacktrace: Don't skip first entry on
noncurrent tasks") adds one or zero to skipnr by "!!(current == tsk)".

But the C99 standard says:

  The == (equal to) and != (not equal to) operators are
  ...
  Each of the operators yields 1 if the specified relation is true and 0
  if it is false.

So there is no need to prepend the above expression by "!!" -- remove it.

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111092647.27419-1-jslaby@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 10:30:59 +01:00
Frederic Weisbecker
feb4a51323 irq_work: Slightly simplify IRQ_WORK_PENDING clearing
Instead of fetching the value of flags and perform an xchg() to clear
a bit, just use atomic_fetch_andnot() that is more suitable to do that
job in one operation while keeping the full ordering.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191108160858.31665-4-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 09:03:31 +01:00
Frederic Weisbecker
25269871db irq_work: Fix irq_work_claim() memory ordering
When irq_work_claim() finds IRQ_WORK_PENDING flag already set, we just
return and don't raise a new IPI. We expect the destination to see
and handle our latest updades thanks to the pairing atomic_xchg()
in irq_work_run_list().

But cmpxchg() doesn't guarantee a full memory barrier upon failure. So
it's possible that the destination misses our latest updates.

So use atomic_fetch_or() instead that is unconditionally fully ordered
and also performs exactly what we want here and simplify the code.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191108160858.31665-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 09:03:31 +01:00
Frederic Weisbecker
153bedbac2 irq_work: Convert flags to atomic_t
We need to convert flags to atomic_t in order to later fix an ordering
issue on atomic_cmpxchg() failure. This will allow us to use atomic_fetch_or().

Also clarify the nature of those flags.

[ mingo: Converted two more usage site the original patch missed. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191108160858.31665-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 09:02:56 +01:00
Peter Zijlstra
a0e813f26e sched/core: Further clarify sched_class::set_next_task()
It turns out there really is something special to the first
set_next_task() invocation. In specific the 'change' pattern really
should not cause balance callbacks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: f95d4eaee6 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:21 +01:00
Peter Zijlstra
2eeb01a28c sched/fair: Use mul_u32_u32()
While reading the code I encountered another site where we should be
using mul_u32_u32() because GCC just won't take a hint.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.717931380@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:20 +01:00
Peter Zijlstra
98c2f700ed sched/core: Simplify sched_class::pick_next_task()
Now that the indirect class call never uses the last two arguments of
pick_next_task(), remove them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.660595546@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:20 +01:00
Peter Zijlstra
5d7d605642 sched/core: Optimize pick_next_task()
Ever since we moved the sched_class definitions into their own files,
the constant expression {fair,idle}_sched_class.pick_next_task() is
not in fact a compile time constant anymore and results in an indirect
call (barring LTO).

Fix that by exposing pick_next_task_{fair,idle}() directly, this gets
rid of the indirect call (and RETPOLINE) on the fast path.

Also remove the unlikely() from the idle case, it is in fact /the/ way
we select idle -- and that is a very common thing to do.

Performance for will-it-scale/sched_yield improves by 2% (as reported
by 0-day).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.603037345@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:19 +01:00
Peter Zijlstra
f488e1057b sched/core: Make pick_next_task_idle() more consistent
Only pick_next_task_fair() needs the @prev and @rf argument; these are
required to implement the cpu-cgroup optimization. None of the other
pick_next_task() methods need this. Make pick_next_task_idle() more
consistent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.545730862@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:19 +01:00
Peter Zijlstra
7277a34c6b sched/fair: Better document newidle_balance()
Whilst chasing the pick_next_task() race, there was some confusion
about the newidle_balance() return values. Document them.

[ mingo: Minor edits. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20191108131909.488364308@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:35:18 +01:00
Ingo Molnar
6d5a763c30 Linux 5.4-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3IqJQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOiUH+gOEDwid5OODaFAd
 CggXugdFIlBZefKqGVNW5sjgX8pxFWHXuEMC8iNb6QXtQZdFrI6LFf9hhUDmzQtm
 6y1LPxxEiTZjObMEsBNylb7tyzgujFHcAlp0Zro3w/HLCqmYTSP3FF46i2u6KZfL
 XhkpM4X7R7qxlfpdhlfESv/ElRGocZe6SwXfC7pcPo5flFcmkdu9ijqhNd/6CZ/h
 Nf9rTsD/wEDVUelFbgVN+LJzlaB0tsyc4Zbof07n8OsFZjhdEOop8gfM/kTBLcyY
 6bh66SfDScdsNnC/l8csbPjSZRx+i+nQs67DyhGNnsSAFgHBZdC4Tb/2mDCwhCLR
 dUvuYZc=
 =1N6F
 -----END PGP SIGNATURE-----

Merge tag 'v5.4-rc7' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 08:34:59 +01:00
Ingo Molnar
1ca7feb590 Linux 5.4-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3IqJQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOiUH+gOEDwid5OODaFAd
 CggXugdFIlBZefKqGVNW5sjgX8pxFWHXuEMC8iNb6QXtQZdFrI6LFf9hhUDmzQtm
 6y1LPxxEiTZjObMEsBNylb7tyzgujFHcAlp0Zro3w/HLCqmYTSP3FF46i2u6KZfL
 XhkpM4X7R7qxlfpdhlfESv/ElRGocZe6SwXfC7pcPo5flFcmkdu9ijqhNd/6CZ/h
 Nf9rTsD/wEDVUelFbgVN+LJzlaB0tsyc4Zbof07n8OsFZjhdEOop8gfM/kTBLcyY
 6bh66SfDScdsNnC/l8csbPjSZRx+i+nQs67DyhGNnsSAFgHBZdC4Tb/2mDCwhCLR
 dUvuYZc=
 =1N6F
 -----END PGP SIGNATURE-----

Merge tag 'v5.4-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11 07:59:06 +01:00
Linus Torvalds
621084cd3d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A small set of fixes for timekeepoing and clocksource drivers:

   - VDSO data was updated conditional on the availability of a VDSO
     capable clocksource. This causes the VDSO functions which do not
     depend on a VDSO capable clocksource to operate on stale data.
     Always update unconditionally.

   - Prevent a double free in the mediatek driver

   - Use the proper helper in the sh_mtu2 driver so it won't attempt to
     initialize non-existing interrupts"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping/vsyscall: Update VDSO data unconditionally
  clocksource/drivers/sh_mtu2: Do not loop using platform_get_irq_by_name()
  clocksource/drivers/mediatek: Fix error handling
2019-11-10 12:03:58 -08:00
Linus Torvalds
81388c2b3f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two fixes for scheduler regressions:

   - Plug a subtle race condition which was introduced with the rework
     of the next task selection functionality. The change of task
     properties became unprotected which can be observed inconsistently
     causing state corruption.

   - A trivial compile fix for CONFIG_CGROUPS=n"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix pick_next_task() vs 'change' pattern race
  sched/core: Fix compilation error when cgroup not selected
2019-11-10 12:00:47 -08:00
Linus Torvalds
ffba65ea24 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixlet from Thomas Gleixner:
 "A trivial fix for a kernel doc regression where an argument change was
  not reflected in the documentation"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irq/irqdomain: Update __irq_domain_alloc_fwnode() function documentation
2019-11-10 11:51:11 -08:00
Linus Torvalds
20c7e29684 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stacktrace fix from Thomas Gleixner:
 "A small fix for a stacktrace regression.

  Saving a stacktrace for a foreign task skipped an extra entry which
  makes e.g. the output of /proc/$PID/stack incomplete"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stacktrace: Don't skip first entry on noncurrent tasks
2019-11-10 11:47:39 -08:00
Al Viro
69924b8968 audit_get_nd(): don't unlock parent too early
if the child has been negative and just went positive
under us, we want coherent d_is_positive() and ->d_inode.
Don't unlock the parent until we'd done that work...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:56:55 -05:00
Al Viro
630faf81b3 cgroup: don't put ERR_PTR() into fc->root
the caller of ->get_tree() expects NULL left there on error...

Reported-by: Thibaut Sautereau <thibaut@sautereau.fr>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:53:27 -05:00
David S. Miller
14684b9301 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09 11:04:37 -08:00
Linus Torvalds
0058b0a506 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) BPF sample build fixes from Björn Töpel

 2) Fix powerpc bpf tail call implementation, from Eric Dumazet.

 3) DCCP leaks jiffies on the wire, fix also from Eric Dumazet.

 4) Fix crash in ebtables when using dnat target, from Florian Westphal.

 5) Fix port disable handling whne removing bcm_sf2 driver, from Florian
    Fainelli.

 6) Fix kTLS sk_msg trim on fallback to copy mode, from Jakub Kicinski.

 7) Various KCSAN fixes all over the networking, from Eric Dumazet.

 8) Memory leaks in mlx5 driver, from Alex Vesker.

 9) SMC interface refcounting fix, from Ursula Braun.

10) TSO descriptor handling fixes in stmmac driver, from Jose Abreu.

11) Add a TX lock to synchonize the kTLS TX path properly with crypto
    operations. From Jakub Kicinski.

12) Sock refcount during shutdown fix in vsock/virtio code, from Stefano
    Garzarella.

13) Infinite loop in Intel ice driver, from Colin Ian King.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits)
  ixgbe: need_wakeup flag might not be set for Tx
  i40e: need_wakeup flag might not be set for Tx
  igb/igc: use ktime accessors for skb->tstamp
  i40e: Fix for ethtool -m issue on X722 NIC
  iavf: initialize ITRN registers with correct values
  ice: fix potential infinite loop because loop counter being too small
  qede: fix NULL pointer deref in __qede_remove()
  net: fix data-race in neigh_event_send()
  vsock/virtio: fix sock refcnt holding during the shutdown
  net: ethernet: octeon_mgmt: Account for second possible VLAN header
  mac80211: fix station inactive_time shortly after boot
  net/fq_impl: Switch to kvmalloc() for memory allocation
  mac80211: fix ieee80211_txq_setup_flows() failure path
  ipv4: Fix table id reference in fib_sync_down_addr
  ipv6: fixes rt6_probe() and fib6_nh->last_probe init
  net: hns: Fix the stray netpoll locks causing deadlock in NAPI path
  net: usb: qmi_wwan: add support for DW5821e with eSIM support
  CDC-NCM: handle incomplete transfer of MTU
  nfc: netlink: fix double device reference drop
  NFC: st21nfca: fix double free
  ...
2019-11-08 18:21:05 -08:00
Peter Zijlstra
6e2df0581f sched: Fix pick_next_task() vs 'change' pattern race
Commit 67692435c4 ("sched: Rework pick_next_task() slow-path")
inadvertly introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().

The comments about dropping rq->lock, in for example
newidle_balance(), only mentions the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):

	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
	running = task_current(rq, p); /* rq->curr == p */

	if (queued)
		dequeue_task(...);
	if (running)
		put_prev_task(...);

	/* change task properties */

	if (queued)
		enqueue_task(...);
	if (running)
		set_next_task(...);

It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:

	prev->sched_class->put_prev_task(rq, prev, rf);
	if (!rq->nr_running)
		newidle_balance(rq, rf);

The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.

Furthermore, it turns out we lost the RT-pull when we put the last DL
task.

Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().

Fixes: 67692435c4 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Quentin Perret <qperret@google.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
2019-11-08 22:34:14 +01:00
Qais Yousef
e3b8b6a0d1 sched/core: Fix compilation error when cgroup not selected
When cgroup is disabled the following compilation error was hit

	kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
	kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
	  struct css_task_iter it;
			       ^~
	kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
	  css_task_iter_start(css, 0, &it);
	  ^~~~~~~~~~~~~~~~~~~
	  __sg_page_iter_start
	kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
	  while ((p = css_task_iter_next(&it))) {
		      ^~~~~~~~~~~~~~~~~~
		      __sg_page_iter_next
	kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
	  css_task_iter_end(&it);
	  ^~~~~~~~~~~~~~~~~
	  get_task_cred
	kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
	  struct css_task_iter it;
			       ^~
	cc1: some warnings being treated as errors
	make[2]: *** [kernel/sched/core.o] Error 1

Fix by protetion uclamp_update_active_tasks() with
CONFIG_UCLAMP_TASK_GROUP

Fixes: babbe170e0 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com
2019-11-08 22:34:14 +01:00
Catalin Marinas
6be22809e5 Merge branches 'for-next/elf-hwcap-docs', 'for-next/smccc-conduit-cleanup', 'for-next/zone-dma', 'for-next/relax-icc_pmr_el1-sync', 'for-next/double-page-fault', 'for-next/misc', 'for-next/kselftest-arm64-signal' and 'for-next/kaslr-diagnostics' into for-next/core
* for-next/elf-hwcap-docs:
  : Update the arm64 ELF HWCAP documentation
  docs/arm64: cpu-feature-registers: Rewrite bitfields that don't follow [e, s]
  docs/arm64: cpu-feature-registers: Documents missing visible fields
  docs/arm64: elf_hwcaps: Document HWCAP_SB
  docs/arm64: elf_hwcaps: sort the HWCAP{, 2} documentation by ascending value

* for-next/smccc-conduit-cleanup:
  : SMC calling convention conduit clean-up
  firmware: arm_sdei: use common SMCCC_CONDUIT_*
  firmware/psci: use common SMCCC_CONDUIT_*
  arm: spectre-v2: use arm_smccc_1_1_get_conduit()
  arm64: errata: use arm_smccc_1_1_get_conduit()
  arm/arm64: smccc/psci: add arm_smccc_1_1_get_conduit()

* for-next/zone-dma:
  : Reintroduction of ZONE_DMA for Raspberry Pi 4 support
  arm64: mm: reserve CMA and crashkernel in ZONE_DMA32
  dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
  arm64: Make arm64_dma32_phys_limit static
  arm64: mm: Fix unused variable warning in zone_sizes_init
  mm: refresh ZONE_DMA and ZONE_DMA32 comments in 'enum zone_type'
  arm64: use both ZONE_DMA and ZONE_DMA32
  arm64: rename variables used to calculate ZONE_DMA32's size
  arm64: mm: use arm64_dma_phys_limit instead of calling max_zone_dma_phys()

* for-next/relax-icc_pmr_el1-sync:
  : Relax ICC_PMR_EL1 (GICv3) accesses when ICC_CTLR_EL1.PMHE is clear
  arm64: Document ICC_CTLR_EL3.PMHE setting requirements
  arm64: Relax ICC_PMR_EL1 accesses when ICC_CTLR_EL1.PMHE is clear

* for-next/double-page-fault:
  : Avoid a double page fault in __copy_from_user_inatomic() if hw does not support auto Access Flag
  mm: fix double page fault on arm64 if PTE_AF is cleared
  x86/mm: implement arch_faults_on_old_pte() stub on x86
  arm64: mm: implement arch_faults_on_old_pte() on arm64
  arm64: cpufeature: introduce helper cpu_has_hw_af()

* for-next/misc:
  : Various fixes and clean-ups
  arm64: kpti: Add NVIDIA's Carmel core to the KPTI whitelist
  arm64: mm: Remove MAX_USER_VA_BITS definition
  arm64: mm: simplify the page end calculation in __create_pgd_mapping()
  arm64: print additional fault message when executing non-exec memory
  arm64: psci: Reduce the waiting time for cpu_psci_cpu_kill()
  arm64: pgtable: Correct typo in comment
  arm64: docs: cpu-feature-registers: Document ID_AA64PFR1_EL1
  arm64: cpufeature: Fix typos in comment
  arm64/mm: Poison initmem while freeing with free_reserved_area()
  arm64: use generic free_initrd_mem()
  arm64: simplify syscall wrapper ifdeffery

* for-next/kselftest-arm64-signal:
  : arm64-specific kselftest support with signal-related test-cases
  kselftest: arm64: fake_sigreturn_misaligned_sp
  kselftest: arm64: fake_sigreturn_bad_size
  kselftest: arm64: fake_sigreturn_duplicated_fpsimd
  kselftest: arm64: fake_sigreturn_missing_fpsimd
  kselftest: arm64: fake_sigreturn_bad_size_for_magic0
  kselftest: arm64: fake_sigreturn_bad_magic
  kselftest: arm64: add helper get_current_context
  kselftest: arm64: extend test_init functionalities
  kselftest: arm64: mangle_pstate_invalid_mode_el[123][ht]
  kselftest: arm64: mangle_pstate_invalid_daif_bits
  kselftest: arm64: mangle_pstate_invalid_compat_toggle and common utils
  kselftest: arm64: extend toplevel skeleton Makefile

* for-next/kaslr-diagnostics:
  : Provide diagnostics on boot for KASLR
  arm64: kaslr: Check command line before looking for a seed
  arm64: kaslr: Announce KASLR status on boot
2019-11-08 17:46:11 +00:00
Steven Rostedt (VMware)
7e16f581a8 ftrace: Separate out functionality from ftrace_location_range()
Create a new function called lookup_rec() from the functionality of
ftrace_location_range(). The difference between lookup_rec() is that it
returns the record that it finds, where as ftrace_location_range() returns
only if it found a match or not.

The lookup_rec() is static, and can be used for new functionality where
ftrace needs to find a record of a specific address.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-08 12:26:46 -05:00
Steven Rostedt (VMware)
714641c367 ftrace: Separate out the copying of a ftrace_hash from __ftrace_hash_move()
Most of the functionality of __ftrace_hash_move() can be reused, but not all
of it. That is, __ftrace_hash_move() is used to simply make a new hash from
an existing one, using the same size as the original. Creating a dup_hash(),
where we can specify a new size will be useful when we want to create a hash
with a default size, or simply copy the old one.

Signed-off-by: Steven Rostedt (VMWare) <rostedt@goodmis.org>
2019-11-08 12:25:46 -05:00
Martin KaFai Lau
7e3617a72d bpf: Add array support to btf_struct_access
This patch adds array support to btf_struct_access().
It supports array of int, array of struct and multidimensional
array.

It also allows using u8[] as a scratch space.  For example,
it allows access the "char cb[48]" with size larger than
the array's element "char".  Another potential use case is
"u64 icsk_ca_priv[]" in the tcp congestion control.

btf_resolve_size() is added to resolve the size of any type.
It will follow the modifier if there is any.  Please
see the function comment for details.

This patch also adds the "off < moff" check at the beginning
of the for loop.  It is to reject cases when "off" is pointing
to a "hole" in a struct.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191107180903.4097702-1-kafai@fb.com
2019-11-07 10:59:08 -08:00
Christoph Hellwig
4e1003aa56 dma-direct: remove the dma_handle argument to __dma_direct_alloc_pages
The argument isn't used anywhere, so stop passing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
2019-11-07 17:25:57 +01:00
Christoph Hellwig
acaade1af3 dma-direct: remove __dma_direct_free_pages
We can just call dma_free_contiguous directly instead of wrapping it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
2019-11-07 17:25:40 +01:00
Honglei Wang
742e8cd3e1 cgroup: freezer: don't change task and cgroups status unnecessarily
It's not necessary to adjust the task state and revisit the state
of source and destination cgroups if the cgroups are not in freeze
state and the task itself is not frozen.

And in this scenario, it wakes up the task who's not supposed to be
ready to run.

Don't do the unnecessary task state adjustment can help stop waking
up the task without a reason.

Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-07 07:38:41 -08:00
Amit Kucheria
3f6ec871e1 cpufreq: Initialize the governors in core_initcall
Initialize the cpufreq governors earlier to allow for earlier
performance control during the boot process.

Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/b98eae9b44eb2f034d7f5d12a161f5f831be1eb7.1571656015.git.amit.kucheria@linaro.org
2019-11-07 07:00:26 +01:00
Martin KaFai Lau
85d31dd070 bpf: Account for insn->off when doing bpf_probe_read_kernel
In the bpf interpreter mode, bpf_probe_read_kernel is used to read
from PTR_TO_BTF_ID's kernel object.  It currently missed considering
the insn->off.  This patch fixes it.

Fixes: 2a02759ef5 ("bpf: Add support for BTF pointers to interpreter")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191107014640.384083-1-kafai@fb.com
2019-11-06 21:52:52 -08:00
Dan Carpenter
d0fbb51dfa bpf, offload: Unlock on error in bpf_offload_dev_create()
We need to drop the bpf_devs_lock on error before returning.

Fixes: 9fd7c55591 ("bpf: offload: aggregate offloads per-device")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Link: https://lore.kernel.org/bpf/20191104091536.GB31509@mwanda
2019-11-07 00:20:27 +01:00
Eric Dumazet
56144737e6 hrtimer: Annotate lockless access to timer->state
syzbot reported various data-race caused by hrtimer_is_queued() reading
timer->state. A READ_ONCE() is required there to silence the warning.

Also add the corresponding WRITE_ONCE() when timer->state is set.

In remove_hrtimer() the hrtimer_is_queued() helper is open coded to avoid
loading timer->state twice.

KCSAN reported these cases:

BUG: KCSAN: data-race in __remove_hrtimer / tcp_pacing_check

write to 0xffff8880b2a7d388 of 1 bytes by interrupt on cpu 0:
 __remove_hrtimer+0x52/0x130 kernel/time/hrtimer.c:991
 __run_hrtimer kernel/time/hrtimer.c:1496 [inline]
 __hrtimer_run_queues+0x250/0x600 kernel/time/hrtimer.c:1576
 hrtimer_run_softirq+0x10e/0x150 kernel/time/hrtimer.c:1593
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffff8880b2a7d388 of 1 bytes by task 24652 on cpu 1:
 tcp_pacing_check net/ipv4/tcp_output.c:2235 [inline]
 tcp_pacing_check+0xba/0x130 net/ipv4/tcp_output.c:2225
 tcp_xmit_retransmit_queue+0x32c/0x5a0 net/ipv4/tcp_output.c:3044
 tcp_xmit_recovery+0x7c/0x120 net/ipv4/tcp_input.c:3558
 tcp_ack+0x17b6/0x3170 net/ipv4/tcp_input.c:3717
 tcp_rcv_established+0x37e/0xf50 net/ipv4/tcp_input.c:5696
 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561
 sk_backlog_rcv include/net/sock.h:945 [inline]
 __release_sock+0x135/0x1e0 net/core/sock.c:2435
 release_sock+0x61/0x160 net/core/sock.c:2951
 sk_stream_wait_memory+0x3d7/0x7c0 net/core/stream.c:145
 tcp_sendmsg_locked+0xb47/0x1f30 net/ipv4/tcp.c:1393
 tcp_sendmsg+0x39/0x60 net/ipv4/tcp.c:1434
 inet_sendmsg+0x6d/0x90 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657

BUG: KCSAN: data-race in __remove_hrtimer / __tcp_ack_snd_check

write to 0xffff8880a3a65588 of 1 bytes by interrupt on cpu 0:
 __remove_hrtimer+0x52/0x130 kernel/time/hrtimer.c:991
 __run_hrtimer kernel/time/hrtimer.c:1496 [inline]
 __hrtimer_run_queues+0x250/0x600 kernel/time/hrtimer.c:1576
 hrtimer_run_softirq+0x10e/0x150 kernel/time/hrtimer.c:1593
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0xbb/0xe0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830

read to 0xffff8880a3a65588 of 1 bytes by task 22891 on cpu 1:
 __tcp_ack_snd_check+0x415/0x4f0 net/ipv4/tcp_input.c:5265
 tcp_ack_snd_check net/ipv4/tcp_input.c:5287 [inline]
 tcp_rcv_established+0x750/0xf50 net/ipv4/tcp_input.c:5708
 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561
 sk_backlog_rcv include/net/sock.h:945 [inline]
 __release_sock+0x135/0x1e0 net/core/sock.c:2435
 release_sock+0x61/0x160 net/core/sock.c:2951
 sk_stream_wait_memory+0x3d7/0x7c0 net/core/stream.c:145
 tcp_sendmsg_locked+0xb47/0x1f30 net/ipv4/tcp.c:1393
 tcp_sendmsg+0x39/0x60 net/ipv4/tcp.c:1434
 inet_sendmsg+0x6d/0x90 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 __sys_sendto+0x21f/0x320 net/socket.c:1952
 __do_sys_sendto net/socket.c:1964 [inline]
 __se_sys_sendto net/socket.c:1960 [inline]
 __x64_sys_sendto+0x89/0xb0 net/socket.c:1960
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 24652 Comm: syz-executor.3 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

[ tglx: Added comments ]

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191106174804.74723-1-edumazet@google.com
2019-11-06 23:18:31 +01:00
Tejun Heo
1bb5ec2eec cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
cgroup->bstat_pending is used to determine the base stat delta to
propagate to the parent.  While correct, this is different from how
percpu delta is determined for no good reason and the inconsistency
makes the code more difficult to understand.

This patch makes parent propagation delta calculation use the same
method as percpu to global propagation.

* cgroup_base_stat_accumulate() is renamed to cgroup_base_stat_add()
  and cgroup_base_stat_sub() is added.

* percpu propagation calculation is updated to use the above helpers.

* cgroup->bstat_pending is replaced with cgroup->last_bstat and
  updated to use the same calculation as percpu propagation.

Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-06 12:50:15 -08:00
Mark Rutland
a1326b17ac module/ftrace: handle patchable-function-entry
When using patchable-function-entry, the compiler will record the
callsites into a section named "__patchable_function_entries" rather
than "__mcount_loc". Let's abstract this difference behind a new
FTRACE_CALLSITE_SECTION, so that architectures don't have to handle this
explicitly (e.g. with custom module linker scripts).

As parisc currently handles this explicitly, it is fixed up accordingly,
with its custom linker script removed. Since FTRACE_CALLSITE_SECTION is
only defined when DYNAMIC_FTRACE is selected, the parisc module loading
code is updated to only use the definition in that case. When
DYNAMIC_FTRACE is not selected, modules shouldn't have this section, so
this removes some redundant work in that case.

To make sure that this is keep up-to-date for modules and the main
kernel, a comment is added to vmlinux.lds.h, with the existing ifdeffery
simplified for legibility.

I built parisc generic-{32,64}bit_defconfig with DYNAMIC_FTRACE enabled,
and verified that the section made it into the .ko files for modules.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Helge Deller <deller@gmx.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: linux-parisc@vger.kernel.org
2019-11-06 14:17:30 +00:00
Mark Rutland
fbf6c73c5b ftrace: add ftrace_init_nop()
Architectures may need to perform special initialization of ftrace
callsites, and today they do so by special-casing ftrace_make_nop() when
the expected branch address is MCOUNT_ADDR. In some cases (e.g. for
patchable-function-entry), we don't have an mcount-like symbol and don't
want a synthetic MCOUNT_ADDR, but we may need to perform some
initialization of callsites.

To make it possible to separate initialization from runtime
modification, and to handle cases without an mcount-like symbol, this
patch adds an optional ftrace_init_nop() function that architectures can
implement, which does not pass a branch address.

Where an architecture does not provide ftrace_init_nop(), we will fall
back to the existing behaviour of calling ftrace_make_nop() with
MCOUNT_ADDR.

At the same time, ftrace_code_disable() is renamed to
ftrace_nop_initialize() to make it clearer that it is intended to
intialize a callsite into a disabled state, and is not for disabling a
callsite that has been runtime enabled. The kerneldoc description of rec
arguments is updated to cover non-mcount callsites.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Torsten Duwe <duwe@suse.de>
Tested-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Tested-by: Torsten Duwe <duwe@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
2019-11-06 14:17:13 +00:00
David S. Miller
41de23e223 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-11-02

The following pull-request contains BPF updates for your *net* tree.

We've added 6 non-merge commits during the last 6 day(s) which contain
a total of 8 files changed, 35 insertions(+), 9 deletions(-).

The main changes are:

1) Fix ppc BPF JIT's tail call implementation by performing a second pass
   to gather a stable JIT context before opcode emission, from Eric Dumazet.

2) Fix build of BPF samples sys_perf_event_open() usage to compiled out
   unavailable test_attr__{enabled,open} checks. Also fix potential overflows
   in bpf_map_{area_alloc,charge_init} on 32 bit archs, from Björn Töpel.

3) Fix narrow loads of bpf_sysctl context fields with offset > 0 on big endian
   archs like s390x and also improve the test coverage, from Ilya Leoshkevich.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 17:38:21 -08:00
Linus Torvalds
26bc672134 for-linus-2019-11-05
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXcGMxwAKCRCRxhvAZXjc
 otEzAP9lHvP97TtRG9gP6dj6YovZ5Djdo3IscmkTqy5Nt8sVNQD+NWg1LZnSMFdJ
 ExETgRVlsjF8q2sblswtn/8Ab53O6AM=
 =RVGl
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-11-05' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull clone3 stack argument update from Christian Brauner:
 "This changes clone3() to do basic stack validation and to set up the
  stack depending on whether or not it is growing up or down.

  With clone3() the expectation is now very simply that the .stack
  argument points to the lowest address of the stack and that
  .stack_size specifies the initial stack size. This is diferent from
  legacy clone() where the "stack" argument had to point to the lowest
  or highest address of the stack depending on the architecture.

  clone3() was released with 5.3. Currently, it is not documented and
  very unclear to userspace how the stack and stack_size argument have
  to be passed. After talking to glibc folks we concluded that changing
  clone3() to determine stack direction and doing basic validation is
  the right course of action.

  Note, this is a potentially user visible change. In the very unlikely
  case, that it breaks someone's use-case we will revert. (And then e.g.
  place the new behavior under an appropriate flag.)

  Note that passing an empty stack will continue working just as before.
  Breaking someone's use-case is very unlikely. Neither glibc nor musl
  currently expose a wrapper for clone3(). There is currently also no
  real motivation for anyone to use clone3() directly. First, because
  using clone{3}() with stacks requires some assembly (see glibc and
  musl). Second, because it does not provide features that legacy
  clone() doesn't. New features for clone3() will first happen in v5.5
  which is why v5.4 is still a good time to try and make that change now
  and backport it to v5.3.

  I did a codesearch on https://codesearch.debian.net, github, and
  gitlab and could not find any software currently relying directly on
  clone3(). I expect this to change once we land CLONE_CLEAR_SIGHAND
  which was a request coming from glibc at which point they'll likely
  start using it"

* tag 'for-linus-2019-11-05' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  clone3: validate stack arguments
2019-11-05 09:44:02 -08:00
Christian Brauner
fa729c4df5
clone3: validate stack arguments
Validate the stack arguments and setup the stack depening on whether or not
it is growing down or up.

Legacy clone() required userspace to know in which direction the stack is
growing and pass down the stack pointer appropriately. To make things more
confusing microblaze uses a variant of the clone() syscall selected by
CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
IA64 has a separate clone2() syscall which also takes an additional
stack_size argument. Finally, parisc has a stack that is growing upwards.
Userspace therefore has a lot nasty code like the following:

 #define __STACK_SIZE (8 * 1024 * 1024)
 pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
 {
         pid_t ret;
         void *stack;

         stack = malloc(__STACK_SIZE);
         if (!stack)
                 return -ENOMEM;

 #ifdef __ia64__
         ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #elif defined(__parisc__) /* stack grows up */
         ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
 #else
         ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
 #endif
         return ret;
 }

or even crazier variants such as [3].

With clone3() we have the ability to validate the stack. We can check that
when stack_size is passed, the stack pointer is valid and the other way
around. We can also check that the memory area userspace gave us is fine to
use via access_ok(). Furthermore, we probably should not require
userspace to know in which direction the stack is growing. It is easy
for us to do this in the kernel and I couldn't find the original
reasoning behind exposing this detail to userspace.

/* Intentional user visible API change */
clone3() was released with 5.3. Currently, it is not documented and very
unclear to userspace how the stack and stack_size argument have to be
passed. After talking to glibc folks we concluded that trying to change
clone3() to setup the stack instead of requiring userspace to do this is
the right course of action.
Note, that this is an explicit change in user visible behavior we introduce
with this patch. If it breaks someone's use-case we will revert! (And then
e.g. place the new behavior under an appropriate flag.)
Breaking someone's use-case is very unlikely though. First, neither glibc
nor musl currently expose a wrapper for clone3(). Second, there is no real
motivation for anyone to use clone3() directly since it does not provide
features that legacy clone doesn't. New features for clone3() will first
happen in v5.5 which is why v5.4 is still a good time to try and make that
change now and backport it to v5.3. Searches on [4] did not reveal any
packages calling clone3().

[1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
[2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
[3]: 5238e95759/src/basic/raw-clone.h (L31)
[4]: https://codesearch.debian.net
Fixes: 7f192e3cd3 ("fork: add clone3")
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-api@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # 5.3
Cc: GNU C Library <libc-alpha@sourceware.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com
2019-11-05 15:50:14 +01:00
Yi Wang
0ed9ca2589 irq/irqdomain: Update __irq_domain_alloc_fwnode() function documentation
A recent commit changed a parameter of __irq_domain_alloc_fwnode(), but
did not update the documentation comment. Fix it up.

Fixes: b977fcf477 ("irqdomain/debugfs: Use PAs to generate fwnode names")
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1571476047-29463-1-git-send-email-wang.yi59@zte.com.cn
2019-11-05 00:48:26 +01:00
Huacai Chen
52338415cf timekeeping/vsyscall: Update VDSO data unconditionally
The update of the VDSO data is depending on __arch_use_vsyscall() returning
True. This is a leftover from the attempt to map the features of various
architectures 1:1 into generic code.

The usage of __arch_use_vsyscall() in the actual vsyscall implementations
got dropped and replaced by the requirement for the architecture code to
return U64_MAX if the global clocksource is not usable in the VDSO.

But the __arch_use_vsyscall() check in the update code stayed which causes
the VDSO data to be stale or invalid when an architecture actually
implements that function and returns False when the current clocksource is
not usable in the VDSO.

As a consequence the VDSO implementations of clock_getres(), time(),
clock_gettime(CLOCK_.*_COARSE) operate on invalid data and return bogus
information.

Remove the __arch_use_vsyscall() check from the VDSO update function and
update the VDSO data unconditionally.

[ tglx: Massaged changelog and removed the now useless implementations in
  	asm-generic/ARM64/MIPS ]

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Burton <paul.burton@mips.com>
Cc: linux-mips@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1571887709-11447-1-git-send-email-chenhc@lemote.com
2019-11-04 23:02:53 +01:00
Jiri Slaby
b0c51f1584 stacktrace: Don't skip first entry on noncurrent tasks
When doing cat /proc/<PID>/stack, the output is missing the first entry.
When the current code walks the stack starting in stack_trace_save_tsk,
it skips all scheduler functions (that's OK) plus one more function. But
this one function should be skipped only for the 'current' task as it is
stack_trace_save_tsk proper.

The original code (before the common infrastructure) skipped one
function only for the 'current' task -- see save_stack_trace_tsk before
3599fe12a1. So do so also in the new infrastructure now.

Fixes: 214d8ca6ee ("stacktrace: Provide common infrastructure")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michal Suchanek <msuchanek@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20191030072545.19462-1-jslaby@suse.cz
2019-11-04 21:19:25 +01:00
Miroslav Benes
7162431dcf ftrace: Introduce PERMANENT ftrace_ops flag
Livepatch uses ftrace for redirection to new patched functions. It means
that if ftrace is disabled, all live patched functions are disabled as
well. Toggling global 'ftrace_enabled' sysctl thus affect it directly.
It is not a problem per se, because only administrator can set sysctl
values, but it still may be surprising.

Introduce PERMANENT ftrace_ops flag to amend this. If the
FTRACE_OPS_FL_PERMANENT is set on any ftrace ops, the tracing cannot be
disabled by disabling ftrace_enabled. Equally, a callback with the flag
set cannot be registered if ftrace_enabled is disabled.

Link: http://lkml.kernel.org/r/20191016113316.13415-2-mbenes@suse.cz

Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-04 09:33:15 -05:00
Tyler Hicks
731dc9df97 cpu/speculation: Uninline and export CPU mitigations helpers
A kernel module may need to check the value of the "mitigations=" kernel
command line parameter as part of its setup when the module needs
to perform software mitigations for a CPU flaw.

Uninline and export the helper functions surrounding the cpu_mitigations
enum to allow for their usage from a module.

Lastly, privatize the enum and cpu_mitigations variable since the value of
cpu_mitigations can be checked with the exported helper functions.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-11-04 12:22:02 +01:00
David S. Miller
ae8a76fb8b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-11-02

The following pull-request contains BPF updates for your *net-next* tree.

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 41 files changed, 1864 insertions(+), 474 deletions(-).

The main changes are:

1) Fix long standing user vs kernel access issue by introducing
   bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.

2) Accelerated xskmap lookup, from Björn and Maciej.

3) Support for automatic map pinning in libbpf, from Toke.

4) Cleanup of BTF-enabled raw tracepoints, from Alexei.

5) Various fixes to libbpf and selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 15:29:58 -07:00
David S. Miller
d31e95585c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.

The rest were (relatively) trivial in nature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 13:54:56 -07:00
Daniel Borkmann
6e07a63412 bpf: Switch BPF probe insns to bpf_probe_read_kernel
Commit 2a02759ef5 ("bpf: Add support for BTF pointers to interpreter")
explicitly states that the pointer to BTF object is a pointer to a kernel
object or NULL. Therefore we should also switch to using the strict kernel
probe helper which is restricted to kernel addresses only when architectures
have non-overlapping address spaces.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/d2b90827837685424a4b8008dfe0460558abfada.1572649915.git.daniel@iogearbox.net
2019-11-02 12:39:12 -07:00
Daniel Borkmann
6ae08ae3de bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers
The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
in that they assume they can be used for probing memory access for kernel
space addresses /as well as/ user space addresses.

However, plain use of probe_kernel_read() for both cases will attempt to
always access kernel space address space given access is performed under
KERNEL_DS and some archs in-fact have overlapping address spaces where a
kernel pointer and user pointer would have the /same/ address value and
therefore accessing application memory via bpf_probe_read{,_str}() would
read garbage values.

Lets fix BPF side by making use of recently added 3d7081822f ("uaccess:
Add non-pagefault user-space read functions"). Unfortunately, the only way
to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}()
and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}()
helpers are kept as-is to retain their current behavior.

The two *_user() variants attempt the access always under USER_DS set, the
two *_kernel() variants will -EFAULT when accessing user memory if the
underlying architecture has non-overlapping address ranges, also avoiding
throwing the kernel warning via 00c42373d3 ("x86-64: add warning for
non-canonical user access address dereferences").

Fixes: a5e8c07059 ("bpf: add bpf_probe_read_str helper")
Fixes: 2541517c32 ("tracing, perf: Implement BPF programs attached to kprobes")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
2019-11-02 12:39:12 -07:00
Daniel Borkmann
eb1b668874 bpf: Make use of probe_user_write in probe write helper
Convert the bpf_probe_write_user() helper to probe_user_write() such that
writes are not attempted under KERNEL_DS anymore which is buggy as kernel
and user space pointers can have overlapping addresses. Also, given we have
the access_ok() check inside probe_user_write(), the helper doesn't need
to do it twice.

Fixes: 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/841c461781874c07a0ee404a454c3bc0459eed30.1572649915.git.daniel@iogearbox.net
2019-11-02 12:39:12 -07:00
Linus Torvalds
1204c70d9d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix free/alloc races in batmanadv, from Sven Eckelmann.

 2) Several leaks and other fixes in kTLS support of mlx5 driver, from
    Tariq Toukan.

 3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke
    Høiland-Jørgensen.

 4) Add an r8152 device ID, from Kazutoshi Noguchi.

 5) Missing include in ipv6's addrconf.c, from Ben Dooks.

 6) Use siphash in flow dissector, from Eric Dumazet. Attackers can
    easily infer the 32-bit secret otherwise etc.

 7) Several netdevice nesting depth fixes from Taehee Yoo.

 8) Fix several KCSAN reported errors, from Eric Dumazet. For example,
    when doing lockless skb_queue_empty() checks, and accessing
    sk_napi_id/sk_incoming_cpu lockless as well.

 9) Fix jumbo packet handling in RXRPC, from David Howells.

10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet.

11) Fix DMA synchronization in gve driver, from Yangchun Fu.

12) Several bpf offload fixes, from Jakub Kicinski.

13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo.

14) Fix ping latency during high traffic rates in hisilicon driver, from
    Jiangfent Xiao.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
  net: fix installing orphaned programs
  net: cls_bpf: fix NULL deref on offload filter removal
  selftests: bpf: Skip write only files in debugfs
  selftests: net: reuseport_dualstack: fix uninitalized parameter
  r8169: fix wrong PHY ID issue with RTL8168dp
  net: dsa: bcm_sf2: Fix IMP setup for port different than 8
  net: phylink: Fix phylink_dbg() macro
  gve: Fixes DMA synchronization.
  inet: stop leaking jiffies on the wire
  ixgbe: Remove duplicate clear_bit() call
  Documentation: networking: device drivers: Remove stray asterisks
  e1000: fix memory leaks
  i40e: Fix receive buffer starvation for AF_XDP
  igb: Fix constant media auto sense switching when no cable is connected
  net: ethernet: arc: add the missed clk_disable_unprepare
  igb: Enable media autosense for the i350.
  igb/igc: Don't warn on fatal read failures when the device is removed
  tcp: increase tcp_max_syn_backlog max value
  net: increase SOMAXCONN to 4096
  netdevsim: Fix use-after-free during device dismantle
  ...
2019-11-01 17:48:11 -07:00
Björn Töpel
d817991cc7 xsk: Restructure/inline XSKMAP lookup/redirect/flush
In this commit the XSKMAP entry lookup function used by the XDP
redirect code is moved from the xskmap.c file to the xdp_sock.h
header, so the lookup can be inlined from, e.g., the
bpf_xdp_redirect_map() function.

Further the __xsk_map_redirect() and __xsk_map_flush() is moved to the
xsk.c, which lets the compiler inline the xsk_rcv() and xsk_flush()
functions.

Finally, all the XDP socket functions were moved from linux/bpf.h to
net/xdp_sock.h, where most of the XDP sockets functions are anyway.

This yields a ~2% performance boost for the xdpsock "rx_drop"
scenario.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191101110346.15004-4-bjorn.topel@gmail.com
2019-11-02 00:38:49 +01:00
Maciej Fijalkowski
e65650f291 bpf: Implement map_gen_lookup() callback for XSKMAP
Inline the xsk_map_lookup_elem() via implementing the map_gen_lookup()
callback. This results in emitting the bpf instructions in place of
bpf_map_lookup_elem() helper call and better performance of bpf
programs.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/20191101110346.15004-3-bjorn.topel@gmail.com
2019-11-02 00:38:49 +01:00
Björn Töpel
64fe8c061d xsk: Store struct xdp_sock as a flexible array member of the XSKMAP
Prior this commit, the array storing XDP socket instances were stored
in a separate allocated array of the XSKMAP. Now, we store the sockets
as a flexible array member in a similar fashion as the arraymap. Doing
so, we do less pointer chasing in the lookup.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/20191101110346.15004-2-bjorn.topel@gmail.com
2019-11-02 00:38:49 +01:00
Helge Deller
f973cce0e4 kexec: Fix pointer-to-int-cast warnings
Fix two pointer-to-int-cast warnings when compiling for the 32-bit parisc
platform:

kernel/kexec_file.c: In function ‘crash_prepare_elf64_headers’:
kernel/kexec_file.c:1307:19: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
  phdr->p_vaddr = (Elf64_Addr)_text;
                  ^
kernel/kexec_file.c:1324:19: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
  phdr->p_vaddr = (unsigned long long) __va(mstart);
                  ^

Signed-off-by: Helge Deller <deller@gmx.de>
2019-11-01 21:42:58 +01:00
Linus Torvalds
0dbe6cb8f7 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Fix two scheduler topology bugs/oversights on Juno r0 2+4 big.LITTLE
  systems"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/topology: Allow sched_asym_cpucapacity to be disabled
  sched/topology: Don't try to build empty sched domains
2019-11-01 11:49:54 -07:00
Petr Mladek
92c9abf5e5 livepatch: Allow to distinguish different version of system state changes
The atomic replace runs pre/post (un)install callbacks only from the new
livepatch. There are several reasons for this:

  + Simplicity: clear ordering of operations, no interactions between
	old and new callbacks.

  + Reliability: only new livepatch knows what changes can already be made
	by older livepatches and how to take over the state.

  + Testing: the atomic replace can be properly tested only when a newer
	livepatch is available. It might be too late to fix unwanted effect
	of callbacks from older	livepatches.

It might happen that an older change is not enough and the same system
state has to be modified another way. Different changes need to get
distinguished by a version number added to struct klp_state.

The version can also be used to prevent loading incompatible livepatches.
The check is done when the livepatch is enabled. The rules are:

  + Any completely new system state modification is allowed.

  + System state modifications with the same or higher version are allowed
    for already modified system states.

  + Cumulative livepatches must handle all system state modifications from
    already installed livepatches.

  + Non-cumulative livepatches are allowed to touch already modified
    system states.

Link: http://lkml.kernel.org/r/20191030154313.13263-4-pmladek@suse.com
To: Jiri Kosina <jikos@kernel.org>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: live-patching@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-11-01 13:08:19 +01:00
Petr Mladek
73727f4daf livepatch: Basic API to track system state changes
This is another step how to help maintaining more livepatches.

One big help was the atomic replace and cumulative livepatches. These
livepatches replace the already installed ones. Therefore it should
be enough when each cumulative livepatch is consistent.

The problems might come with shadow variables and callbacks. They might
change the system behavior or state so that it is no longer safe to
go back and use an older livepatch or the original kernel code. Also,
a new livepatch must be able to detect changes which were made by
the already installed livepatches.

This is where the livepatch system state tracking gets useful. It
allows to:

  - find whether a system state has already been modified by
    previous livepatches

  - store data needed to manipulate and restore the system state

The information about the manipulated system states is stored in an
array of struct klp_state. It can be searched by two new functions
klp_get_state() and klp_get_prev_state().

The dependencies are going to be solved by a version field added later.
The only important information is that it will be allowed to modify
the same state by more non-cumulative livepatches. It is similar
to allowing to modify the same function several times. The livepatch
author is responsible for preventing incompatible changes.

Link: http://lkml.kernel.org/r/20191030154313.13263-3-pmladek@suse.com
To: Jiri Kosina <jikos@kernel.org>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: live-patching@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-11-01 13:08:14 +01:00
Petr Mladek
7e35e4eb7e livepatch: Keep replaced patches until post_patch callback is called
Pre/post (un)patch callbacks might manipulate the system state. Cumulative
livepatches might need to take over the changes made by the replaced
ones. For this they might need to access some data stored or referenced
by the old livepatches.

Therefore the replaced livepatches have to stay around until post_patch()
callback is called. It is achieved by calling the free functions later.
It is the same location where disabled livepatches have already been
freed.

Link: http://lkml.kernel.org/r/20191030154313.13263-2-pmladek@suse.com
To: Jiri Kosina <jikos@kernel.org>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: live-patching@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-11-01 13:08:08 +01:00
Nicolas Saenz Julienne
8b5369ea58 dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
Some architectures, notably ARM, are interested in tweaking this
depending on their runtime DMA addressing limitations.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-11-01 09:41:18 +00:00
Björn Töpel
ff1c08e1f7 bpf: Change size to u64 for bpf_map_{area_alloc, charge_init}()
The functions bpf_map_area_alloc() and bpf_map_charge_init() prior
this commit passed the size parameter as size_t. In this commit this
is changed to u64.

All users of these functions avoid size_t overflows on 32-bit systems,
by explicitly using u64 when calculating the allocation size and
memory charge cost. However, since the result was narrowed by the
size_t when passing size and cost to the functions, the overflow
handling was in vain.

Instead of changing all call sites to size_t and handle overflow at
the call site, the parameter is changed to u64 and checked in the
functions above.

Fixes: d407bd25a2 ("bpf: don't trigger OOM killer under pressure with map alloc")
Fixes: c85d69135a ("bpf: move memory size checks to bpf_map_charge_init()")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Link: https://lore.kernel.org/bpf/20191029154307.23053-1-bjorn.topel@gmail.com
2019-10-31 21:41:33 +01:00
David Howells
f94df9890e Add wake_up_interruptible_sync_poll_locked()
Add a wakeup call for a case whereby the caller already has the waitqueue
spinlock held.  This can be used by pipes to alter the ring buffer indices
and issue a wakeup under the same spinlock.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-10-31 15:12:23 +00:00
Alexei Starovoitov
f1b9509c2f bpf: Replace prog_raw_tp+btf_id with prog_tracing
The bpf program type raw_tp together with 'expected_attach_type'
was the most appropriate api to indicate BTF-enabled raw_tp programs.
But during development it became apparent that 'expected_attach_type'
cannot be used and new 'attach_btf_id' field had to be introduced.
Which means that the information is duplicated in two fields where
one of them is ignored.
Clean it up by introducing new program type where both
'expected_attach_type' and 'attach_btf_id' fields have
specific meaning.
In the future 'expected_attach_type' will be extended
with other attach points that have similar semantics to raw_tp.
This patch is replacing BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with
prog_type = BPF_RPOG_TYPE_TRACING
expected_attach_type = BPF_TRACE_RAW_TP
attach_btf_id = btf_id of raw tracepoint inside the kernel
Future patches will add
expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT
where programs have the same input context and the same helpers,
but different attach points.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
2019-10-31 15:16:59 +01:00
Ingo Molnar
43e0ae7ae0 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM changes from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Dynamic tick (nohz) updates, perhaps most notably changes to
    force the tick on when needed due to lengthy in-kernel execution
    on CPUs on which RCU is waiting.

  - Replace rcu_swap_protected() with rcu_prepace_pointer().

  - Torture-test updates.

  - Linux-kernel memory consistency model updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-31 09:33:19 +01:00
Alexei Starovoitov
af91acbc62 bpf: Fix bpf jit kallsym access
Jiri reported crash when JIT is on, but net.core.bpf_jit_kallsyms is off.
bpf_prog_kallsyms_find() was skipping addr->bpf_prog resolution
logic in oops and stack traces. That's incorrect.
It should only skip addr->name resolution for 'cat /proc/kallsyms'.
That's what bpf_jit_kallsyms and bpf_jit_harden protect.

Fixes: 3dec541b2e ("bpf: Add support for BTF pointers to x86 JIT")
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191030233019.1187404-1-ast@kernel.org
2019-10-31 02:02:29 +01:00
Ilya Leoshkevich
7541c87c9b bpf: Allow narrow loads of bpf_sysctl fields with offset > 0
"ctx:file_pos sysctl:read read ok narrow" works on s390 by accident: it
reads the wrong byte, which happens to have the expected value of 0.
Improve the test by seeking to the 4th byte and expecting 4 instead of
0.

This makes the latent problem apparent: the test attempts to read the
first byte of bpf_sysctl.file_pos, assuming this is the least-significant
byte, which is not the case on big-endian machines: a non-zero offset is
needed.

The point of the test is to verify narrow loads, so we cannot cheat our
way out by simply using BPF_W. The existence of the test means that such
loads have to be supported, most likely because llvm can generate them.
Fix the test by adding a big-endian variant, which uses an offset to
access the least-significant byte of bpf_sysctl.file_pos.

This reveals the final problem: verifier rejects accesses to bpf_sysctl
fields with offset > 0. Such accesses are already allowed for a wide
range of structs: __sk_buff, bpf_sock_addr and sk_msg_md to name a few.
Extend this support to bpf_sysctl by using bpf_ctx_range instead of
offsetof when matching field offsets.

Fixes: 7b146cebe3 ("bpf: Sysctl hook")
Fixes: e1550bfe0d ("bpf: Add file_pos field to bpf_sysctl ctx")
Fixes: 9a1027e525 ("selftests/bpf: Test file_pos field in bpf_sysctl ctx")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191028122902.9763-1-iii@linux.ibm.com
2019-10-30 12:49:13 -07:00
Shyam Saini
ca66536845 kernel: dma-contiguous: mark CMA parameters __initdata/__initconst
These parameters are only referenced by __init routine calls during
early boot so they should be marked as __initdata and __initconst
accordingly.

Signed-off-by: Shyam Saini <mayhs11saini@gmail.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-30 11:10:45 -07:00
Eric Dumazet
9ff6aa027d dma-debug: add a schedule point in debug_dma_dump_mappings()
debug_dma_dump_mappings() can take a lot of cpu cycles :

lpk43:/# time wc -l /sys/kernel/debug/dma-api/dump
163435 /sys/kernel/debug/dma-api/dump

real	0m0.463s
user	0m0.003s
sys	0m0.459s

Let's add a cond_resched() to avoid holding cpu for too long.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Corentin Labbe <clabbe@baylibre.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-30 11:09:35 -07:00
Vladimir Murzin
a445e940ea dma-mapping: fix handling of dma-ranges for reserved memory (again)
Daniele reported that issue previously fixed in c41f9ea998
("drivers: dma-coherent: Account dma_pfn_offset when used with device
tree") reappear shortly after 43fc509c3e ("dma-coherent: introduce
interface for default DMA pool") where fix was accidentally dropped.

Lets put fix back in place and respect dma-ranges for reserved memory.

Fixes: 43fc509c3e ("dma-coherent: introduce interface for default DMA pool")

Reported-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com>
Tested-by: Daniele Alessandrelli <daniele.alessandrelli@gmail.com>
Tested-by: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-30 11:07:35 -07:00
Paul E. McKenney
8dcdfb7096 Merge branches 'doc.2019.10.29a', 'fixes.2019.10.30a', 'nohz.2019.10.28a', 'replace.2019.10.30a', 'torture.2019.10.05a' and 'lkmm.2019.10.05a' into HEAD
doc.2019.10.29a: RCU documentation updates.
fixes.2019.10.30a: RCU miscellaneous fixes.
nohz.2019.10.28a: RCU NO_HZ and NO_HZ_FULL updates.
replace.2019.10.30a: Replace rcu_swap_protected() with rcu_replace().
torture.2019.10.05a: RCU torture-test updates.

lkmm.2019.10.05a: Linux kernel memory model updates.
2019-10-30 08:47:13 -07:00
Paul E. McKenney
6092f7263f bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
This commit replaces the use of rcu_swap_protected() with the more
intuitively appealing rcu_replace_pointer() as a step towards removing
rcu_swap_protected().

Link: https://lore.kernel.org/lkml/CAHk-=wiAsJLw1egFEE=Z7-GGtM6wcvtyytXZA1+BHqta4gg6Hw@mail.gmail.com/
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
[ paulmck: From rcu_replace() to rcu_replace_pointer() per Ingo Molnar. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: <netdev@vger.kernel.org>
Cc: <bpf@vger.kernel.org>
2019-10-30 08:45:14 -07:00
Paul E. McKenney
36b5dae645 rcu: Suppress levelspread uninitialized messages
New tools bring new warnings, and with v5.3 comes:

kernel/rcu/srcutree.c: warning: 'levelspread[<U aa0>]' may be used uninitialized in this function [-Wuninitialized]:  => 121:34

This commit suppresses this warning by initializing the full array
to INT_MIN, which will result in failures should any out-of-bounds
references appear.

Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
2019-10-30 08:34:53 -07:00
Dan Carpenter
b8889c9c89 rcu: Fix uninitialized variable in nocb_gp_wait()
We never set this to false.  This probably doesn't affect most people's
runtime because GCC will automatically initialize it to false at certain
common optimization levels.  But that behavior is related to a bug in
GCC and obviously should not be relied on.

Fixes: 5d6742b377 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30 08:34:53 -07:00
Joel Fernandes (Google)
05ef9e9eb3 rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
The RCU-specific resched_cpu() function sends a resched IPI to the
specified CPU, which can be used to force the tick on for a given
nohz_full CPU.  This is needed when this nohz_full CPU is looping in the
kernel while blocking the current grace period.  However, for the tick
to actually be forced on in all cases, that CPU's rcu_data structure's
->rcu_urgent_qs flag must be set beforehand.  This commit therefore
causes rcu_implicit_dynticks_qs() to set this flag prior to invoking
resched_cpu() on a holdout nohz_full CPU.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30 08:34:35 -07:00
Joel Fernandes (Google)
5a6446626d workqueue: Convert for_each_wq to use built-in list check
Because list_for_each_entry_rcu() can now check for holding a
lock as well as for being in an RCU read-side critical section,
this commit replaces the workqueue_sysfs_unregister() function's
use of assert_rcu_or_wq_mutex() and list_for_each_entry_rcu() with
list_for_each_entry_rcu() augmented with a lockdep_is_held() optional
argument.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30 08:34:10 -07:00
kbuild test robot
1d24dd4e01 rcu: Several rcu_segcblist functions can be static
None of rcu_segcblist_set_len(), rcu_segcblist_add_len(), or
rcu_segcblist_xchg_len() are used outside of kernel/rcu/rcu_segcblist.c.
This commit therefore makes them static.

Fixes: eda669a6a2 ("rcu/nocb: Atomic ->len field in rcu_segcblist structure")
Signed-off-by: kbuild test robot <lkp@intel.com>
[ paulmck: "Fixes:" updated per Stephen Rothwell feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-30 08:33:22 -07:00
Alexei Starovoitov
15ab09bdca bpf: Enforce 'return 0' in BTF-enabled raw_tp programs
The return value of raw_tp programs is ignored by __bpf_trace_run()
that calls them. The verifier also allows any value to be returned.
For BTF-enabled raw_tp lets enforce 'return 0', so that return value
can be used for something in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191029032426.1206762-1-ast@kernel.org
2019-10-30 16:22:55 +01:00
Jens Axboe
771b53d033 io-wq: small threadpool implementation for io_uring
This adds support for io-wq, a smaller and specialized thread pool
implementation. This is meant to replace workqueues for io_uring. Among
the reasons for this addition are:

- We can assign memory context smarter and more persistently if we
  manage the life time of threads.

- We can drop various work-arounds we have in io_uring, like the
  async_list.

- We can implement hashed work insertion, to manage concurrency of
  buffered writes without needing a) an extra workqueue, or b)
  needlessly making the concurrency of said workqueue very low
  which hurts performance of multiple buffered file writers.

- We can implement cancel through signals, for cancelling
  interruptible work like read/write (or send/recv) to/from sockets.

- We need the above cancel for being able to assign and use file tables
  from a process.

- We can implement a more thorough cancel operation in general.

- We need it to move towards a syslet/threadlet model for even faster
  async execution. For that we need to take ownership of the used
  threads.

This list is just off the top of my head. Performance should be the
same, or better, at least that's what I've seen in my testing. io-wq
supports basic NUMA functionality, setting up a pool per node.

io-wq hooks up to the scheduler schedule in/out just like workqueue
and uses that to drive the need for more/less workers.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29 12:43:00 -06:00
Davidlohr Bueso
a0855d24fc locking/mutex: Complain upon mutex API misuse in IRQ contexts
Add warning checks if mutex_trylock() or mutex_unlock() are used in
IRQ contexts, under CONFIG_DEBUG_MUTEXES=y.

While the mutex rules and semantics are explicitly documented, this allows
to expose any abusers and robustifies the whole thing.

While trylock and unlock are non-blocking, calling from IRQ context
is still forbidden (lock must be within the same context as unlock).

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: https://lkml.kernel.org/r/20191025033634.3330-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 12:22:52 +01:00
Davidlohr Bueso
751459043c futex: Drop leftover wake_q_add() comment
Since the original comment, we have moved to do the task
reference counting explicitly along with wake_q_add_safe().
Drop the now incorrect comment.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: https://lkml.kernel.org/r/20191023033450.6445-1-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 12:22:52 +01:00
Frederic Weisbecker
64eea63c19 sched/kcpustat: Introduce vtime-aware kcpustat accessor for CPUTIME_SYSTEM
Kcpustat is not correctly supported on nohz_full CPUs. The tick doesn't
fire and the cputime therefore doesn't move forward. The issue has shown
up after the vanishing of the remaining 1Hz which has made the stall
visible.

We are solving that with checking the task running on a CPU through RCU
and reading its vtime delta that we add to the raw kcpustat values.

We make sure that we fetch a coherent raw-kcpustat/vtime-delta couple
sequence while checking that the CPU referred by the target vtime is the
correct one, under the locked vtime seqcount.

Only CPUTIME_SYSTEM is handled here as a start because it's the trivial
case. User and guest time will require more preparation work to
correctly handle niceness.

Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Link: https://lkml.kernel.org/r/20191025020303.19342-1-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:17 +01:00
Frederic Weisbecker
e44fcb4b7a sched/vtime: Rename vtime_accounting_cpu_enabled() to vtime_accounting_enabled_this_cpu()
Standardize the naming on top of the vtime_accounting_enabled_*() base.
Also make it clear we are checking the vtime state of the
*current* CPU with this function. We'll need to add an API to check that
state on remote CPUs as well, so we must disambiguate the naming.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191016025700.31277-9-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:14 +01:00
Frederic Weisbecker
74c578759f context_tracking: Rename context_tracking_is_enabled() => context_tracking_enabled()
Remove the superfluous "is" in the middle of the name. We want to
standardize the naming so that it can be expanded through suffixes:

	context_tracking_enabled()
	context_tracking_enabled_cpu()
	context_tracking_enabled_this_cpu()

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191016025700.31277-6-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:12 +01:00
Frederic Weisbecker
e6d5bf3e32 sched/cputime: Add vtime guest task state
Record guest as a VTIME state instead of guessing it from VTIME_SYS and
PF_VCPU. This is going to simplify the cputime read side especially as
its state machine is going to further expand in order to fully support
kcpustat on nohz_full.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191016025700.31277-4-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:11 +01:00
Frederic Weisbecker
14faf6fcac sched/cputime: Add vtime idle task state
Record idle as a VTIME state instead of guessing it from VTIME_SYS and
is_idle_task(). This is going to simplify the cputime read side
especially as its state machine is going to further expand in order to
fully support kcpustat on nohz_full.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191016025700.31277-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:10 +01:00
Frederic Weisbecker
802f4a827f sched/vtime: Record CPU under seqcount for kcpustat needs
In order to compute the kcpustat delta on a nohz CPU, we'll need to
fetch the task running on that target. Checking that its vtime
state snapshot actually refers to the relevant target involves recording
that CPU under the seqcount locked on task switch.

This is a step toward making kcpustat moving forward on full nohz CPUs.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191016025700.31277-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:08 +01:00
Patrick Bellasi
b8c9636140 sched/fair/util_est: Implement faster ramp-up EWMA on utilization increases
The estimated utilization for a task:

   util_est = max(util_avg, est.enqueue, est.ewma)

is defined based on:

 - util_avg: the PELT defined utilization
 - est.enqueued: the util_avg at the end of the last activation
 - est.ewma:     a exponential moving average on the est.enqueued samples

According to this definition, when a task suddenly changes its bandwidth
requirements from small to big, the EWMA will need to collect multiple
samples before converging up to track the new big utilization.

This slow convergence towards bigger utilization values is not
aligned to the default scheduler behavior, which is to optimize for
performance. Moreover, the est.ewma component fails to compensate for
temporarely utilization drops which spans just few est.enqueued samples.

To let util_est do a better job in the scenario depicted above, change
its definition by making util_est directly follow upward motion and
only decay the est.ewma on downward.

Signed-off-by: Patrick Bellasi <patrick.bellasi@matbug.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Douglas Raillard <douglas.raillard@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <qperret@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191023205630.14469-1-patrick.bellasi@matbug.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 10:01:07 +01:00
Valentin Schneider
e284df705c sched/topology: Allow sched_asym_cpucapacity to be disabled
While the static key is correctly initialized as being disabled, it will
remain forever enabled once turned on. This means that if we start with an
asymmetric system and hotplug out enough CPUs to end up with an SMP system,
the static key will remain set - which is obviously wrong. We should detect
this and turn off things like misfit migration and capacity aware wakeups.

As Quentin pointed out, having separate root domains makes this slightly
trickier. We could have exclusive cpusets that create an SMP island - IOW,
the domains within this root domain will not see any asymmetry. This means
we can't just disable the key on domain destruction, we need to count how
many asymmetric root domains we have.

Consider the following example using Juno r0 which is 2+4 big.LITTLE, where
two identical cpusets are created: they both span both big and LITTLE CPUs:

    asym0    asym1
  [       ][       ]
   L  L  B  L  L  B

  $ cgcreate -g cpuset:asym0
  $ cgset -r cpuset.cpus=0,1,3 asym0
  $ cgset -r cpuset.mems=0 asym0
  $ cgset -r cpuset.cpu_exclusive=1 asym0

  $ cgcreate -g cpuset:asym1
  $ cgset -r cpuset.cpus=2,4,5 asym1
  $ cgset -r cpuset.mems=0 asym1
  $ cgset -r cpuset.cpu_exclusive=1 asym1

  $ cgset -r cpuset.sched_load_balance=0 .

(the CPU numbering may look odd because on the Juno LITTLEs are CPUs 0,3-5
and bigs are CPUs 1-2)

If we make one of those SMP (IOW remove asymmetry) by e.g. hotplugging its
big core, we would end up with an SMP cpuset and an asymmetric cpuset - the
static key must remain set, because we still have one asymmetric root domain.

With the above example, this could be done with:

  $ echo 0 > /sys/devices/system/cpu/cpu2/online

Which would result in:

    asym0   asym1
  [       ][    ]
   L  L  B  L  L

When both SMP and asymmetric cpusets are present, all CPUs will observe
sched_asym_cpucapacity being set (it is system-wide), but not all CPUs
observe asymmetry in their sched domain hierarchy:

  per_cpu(sd_asym_cpucapacity, <any CPU in asym0>) == <some SD at DIE level>
  per_cpu(sd_asym_cpucapacity, <any CPU in asym1>) == NULL

Change the simple key enablement to an increment, and decrement the key
counter when destroying domains that cover asymmetric CPUs.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hannes@cmpxchg.org
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: qperret@google.com
Cc: tj@kernel.org
Cc: vincent.guittot@linaro.org
Fixes: df054e8445 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
Link: https://lkml.kernel.org/r/20191023153745.19515-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 09:58:46 +01:00
Valentin Schneider
cd1cb33505 sched/topology: Don't try to build empty sched domains
Turns out hotplugging CPUs that are in exclusive cpusets can lead to the
cpuset code feeding empty cpumasks to the sched domain rebuild machinery.

This leads to the following splat:

    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23
    Hardware name: ARM Juno development board (r0) (DT)
    Workqueue: events cpuset_hotplug_workfn
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    lr : build_sched_domains (kernel/sched/topology.c:1966)
    Call trace:
    build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    partition_sched_domains_locked (kernel/sched/topology.c:2250)
    rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
    rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
    cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
    process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
    worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:255)
    ret_from_fork (arch/arm64/kernel/entry.S:1167)
    Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)

The faulty line in question is:

  cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));

and we're not checking the return value against nr_cpu_ids (we shouldn't
have to!), which leads to the above.

Prevent generate_sched_domains() from returning empty cpumasks, and add
some assertion in build_sched_domains() to scream bloody murder if it
happens again.

The above splat was obtained on my Juno r0 with the following reproducer:

  $ cgcreate -g cpuset:asym
  $ cgset -r cpuset.cpus=0-3 asym
  $ cgset -r cpuset.mems=0 asym
  $ cgset -r cpuset.cpu_exclusive=1 asym

  $ cgcreate -g cpuset:smp
  $ cgset -r cpuset.cpus=4-5 smp
  $ cgset -r cpuset.mems=0 smp
  $ cgset -r cpuset.cpu_exclusive=1 smp

  $ cgset -r cpuset.sched_load_balance=0 .

  $ echo 0 > /sys/devices/system/cpu/cpu4/online
  $ echo 0 > /sys/devices/system/cpu/cpu5/online

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hannes@cmpxchg.org
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: qperret@google.com
Cc: tj@kernel.org
Cc: vincent.guittot@linaro.org
Fixes: 05484e0984 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29 09:58:45 +01:00
Masahiro Yamada
c3a6cf19e6 export: avoid code duplication in include/linux/export.h
include/linux/export.h has lots of code duplication between
EXPORT_SYMBOL and EXPORT_SYMBOL_NS.

To improve the maintainability and readability, unify the
implementation.

When the symbol has no namespace, pass the empty string "" to
the 'ns' parameter.

The drawback of this change is, it grows the code size.
When the symbol has no namespace, sym->namespace was previously
NULL, but it is now an empty string "". So, it increases 1 byte
for every no namespace EXPORT_SYMBOL.

A typical kernel configuration has 10K exported symbols, so it
increases 10KB in rough estimation.

I did not come up with a good idea to refactor it without increasing
the code size.

I am not sure how big a deal it is, but at least include/linux/export.h
looks nicer.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
[maennich: rebase on top of 3 fixes for the namespace feature]
Signed-off-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-10-28 16:38:26 +01:00
Paul E. McKenney
dd7dafd1ad rcu: Make kernel-mode nohz_full CPUs invoke the RCU core processing
If a nohz_full CPU is idle or executing in userspace, it makes good sense
to keep it out of RCU core processing.  After all, the RCU grace-period
kthread can see its quiescent states and all of its callbacks are
offloaded, so there is nothing for RCU core processing to do.

However, if a nohz_full CPU is executing in kernel space, the RCU
grace-period kthread cannot do anything for it, so such a CPU must report
its own quiescent states.  This commit therefore makes nohz_full CPUs
skip RCU core processing only if the scheduler-clock interrupt caught
them in idle or in userspace.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28 07:02:21 -07:00
Paul E. McKenney
ed93dfc6bc rcu: Confine ->core_needs_qs accesses to the corresponding CPU
Commit 671a63517c ("rcu: Avoid unnecessary softirq when system
is idle") fixed a bug that could result in an indefinite number of
unnecessary invocations of the RCU_SOFTIRQ handler at the trailing edge
of a scheduler-clock interrupt.  However, the fix introduced off-CPU
stores to ->core_needs_qs.  These writes did not conflict with the
on-CPU stores because the CPU's leaf rcu_node structure's ->lock was
held across all such stores.  However, the loads from ->core_needs_qs
were not promoted to READ_ONCE() and, worse yet, the code loading from
->core_needs_qs was written assuming that it was only ever updated by
the corresponding CPU.  So operation has been robust, but only by luck.
This situation is therefore an accident waiting to happen.

This commit therefore takes a different approach.  Instead of clearing
->core_needs_qs from the grace-period kthread's force-quiescent-state
processing, it modifies the rcu_pending() function to suppress the
rcu_sched_clock_irq() function's call to invoke_rcu_core() if there is no
grace period in progress.  This avoids the infinite needless RCU_SOFTIRQ
handlers while still keeping all accesses to ->core_needs_qs local to
the corresponding CPU.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28 07:02:21 -07:00
Joel Fernandes (Google)
516e5ae0c9 rcu: Reset CPU hints when reporting a quiescent state
In some cases, tracing shows that need_heavy_qs is still set even though
urgent_qs was cleared upon reporting of a quiescent state.  One such
case is when the softirq reports that a CPU has passed quiescent state.

Commit 671a63517c ("rcu: Avoid unnecessary softirq when system is
idle") fixed a bug where core_needs_qs was not being cleared.  In order
to avoid running into similar situations with the urgent-grace-period
flags, this commit causes rcu_disable_urgency_upon_qs(), previously
rcu_disable_tick_upon_qs(), to clear the urgency hints, ->rcu_urgent_qs
and ->rcu_need_heavy_qs.  Note that it is possible for CPUs to go
offline with these urgency hints still set.  This is handled because
rcu_disable_urgency_upon_qs() is also invoked during the online process.

Because these hints can be cleared both by the corresponding CPU and by
the grace-period kthread, this commit also adds a number of READ_ONCE()
and WRITE_ONCE() calls.

Tested overnight with rcutorture running for 60 minutes on all
configurations of RCU.

Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
[ paulmck: Clear urgency flags in rcu_disable_urgency_upon_qs(). ]
[ paulmck: Remove ->core_needs_qs from the set cleared at quiescent state. ]
[ paulmck: Make rcu_disable_urgency_upon_qs static per kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28 07:02:21 -07:00
Paul E. McKenney
b200a04895 rcu: Force nohz_full tick on upon irq enter instead of exit
There is interrupt-exit code that forces on the tick for nohz_full CPUs
failing to respond to the current grace period in a timely fashion.
However, this code must compare ->dynticks_nmi_nesting to the value 2
in the interrupt-exit fastpath.  This commit therefore moves this code
to the interrupt-entry fastpath, where a lighter-weight comparison to
zero may be used.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28 07:02:21 -07:00
Paul E. McKenney
66e4c33b51 rcu: Force tick on for nohz_full CPUs not reaching quiescent states
CPUs running for long time periods in the kernel in nohz_full mode
might leave the scheduling-clock interrupt disabled for then full
duration of their in-kernel execution.  This can (among other things)
delay grace periods.  This commit therefore forces the tick back on
for any nohz_full CPU that is failing to pass through a quiescent state
upon return from interrupt, which the resched_cpu() will induce.

Reported-by: Joel Fernandes <joel@joelfernandes.org>
[ paulmck: Clear ->rcu_forced_tick as reported by Joel Fernandes testing. ]
[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-28 07:02:21 -07:00
Daniel Thompson
c58ff64376 kdb: Tweak escape handling for vi users
Currently if sequences such as "\ehelp\r" are delivered to the console then
the h gets eaten by the escape handling code. Since pressing escape
becomes something of a nervous twitch for vi users (and that escape doesn't
have much effect at a shell prompt) it is more helpful to emit the 'h' than
the '\e'.

We don't simply choose to emit the final character for all escape sequences
since that will do odd things for unsupported escape sequences (in
other words we retain the existing behaviour once we see '\e[').

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191025073328.643-6-daniel.thompson@linaro.org
2019-10-28 12:08:29 +00:00
Daniel Thompson
cdca8d8900 kdb: Improve handling of characters from different input sources
Currently if an escape timer is interrupted by a character from a
different input source then the new character is discarded and the
function returns '\e' (which will be discarded by the level above).
It is hard to see why this would ever be the desired behaviour.
Fix this to return the new character rather than the '\e'.

This is a bigger refactor than might be expected because the new
character needs to go through escape sequence detection.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191025073328.643-5-daniel.thompson@linaro.org
2019-10-28 12:08:10 +00:00
Daniel Thompson
4f27e824bf kdb: Remove special case logic from kdb_read()
kdb_read() contains special case logic to force it exit after reading
a single character. We can remove all the special case logic by directly
calling the function to read a single character instead. This also
allows us to tidy up the function prototype which, because it now matches
getchar(), we can also rename in order to make its role clearer.

This does involve some extra code to handle btaprompt properly but we
don't mind the new lines of code here because the old code had some
interesting problems (bad newline handling, treating unexpected
characters like <cr>).

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191025073328.643-4-daniel.thompson@linaro.org
2019-10-28 12:07:57 +00:00
Daniel Thompson
d04213af90 kdb: Simplify code to fetch characters from console
Currently kdb_read_get_key() contains complex control flow that, on
close inspection, turns out to be unnecessary. In particular:

1. It is impossible to enter the branch conditioned on (escape_delay == 1)
   except when the loop enters with (escape_delay == 2) allowing us to
   combine the branches.

2. Most of the code conditioned on (escape_delay == 2) simply modifies
   local data and then breaks out of the loop causing the function to
   return escape_data[0].

3. Based on #2 there is not actually any need to ever explicitly set
   escape_delay to 2 because we it is much simpler to directly return
   escape_data[0] instead.

4. escape_data[0] is, for all but one exit path, known to be '\e'.

Simplify the code based on these observations.

There is a subtle (and harmless) change of behaviour resulting from this
simplification: instead of letting the escape timeout after ~1998
milliseconds we now timeout after ~2000 milliseconds

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191025073328.643-3-daniel.thompson@linaro.org
2019-10-28 12:02:21 +00:00
Daniel Thompson
53b63136e8 kdb: Tidy up code to handle escape sequences
kdb_read_get_key() has extremely complex break/continue control flow
managed by state variables and is very hard to review or modify. In
particular the way the escape sequence handling interacts with the
general control flow is hard to follow. Separate out the escape key
handling, without changing the control flow. This makes the main body of
the code easier to review.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20191025073328.643-2-daniel.thompson@linaro.org
2019-10-28 12:02:11 +00:00
Liang, Kan
d44f821b0e perf/core: Optimize perf_init_event() for TYPE_SOFTWARE
Andi reported that he was hitting the linear search in
perf_init_event() a lot. Now that all !TYPE_SOFTWARE events should hit
the IDR, make sure the TYPE_SOFTWARE events are at the head of the
list such that we'll quickly find the right PMU (provided a valid
event was given).

Signed-off-by: Liang, Kan <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:53:28 +01:00
Peter Zijlstra
66d258c5b0 perf/core: Optimize perf_init_event()
Andi reported that he was hitting the linear search in
perf_init_event() a lot. Make more agressive use of the IDR lookup to
avoid hitting the linear search.

With exception of PERF_TYPE_SOFTWARE (which relies on a hideous hack),
we can put everything in the IDR. On top of that, we can alias
TYPE_HARDWARE and TYPE_HW_CACHE to TYPE_RAW on the lookup side.

This greatly reduces the chances of hitting the linear search.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:51:02 +01:00
Peter Zijlstra
db0503e4f6 perf/core: Optimize perf_install_in_event()
Andi reported that when creating a lot of events, a lot of time is
spent in IPIs and asked if it would be possible to elide some of that.

Now when, as for example the perf-tool always does, events are created
disabled, then these events will not need to be scheduled when added
to the context (they're still disable) and therefore the IPI is not
required -- except for the very first event, that will need to set
ctx->is_active.

( It might be possible to set ctx->is_active remotely for cpu_ctx, but
  we really need the IPI for task_ctx, so lets not make that
  distinction. )

Also use __perf_effective_state() since group events depend on the
state of the leader, if the leader is OFF, the whole group is OFF.

So when sibling events are created enabled (XXX check tool) then we
only need a single IPI to create and enable the whole group (+ that
initial IPI to initialize the context).

Suggested-by: Andi Kleen <andi@firstfloor.org>
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:51:02 +01:00
Alexey Budankov
c2b98a8661 perf/x86: Synchronize PMU task contexts on optimized context switches
Install Intel specific PMU task context synchronization adapter and
extend optimized context switch path with PMU specific task context
synchronization to fix LBR callstack virtualization on context switches.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/9c6445a9-bdba-ef03-3859-f1f91198f27a@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:51:01 +01:00
Ingo Molnar
65133033ee Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 12:38:26 +01:00
Alexander Shishkin
8c7e975667 perf/core: Start rejecting the syscall with attr.__reserved_2 set
Commit:

  1a59413124 ("perf: Add wakeup watermark control to the AUX area")

added attr.__reserved_2 padding, but forgot to add an ABI check to reject
attributes with this field set. Fix that.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025121636.75182-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28 11:01:59 +01:00
Linus Torvalds
2b776b54bc Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A small set of fixes for time(keeping):

   - Add a missing include to prevent compiler warnings.

   - Make the VDSO implementation of clock_getres() POSIX compliant
     again. A recent change dropped the NULL pointer guard which is
     required as NULL is a valid pointer value for this function.

   - Fix two function documentation typos"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Fix two trivial comments
  timers/sched_clock: Include local timekeeping.h for missing declarations
  lib/vdso: Make clock_getres() POSIX compliant again
2019-10-27 07:04:22 -04:00
Linus Torvalds
a8a31fdcca Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A set of perf fixes:

  kernel:

   - Unbreak the tracking of auxiliary buffer allocations which got
     imbalanced causing recource limit failures.

   - Fix the fallout of splitting of ToPA entries which missed to shift
     the base entry PA correctly.

   - Use the correct context to lookup the AUX event when unmapping the
     associated AUX buffer so the event can be stopped and the buffer
     reference dropped.

  tools:

   - Fix buildiid-cache mode setting in copyfile_mode_ns() when copying
     /proc/kcore

   - Fix freeing id arrays in the event list so the correct event is
     closed.

   - Sync sched.h anc kvm.h headers with the kernel sources.

   - Link jvmti against tools/lib/ctype.o to have weak strlcpy().

   - Fix multiple memory and file descriptor leaks, found by coverity in
     perf annotate.

   - Fix leaks in error handling paths in 'perf c2c', 'perf kmem', found
     by a static analysis tool"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix AUX output stopping
  perf/aux: Fix tracking of auxiliary trace buffer allocation
  perf/x86/intel/pt: Fix base for single entry topa
  perf kmem: Fix memory leak in compact_gfp_flags()
  tools headers UAPI: Sync sched.h with the kernel
  tools headers kvm: Sync kvm.h headers with the kernel sources
  tools headers kvm: Sync kvm headers with the kernel sources
  tools headers kvm: Sync kvm headers with the kernel sources
  perf c2c: Fix memory leak in build_cl_output()
  perf tools: Fix mode setting in copyfile_mode_ns()
  perf annotate: Fix multiple memory and file descriptor leaks
  perf tools: Fix resource leak of closedir() on the error paths
  perf evlist: Fix fix for freed id arrays
  perf jvmti: Link against tools/lib/ctype.h to have weak strlcpy()
2019-10-27 06:59:34 -04:00
David S. Miller
5b7fe93db0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-10-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 52 non-merge commits during the last 11 day(s) which contain
a total of 65 files changed, 2604 insertions(+), 1100 deletions(-).

The main changes are:

 1) Revolutionize BPF tracing by using in-kernel BTF to type check BPF
    assembly code. The work here teaches BPF verifier to recognize
    kfree_skb()'s first argument as 'struct sk_buff *' in tracepoints
    such that verifier allows direct use of bpf_skb_event_output() helper
    used in tc BPF et al (w/o probing memory access) that dumps skb data
    into perf ring buffer. Also add direct loads to probe memory in order
    to speed up/replace bpf_probe_read() calls, from Alexei Starovoitov.

 2) Big batch of changes to improve libbpf and BPF kselftests. Besides
    others: generalization of libbpf's CO-RE relocation support to now
    also include field existence relocations, revamp the BPF kselftest
    Makefile to add test runner concept allowing to exercise various
    ways to build BPF programs, and teach bpf_object__open() and friends
    to automatically derive BPF program type/expected attach type from
    section names to ease their use, from Andrii Nakryiko.

 3) Fix deadlock in stackmap's build-id lookup on rq_lock(), from Song Liu.

 4) Allow to read BTF as raw data from bpftool. Most notable use case
    is to dump /sys/kernel/btf/vmlinux through this, from Jiri Olsa.

 5) Use bpf_redirect_map() helper in libbpf's AF_XDP helper prog which
    manages to improve "rx_drop" performance by ~4%., from Björn Töpel.

 6) Fix to restore the flow dissector after reattach BPF test and also
    fix error handling in bpf_helper_defs.h generation, from Jakub Sitnicki.

 7) Improve verifier's BTF ctx access for use outside of raw_tp, from
    Martin KaFai Lau.

 8) Improve documentation for AF_XDP with new sections and to reflect
    latest features, from Magnus Karlsson.

 9) Add back 'version' section parsing to libbpf for old kernels, from
    John Fastabend.

10) Fix strncat bounds error in libbpf's libbpf_prog_type_by_name(),
    from KP Singh.

11) Turn on -mattr=+alu32 in LLVM by default for BPF kselftests in order
    to improve insn coverage for built BPF progs, from Yonghong Song.

12) Misc minor cleanups and fixes, from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26 22:57:27 -07:00
David S. Miller
1a51a47491 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-10-27

The following pull-request contains BPF updates for your *net* tree.

We've added 7 non-merge commits during the last 11 day(s) which contain
a total of 7 files changed, 66 insertions(+), 16 deletions(-).

The main changes are:

1) Fix two use-after-free bugs in relation to RCU in jited symbol exposure to
   kallsyms, from Daniel Borkmann.

2) Fix NULL pointer dereference in AF_XDP rx-only sockets, from Magnus Karlsson.

3) Fix hang in netdev unregister for hash based devmap as well as another overflow
   bug on 32 bit archs in memlock cost calculation, from Toke Høiland-Jørgensen.

4) Fix wrong memory access in LWT BPF programs on reroute due to invalid dst.
   Also fix BPF selftests to use more compatible nc options, from Jiri Benc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26 18:30:55 -07:00
Yunfeng Ye
c34c78dfc1 audit: remove redundant condition check in kauditd_thread()
Warning is found by the code analysis tool:
  "the condition 'if(ac && rc < 0)' is redundant: ac"

The @ac variable has been checked before. It can't be a null pointer
here, so remove the redundant condition check.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-10-25 11:48:14 -04:00
Tejun Heo
5153faac18 cgroup: remove cgroup_enable_task_cg_lists() optimization
cgroup_enable_task_cg_lists() is used to lazyily initialize task
cgroup associations on the first use to reduce fork / exit overheads
on systems which don't use cgroup.  Unfortunately, locking around it
has never been actually correct and its value is dubious given how the
vast majority of systems use cgroup right away from boot.

This patch removes the optimization.  For now, replace the cg_list
based branches with WARN_ON_ONCE()'s to be on the safe side.  We can
simplify the logic further in the future.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-25 05:56:28 -07:00
Martin KaFai Lau
3820729160 bpf: Prepare btf_ctx_access for non raw_tp use case
This patch makes a few changes to btf_ctx_access() to prepare
it for non raw_tp use case where the attach_btf_id is not
necessary a BTF_KIND_TYPEDEF.

It moves the "btf_trace_" prefix check and typedef-follow logic to a new
function "check_attach_btf_id()" which is called only once during
bpf_check().  btf_ctx_access() only operates on a BTF_KIND_FUNC_PROTO
type now. That should also be more efficient since it is done only
one instead of every-time check_ctx_access() is called.

"check_attach_btf_id()" needs to find the func_proto type from
the attach_btf_id.  It needs to store the result into the
newly added prog->aux->attach_func_proto.  func_proto
btf type has no name, so a proper name should be stored into
"attach_func_name" also.

v2:
- Move the "btf_trace_" check to an earlier verifier phase (Alexei)

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191025001811.1718491-1-kafai@fb.com
2019-10-24 18:41:08 -07:00
Linus Torvalds
5fa2845fd7 Power management fixes for 5.4-rc5
- Using device PM QoS of CPU devices for managing frequency limits
    in cpufreq does not work, so introduce frequency QoS (based on the
    original low-level PM QoS) for this purpose, switch cpufreq and
    related code over to using it and fix a race involving deferred
    updates of frequency limits on top of that (Rafael Wysocki, Sudeep
    Holla).
 
  - Avoid calling regulator_enable()/disable() from the OPP framework
    to avoid side-effects on boot-enabled regulators that may change their
    initial voltage due to performing initial voltage balancing without
    all restrictions from the consumers (Marek Szyprowski).
 
  - Avoid a kref management issue in the OPP library code and drop an
    incorrectly added lockdep_assert_held() from it (Viresh Kumar).
 
  - Make the recently added haltpoll cpuidle driver take the 'idle='
    override into account as appropriate (Zhenzhong Duan).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl2x2nUSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxB+wP/iwugVCQKNGaOiPRLz3SlL/hchZ7Nhye
 qqetgC5dP/ZUkLrvJnt31wTgpF3VA9AQzuydtSEc76sf/hwaVhmyKhUC8TqqYYx1
 Vhu6EE2tEtFyBNNl0ZPJS8+q7AjhSrDSV8JyFBafOfxRJV2za2Q+Fl0P4Y6gzcez
 mYC1Ol7Of9c8L9J2p33ViNh31AQR8LnqV8YhIZRG5F4+jklJIlEt8QL0C25VQYAX
 9TLhLrXktPRPJytX2eOU8vHOmVAz5qKvU3S6mAPhVPpnEcZL9E56H1sc3xCpX0ww
 V8pnteMOxa3RUySTlGwnzKRwZQ7eSQjsJO9s/YWT+EGzt58Gr0q2WnEiZ8Vm38Hg
 /eOWsyxZiS0JYGc9+Gg/Hh2XFmMxX5QEaKlyIyuJbXvwGGm8gg02MNziXa1FmUd+
 nuklYhN8MJjH1iV84V3c0M8sdrut7LOlOotzyDniZWj7L57Ht+MWNSbpDoebPdGb
 mhOYv/wjXcek0m8mHtlT6w/5o3qDYEkAsosacH6Eko/11tHCfnCGT/no/3ILf3f+
 bBCVxXhkNFdSGC6RcozO1TrMbG3PN9nCzq+od+BDUOmqfsjR8vJbSwbHUQaWn0DU
 jIBASI2FXlN6Hi3SMFA4m7Scnz1fwL9vUVT4EIJ6U5dsVx+T0qDIbgZ3nzyvp5Tn
 vt6X4VzAZ5d/
 =yitK
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix problems related to frequency limits management in cpufreq
  that were introduced during the 5.3 cycle (when PM QoS had started to
  be used for that), fix a few issues in the OPP (operating performance
  points) library code and fix up the recently added haltpoll cpuidle
  driver.

  The cpufreq changes are somewhat bigger that I would like them to be
  at this stage of the cycle, but the problems fixed by them include
  crashes on boot and shutdown in some cases (among other things) and in
  my view it is better to address the root of the issue right away.

  Specifics:

   - Using device PM QoS of CPU devices for managing frequency limits in
     cpufreq does not work, so introduce frequency QoS (based on the
     original low-level PM QoS) for this purpose, switch cpufreq and
     related code over to using it and fix a race involving deferred
     updates of frequency limits on top of that (Rafael Wysocki, Sudeep
     Holla).

   - Avoid calling regulator_enable()/disable() from the OPP framework
     to avoid side-effects on boot-enabled regulators that may change
     their initial voltage due to performing initial voltage balancing
     without all restrictions from the consumers (Marek Szyprowski).

   - Avoid a kref management issue in the OPP library code and drop an
     incorrectly added lockdep_assert_held() from it (Viresh Kumar).

   - Make the recently added haltpoll cpuidle driver take the 'idle='
     override into account as appropriate (Zhenzhong Duan)"

* tag 'pm-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  opp: Reinitialize the list_kref before adding the static OPPs again
  cpufreq: Cancel policy update work scheduled before freeing
  cpuidle: haltpoll: Take 'idle=' override into account
  opp: core: Revert "add regulators enable and disable"
  PM: QoS: Drop frequency QoS types from device PM QoS
  cpufreq: Use per-policy frequency QoS
  PM: QoS: Introduce frequency QoS
  opp: of: drop incorrect lockdep_assert_held()
2019-10-24 15:36:11 -04:00
Aleksa Sarai
a713af394c cgroup: pids: use atomic64_t for pids->limit
Because pids->limit can be changed concurrently (but we don't want to
take a lock because it would be needlessly expensive), use atomic64_ts
instead.

Fixes: commit 49b786ea14 ("cgroup: implement the PIDs subsystem")
Cc: stable@vger.kernel.org # v4.3+
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-24 12:07:10 -07:00
Daniel Thompson
d07ce4e32a kdb: Avoid array subscript warnings on non-SMP builds
Recent versions of gcc (reported on gcc-7.4) issue array subscript
warnings for builds where SMP is not enabled.

kernel/debug/debug_core.c: In function 'kdb_dump_stack_on_cpu':
kernel/debug/debug_core.c:452:17: warning: array subscript is outside array
+bounds [-Warray-bounds]
     if (!(kgdb_info[cpu].exception_state & DCPU_IS_SLAVE)) {
           ~~~~~~~~~^~~~~
   kernel/debug/debug_core.c:469:33: warning: array subscript is outside array
+bounds [-Warray-bounds]
     kgdb_info[cpu].exception_state |= DCPU_WANT_BT;
   kernel/debug/debug_core.c:470:18: warning: array subscript is outside array
+bounds [-Warray-bounds]
     while (kgdb_info[cpu].exception_state & DCPU_WANT_BT)

There is no bug here but there is scope to improve the code
generation for non-SMP systems (whilst also silencing the warning).

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 2277b49258 ("kdb: Fix stack crawling on 'running' CPUs that aren't the master")
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20191021101057.23861-1-daniel.thompson@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
2019-10-24 15:34:58 +01:00
Linus Torvalds
fa8a74de06 Two minor fixes:
- A race in perf trace initialization (missing mutexes)
 
  - Minor fix to represent gfp_t in synthetic events as properly signed
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXa5C9hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrFqAP9FskyS8xr1iqquWGarC38jYaLg8C2c
 7ou+tdIm+Rez5QEA+3E456Esusmx3dACk2QLeb788Pf1wQOjKXyZ9H+2eAE=
 =ys+Y
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two minor fixes:

   - A race in perf trace initialization (missing mutexes)

   - Minor fix to represent gfp_t in synthetic events as properly
     signed"

* tag 'trace-v5.4-rc3-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix race in perf_trace_buf initialization
  tracing: Fix "gfp_t" format for synthetic events
2019-10-23 15:43:51 -04:00
David Howells
ce4dd4429b Remove the nr_exclusive argument from __wake_up_sync_key()
Remove the nr_exclusive argument from __wake_up_sync_key() and derived
functions as everything seems to set it to 1.  Note also that if it wasn't
set to 1, it would clear WF_SYNC anyway.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-10-23 17:02:34 +01:00
Yi Wang
7f2cbcbcaf posix-cpu-timers: Fix two trivial comments
Recent changes modified the function arguments of
thread_group_sample_cputime() and task_cputimers_expired(), but forgot to
update the comments. Fix it up.

[ tglx: Changed the argument name of task_cputimers_expired() as the pointer
  	points to an array of samples. ]

Fixes: b7be4ef136 ("posix-cpu-timers: Switch thread group sampling to array")
Fixes: 001f797143 ("posix-cpu-timers: Make expiry checks array based")
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1571643852-21848-1-git-send-email-wang.yi59@zte.com.cn
2019-10-23 14:48:24 +02:00
Ben Dooks (Codethink)
086ee46b08 timers/sched_clock: Include local timekeeping.h for missing declarations
Include the timekeeping.h header to get the declaration of the
sched_clock_{suspend,resume} functions. Fixes the following sparse
warnings:

kernel/time/sched_clock.c:275:5: warning: symbol 'sched_clock_suspend' was not declared. Should it be static?
kernel/time/sched_clock.c:286:6: warning: symbol 'sched_clock_resume' was not declared. Should it be static?

Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191022131226.11465-1-ben.dooks@codethink.co.uk
2019-10-23 14:48:23 +02:00
Daniel Borkmann
3b4d9eb2ee bpf: Fix use after free in bpf_get_prog_name
There is one more problematic case I noticed while recently fixing BPF kallsyms
handling in cd7455f101 ("bpf: Fix use after free in subprog's jited symbol
removal") and that is bpf_get_prog_name().

If BTF has been attached to the prog, then we may be able to fetch the function
signature type id in kallsyms through prog->aux->func_info[prog->aux->func_idx].type_id.
However, while the BTF object itself is torn down via RCU callback, the prog's
aux->func_info is immediately freed via kvfree(prog->aux->func_info) once the
prog's refcount either hit zero or when subprograms were already exposed via
kallsyms and we hit the error path added in 5482e9a93c ("bpf: Fix memleak in
aux->func_info and aux->btf").

This violates RCU as well since kallsyms could be walked in parallel where we
could access aux->func_info. Hence, defer kvfree() to after RCU grace period.
Looking at ba64e7d852 ("bpf: btf: support proper non-jit func info") there
is no reason/dependency where we couldn't defer the kvfree(aux->func_info) into
the RCU callback.

Fixes: 5482e9a93c ("bpf: Fix memleak in aux->func_info and aux->btf")
Fixes: ba64e7d852 ("bpf: btf: support proper non-jit func info")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/875f2906a7c1a0691f2d567b4d8e4ea2739b1e88.1571779205.git.daniel@iogearbox.net
2019-10-22 21:59:49 -07:00
Daniel Borkmann
cd7455f101 bpf: Fix use after free in subprog's jited symbol removal
syzkaller managed to trigger the following crash:

  [...]
  BUG: unable to handle page fault for address: ffffc90001923030
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD aa551067 P4D aa551067 PUD aa552067 PMD a572b067 PTE 80000000a1173163
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 0 PID: 7982 Comm: syz-executor912 Not tainted 5.4.0-rc3+ #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:bpf_jit_binary_hdr include/linux/filter.h:787 [inline]
  RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:531 [inline]
  RIP: 0010:bpf_tree_comp kernel/bpf/core.c:600 [inline]
  RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
  RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
  RIP: 0010:bpf_prog_kallsyms_find kernel/bpf/core.c:674 [inline]
  RIP: 0010:is_bpf_text_address+0x184/0x3b0 kernel/bpf/core.c:709
  [...]
  Call Trace:
   kernel_text_address kernel/extable.c:147 [inline]
   __kernel_text_address+0x9a/0x110 kernel/extable.c:102
   unwind_get_return_address+0x4c/0x90 arch/x86/kernel/unwind_frame.c:19
   arch_stack_walk+0x98/0xe0 arch/x86/kernel/stacktrace.c:26
   stack_trace_save+0xb6/0x150 kernel/stacktrace.c:123
   save_stack mm/kasan/common.c:69 [inline]
   set_track mm/kasan/common.c:77 [inline]
   __kasan_kmalloc+0x11c/0x1b0 mm/kasan/common.c:510
   kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:518
   slab_post_alloc_hook mm/slab.h:584 [inline]
   slab_alloc mm/slab.c:3319 [inline]
   kmem_cache_alloc+0x1f5/0x2e0 mm/slab.c:3483
   getname_flags+0xba/0x640 fs/namei.c:138
   getname+0x19/0x20 fs/namei.c:209
   do_sys_open+0x261/0x560 fs/open.c:1091
   __do_sys_open fs/open.c:1115 [inline]
   __se_sys_open fs/open.c:1110 [inline]
   __x64_sys_open+0x87/0x90 fs/open.c:1110
   do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [...]

After further debugging it turns out that we walk kallsyms while in parallel
we tear down a BPF program which contains subprograms that have been JITed
though the program itself has not been fully exposed and is eventually bailing
out with error.

The bpf_prog_kallsyms_del_subprogs() in bpf_prog_load()'s error path removes
the symbols, however, bpf_prog_free() tears down the JIT memory too early via
scheduled work. Instead, it needs to properly respect RCU grace period as the
kallsyms walk for BPF is under RCU.

Fix it by refactoring __bpf_prog_put()'s tear down and reuse it in our error
path where we defer final destruction when we have subprogs in the program.

Fixes: 7d1982b4e3 ("bpf: fix panic in prog load calls cleanup")
Fixes: 1c2a088a66 ("bpf: x64: add JIT support for multi-function programs")
Reported-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
Link: https://lore.kernel.org/bpf/55f6367324c2d7e9583fa9ccf5385dcbba0d7a6e.1571752452.git.daniel@iogearbox.net
2019-10-22 11:26:09 -07:00
Alexander Shishkin
f3a519e4ad perf/aux: Fix AUX output stopping
Commit:

  8a58ddae23 ("perf/core: Fix exclusive events' grouping")

allows CAP_EXCLUSIVE events to be grouped with other events. Since all
of those also happen to be AUX events (which is not the case the other
way around, because arch/s390), this changes the rules for stopping the
output: the AUX event may not be on its PMU's context any more, if it's
grouped with a HW event, in which case it will be on that HW event's
context instead. If that's the case, munmap() of the AUX buffer can't
find and stop the AUX event, potentially leaving the last reference with
the atomic context, which will then end up freeing the AUX buffer. This
will then trip warnings:

Fix this by using the context's PMU context when looking for events
to stop, instead of the event's PMU context.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20191022073940.61814-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-22 14:39:37 +02:00
Prateek Sood
6b1340cc00 tracing: Fix race in perf_trace_buf initialization
A race condition exists while initialiazing perf_trace_buf from
perf_trace_init() and perf_kprobe_init().

      CPU0                                        CPU1
perf_trace_init()
  mutex_lock(&event_mutex)
    perf_trace_event_init()
      perf_trace_event_reg()
        total_ref_count == 0
	buf = alloc_percpu()
        perf_trace_buf[i] = buf
        tp_event->class->reg() //fails       perf_kprobe_init()
	goto fail                              perf_trace_event_init()
                                                 perf_trace_event_reg()
        fail:
	  total_ref_count == 0

                                                   total_ref_count == 0
                                                   buf = alloc_percpu()
                                                   perf_trace_buf[i] = buf
                                                   tp_event->class->reg()
                                                   total_ref_count++

          free_percpu(perf_trace_buf[i])
          perf_trace_buf[i] = NULL

Any subsequent call to perf_trace_event_reg() will observe total_ref_count > 0,
causing the perf_trace_buf to be always NULL. This can result in perf_trace_buf
getting accessed from perf_trace_buf_alloc() without being initialized. Acquiring
event_mutex in perf_kprobe_init() before calling perf_trace_event_init() should
fix this race.

The race caused the following bug:

 Unable to handle kernel paging request at virtual address 0000003106f2003c
 Mem abort info:
   ESR = 0x96000045
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000045
   CM = 0, WnR = 1
 user pgtable: 4k pages, 39-bit VAs, pgdp = ffffffc034b9b000
 [0000003106f2003c] pgd=0000000000000000, pud=0000000000000000
 Internal error: Oops: 96000045 [#1] PREEMPT SMP
 Process syz-executor (pid: 18393, stack limit = 0xffffffc093190000)
 pstate: 80400005 (Nzcv daif +PAN -UAO)
 pc : __memset+0x20/0x1ac
 lr : memset+0x3c/0x50
 sp : ffffffc09319fc50

  __memset+0x20/0x1ac
  perf_trace_buf_alloc+0x140/0x1a0
  perf_trace_sys_enter+0x158/0x310
  syscall_trace_enter+0x348/0x7c0
  el0_svc_common+0x11c/0x368
  el0_svc_handler+0x12c/0x198
  el0_svc+0x8/0xc

Ramdumps showed the following:
  total_ref_count = 3
  perf_trace_buf = (
      0x0 -> NULL,
      0x0 -> NULL,
      0x0 -> NULL,
      0x0 -> NULL)

Link: http://lkml.kernel.org/r/1571120245-4186-1-git-send-email-prsood@codeaurora.org

Cc: stable@vger.kernel.org
Fixes: e12f03d703 ("perf/core: Implement the 'perf_kprobe' PMU")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Prateek Sood <prsood@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-21 19:38:28 -04:00
Ingo Molnar
aa7a7b7297 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-22 01:15:32 +02:00
Toke Høiland-Jørgensen
ce197d83a9 xdp: Handle device unregister for devmap_hash map type
It seems I forgot to add handling of devmap_hash type maps to the device
unregister hook for devmaps. This omission causes devices to not be
properly released, which causes hangs.

Fix this by adding the missing handler.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191019111931.2981954-1-toke@redhat.com
2019-10-21 15:51:41 -07:00
Christian Brauner
b612e5df45
clone3: add CLONE_CLEAR_SIGHAND
Reset all signal handlers of the child not set to SIG_IGN to SIG_DFL.
Mutually exclusive with CLONE_SIGHAND to not disturb other thread's
signal handler.

In the spirit of closer cooperation between glibc developers and kernel
developers (cf. [2]) this patchset came out of a discussion on the glibc
mailing list for improving posix_spawn() (cf. [1], [3], [4]). Kernel
support for this feature has been explicitly requested by glibc and I
see no reason not to help them with this.

The child helper process on Linux posix_spawn must ensure that no signal
handlers are enabled, so the signal disposition must be either SIG_DFL
or SIG_IGN. However, it requires a sigprocmask to obtain the current
signal mask and at least _NSIG sigaction calls to reset the signal
handlers for each posix_spawn call or complex state tracking that might
lead to data corruption in glibc. Adding this flags lets glibc avoid
these problems.

[1]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00149.html
[3]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00158.html
[4]: https://www.sourceware.org/ml/libc-alpha/2019-10/msg00160.html
[2]: https://lwn.net/Articles/799331/
     '[...] by asking for better cooperation with the C-library projects
     in general. They should be copied on patches containing ABI
     changes, for example. I noted that there are often times where
     C-library developers wish the kernel community had done things
     differently; how could those be avoided in the future? Members of
     the audience suggested that more glibc developers should perhaps
     join the linux-api list. The other suggestion was to "copy Florian
     on everything".'
Cc: Florian Weimer <fweimer@redhat.com>
Cc: libc-alpha@sourceware.org
Cc: linux-api@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191014104538.3096-1-christian.brauner@ubuntu.com
2019-10-21 21:46:47 +02:00
Thomas Richter
5e6c3c7b1e perf/aux: Fix tracking of auxiliary trace buffer allocation
The following commit from the v5.4 merge window:

  d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")

... breaks auxiliary trace buffer tracking.

If I run command 'perf record -e rbd000' to record samples and saving
them in the **auxiliary** trace buffer then the value of 'locked_vm' becomes
negative after all trace buffers have been allocated and released:

During allocation the values increase:

  [52.250027] perf_mmap user->locked_vm:0x87 pinned_vm:0x0 ret:0
  [52.250115] perf_mmap user->locked_vm:0x107 pinned_vm:0x0 ret:0
  [52.250251] perf_mmap user->locked_vm:0x188 pinned_vm:0x0 ret:0
  [52.250326] perf_mmap user->locked_vm:0x208 pinned_vm:0x0 ret:0
  [52.250441] perf_mmap user->locked_vm:0x289 pinned_vm:0x0 ret:0
  [52.250498] perf_mmap user->locked_vm:0x309 pinned_vm:0x0 ret:0
  [52.250613] perf_mmap user->locked_vm:0x38a pinned_vm:0x0 ret:0
  [52.250715] perf_mmap user->locked_vm:0x408 pinned_vm:0x2 ret:0
  [52.250834] perf_mmap user->locked_vm:0x408 pinned_vm:0x83 ret:0
  [52.250915] perf_mmap user->locked_vm:0x408 pinned_vm:0x103 ret:0
  [52.251061] perf_mmap user->locked_vm:0x408 pinned_vm:0x184 ret:0
  [52.251146] perf_mmap user->locked_vm:0x408 pinned_vm:0x204 ret:0
  [52.251299] perf_mmap user->locked_vm:0x408 pinned_vm:0x285 ret:0
  [52.251383] perf_mmap user->locked_vm:0x408 pinned_vm:0x305 ret:0
  [52.251544] perf_mmap user->locked_vm:0x408 pinned_vm:0x386 ret:0
  [52.251634] perf_mmap user->locked_vm:0x408 pinned_vm:0x406 ret:0
  [52.253018] perf_mmap user->locked_vm:0x408 pinned_vm:0x487 ret:0
  [52.253197] perf_mmap user->locked_vm:0x408 pinned_vm:0x508 ret:0
  [52.253374] perf_mmap user->locked_vm:0x408 pinned_vm:0x589 ret:0
  [52.253550] perf_mmap user->locked_vm:0x408 pinned_vm:0x60a ret:0
  [52.253726] perf_mmap user->locked_vm:0x408 pinned_vm:0x68b ret:0
  [52.253903] perf_mmap user->locked_vm:0x408 pinned_vm:0x70c ret:0
  [52.254084] perf_mmap user->locked_vm:0x408 pinned_vm:0x78d ret:0
  [52.254263] perf_mmap user->locked_vm:0x408 pinned_vm:0x80e ret:0

The value of user->locked_vm increases to a limit then the memory
is tracked by pinned_vm.

During deallocation the size is subtracted from pinned_vm until
it hits a limit. Then a larger value is subtracted from locked_vm
leading to a large number (because of type unsigned):

  [64.267797] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x78d
  [64.267826] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x70c
  [64.267848] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x68b
  [64.267869] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x60a
  [64.267891] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x589
  [64.267911] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x508
  [64.267933] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x487
  [64.267952] perf_mmap_close mmap_user->locked_vm:0x408 pinned_vm:0x406
  [64.268883] perf_mmap_close mmap_user->locked_vm:0x307 pinned_vm:0x406
  [64.269117] perf_mmap_close mmap_user->locked_vm:0x206 pinned_vm:0x406
  [64.269433] perf_mmap_close mmap_user->locked_vm:0x105 pinned_vm:0x406
  [64.269536] perf_mmap_close mmap_user->locked_vm:0x4 pinned_vm:0x404
  [64.269797] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff84 pinned_vm:0x303
  [64.270105] perf_mmap_close mmap_user->locked_vm:0xffffffffffffff04 pinned_vm:0x202
  [64.270374] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe84 pinned_vm:0x101
  [64.270628] perf_mmap_close mmap_user->locked_vm:0xfffffffffffffe04 pinned_vm:0x0

This value sticks for the user until system is rebooted, causing
follow-on system calls using locked_vm resource limit to fail.

Note: There is no issue using the normal trace buffer.

In fact the issue is in perf_mmap_close(). During allocation auxiliary
trace buffer memory is either traced as 'extra' and added to 'pinned_vm'
or trace as 'user_extra' and added to 'locked_vm'. This applies for
normal trace buffers and auxiliary trace buffer.

However in function perf_mmap_close() all auxiliary trace buffer is
subtraced from 'locked_vm' and never from 'pinned_vm'. This breaks the
ballance.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: gor@linux.ibm.com
Cc: hechaol@fb.com
Cc: heiko.carstens@de.ibm.com
Cc: linux-perf-users@vger.kernel.org
Cc: songliubraving@fb.com
Fixes: d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")
Link: https://lkml.kernel.org/r/20191021083354.67868-1-tmricht@linux.ibm.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 11:31:24 +02:00
Vincent Guittot
57abff067a sched/fair: Rework find_idlest_group()
The slow wake up path computes per sched_group statisics to select the
idlest group, which is quite similar to what load_balance() is doing
for selecting busiest group. Rework find_idlest_group() to classify the
sched_group and select the idlest one following the same steps as
load_balance().

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
fc1273f4ce sched/fair: Optimize find_idlest_group()
find_idlest_group() now reads CPU's load_avg in two different ways.

Consolidate the function to read and use load_avg only once and simplify
the algorithm to only look for the group with lowest load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
11f10e5420 sched/fair: Use load instead of runnable load in wakeup path
Runnable load was originally introduced to take into account the case where
blocked load biases the wake up path which may end to select an overloaded
CPU with a large number of runnable tasks instead of an underutilized
CPU with a huge blocked load.

Tha wake up path now starts looking for idle CPUs before comparing
runnable load and it's worth aligning the wake up path with the
load_balance() logic.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:55 +02:00
Vincent Guittot
c63be7be59 sched/fair: Use utilization to select misfit task
Utilization is used to detect a misfit task but the load is then used to
select the task on the CPU which can lead to select a small task with
high weight instead of the task that triggered the misfit migration.

Check that task can't fit the CPU's capacity when selecting the misfit
task instead of using the load.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Link: https://lkml.kernel.org/r/1571405198-27570-9-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
2ab4092fc8 sched/fair: Spread out tasks evenly when not overloaded
When there is only one CPU per group, using the idle CPUs to evenly spread
tasks doesn't make sense and nr_running is a better metrics.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
b0fb1eb4f0 sched/fair: Use load instead of runnable load in load_balance()
'runnable load' was originally introduced to take into account the case
where blocked load biases the load balance decision which was selecting
underutilized groups with huge blocked load whereas other groups were
overloaded.

The load is now only used when groups are overloaded. In this case,
it's worth being conservative and taking into account the sleeping
tasks that might wake up on the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
5e23e47443 sched/fair: Use rq->nr_running when balancing load
CFS load_balance() only takes care of CFS tasks whereas CPUs can be used by
other scheduling classes. Typically, a CFS task preempted by an RT or deadline
task will not get a chance to be pulled by another CPU because
load_balance() doesn't take into account tasks from other classes.
Add sum of nr_running in the statistics and use it to detect such
situations.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:54 +02:00
Vincent Guittot
0b0695f2b3 sched/fair: Rework load_balance()
The load_balance() algorithm contains some heuristics which have become
meaningless since the rework of the scheduler's metrics like the
introduction of PELT.

Furthermore, load is an ill-suited metric for solving certain task
placement imbalance scenarios.

For instance, in the presence of idle CPUs, we should simply try to get at
least one task per CPU, whereas the current load-based algorithm can actually
leave idle CPUs alone simply because the load is somewhat balanced.

The current algorithm ends up creating virtual and meaningless values like
the avg_load_per_task or tweaks the state of a group to make it overloaded
whereas it's not, in order to try to migrate tasks.

load_balance() should better qualify the imbalance of the group and clearly
define what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of
imbalance. We now have:

	group_has_spare
	group_fully_busy
	group_misfit_task
	group_asym_packing
	group_imbalanced
	group_overloaded

Based on the type of sched_group, load_balance now sets what it wants to
move in order to fix the imbalance. It can be some load as before but also
some utilization, a number of task or a type of task:

	migrate_task
	migrate_util
	migrate_load
	migrate_misfit

This new load_balance() algorithm fixes several pending wrong tasks
placement:

 - the 1 task per CPU case with asymmetric system
 - the case of cfs task preempted by other class
 - the case of tasks not evenly spread on groups with spare capacity

Also the load balance decisions have been consolidated in the 3 functions
below after removing the few bypasses and hacks of the current code:

 - update_sd_pick_busiest() select the busiest sched_group.
 - find_busiest_group() checks if there is an imbalance between local and
   busiest group.
 - calculate_imbalance() decides what have to be moved.

Finally, the now unused field total_running of struct sd_lb_stats has been
removed.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-5-git-send-email-vincent.guittot@linaro.org
[ Small readability and spelling updates. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
fcf0553db6 sched/fair: Remove meaningless imbalance calculation
Clean up load_balance() and remove meaningless calculation and fields before
adding a new algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
a349834703 sched/fair: Rename sg_lb_stats::sum_nr_running to sum_h_nr_running
Rename sum_nr_running to sum_h_nr_running because it effectively tracks
cfs->h_nr_running so we can use sum_nr_running to track rq->nr_running
when needed.

There are no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Link: https://lkml.kernel.org/r/1571405198-27570-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Vincent Guittot
490ba971d8 sched/fair: Clean up asym packing
Clean up asym packing to follow the default load balance behavior:

- classify the group by creating a group_asym_packing field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it.

We don't need to test twice same conditions anymore to detect asym packing
and we consolidate the calculation of imbalance in calculate_imbalance().

There is no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/1571405198-27570-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-21 09:40:53 +02:00
Rafael J. Wysocki
77751a466e PM: QoS: Introduce frequency QoS
Introduce frequency QoS, based on the "raw" low-level PM QoS, to
represent min and max frequency requests and aggregate constraints.

The min and max frequency requests are to be represented by
struct freq_qos_request objects and the aggregate constraints are to
be represented by struct freq_constraints objects.  The latter are
expected to be initialized with the help of freq_constraints_init().

The freq_qos_read_value() helper is defined to retrieve the aggregate
constraints values from a given struct freq_constraints object and
there are the freq_qos_add_request(), freq_qos_update_request() and
freq_qos_remove_request() helpers to manipulate the min and max
frequency requests.  It is assumed that the the helpers will not
run concurrently with each other for the same struct freq_qos_request
object, so if that may be the case, their uses must ensure proper
synchronization between them (e.g. through locking).

In addition, freq_qos_add_notifier() and freq_qos_remove_notifier()
are provided to add and remove notifiers that will trigger on aggregate
constraint changes to and from a given struct freq_constraints object,
respectively.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-10-21 02:05:21 +02:00
Andy Whitcroft
da6043fe85 PM / hibernate: memory_bm_find_bit(): Tighten node optimisation
When looking for a bit by number we make use of the cached result from the
preceding lookup to speed up operation.  Firstly we check if the requested
pfn is within the cached zone and if not lookup the new zone.  We then
check if the offset for that pfn falls within the existing cached node.
This happens regardless of whether the node is within the zone we are
now scanning.  With certain memory layouts it is possible for this to
false trigger creating a temporary alias for the pfn to a different bit.
This leads the hibernation code to free memory which it was never allocated
with the expected fallout.

Ensure the zone we are scanning matches the cached zone before considering
the cached node.

Deep thanks go to Andrea for many, many, many hours of hacking and testing
that went into cornering this bug.

Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-21 01:24:27 +02:00
David S. Miller
2f184393e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Several cases of overlapping changes which were for the most
part trivially resolvable.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-20 10:43:00 -07:00
Linus Torvalds
e2ab4ef83f Kbuild fixes for v5.4 (2nd)
- fix a bashism of setlocalversion
 
  - do not use the too new --sort option of tar
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl2sh9MeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGO90P/13dNgaibb3JcCvp
 +ugn+o26V+aqZ8uDLPPifUNFLLnDfGmgORAOe0Z2mGI2CqMDaWMYjvyZNF1KzOdD
 +WFZubiTUevYyM9AOG6O6EOGEDRumgvdNhnJIP+aCUNKGM50zzhuAa23y6Xip1NZ
 qJY5AYu9KiH9jTXtZAdGReuMdvNRsMZuvmI4Qnc21utItVxDFEQKGdTlRoJaGEdN
 Zc/f1HzWwlr7VWZiy0iMRSCAAtAA3TN3S1bw5QCYsuJb3LRl2EIOKMeSinRTYWRc
 zU5CHrkRSX4/M/o8VantTOZueuGMfbaOAl1KEDh8DUPDcrl+gmL4s8aFuDW+5d2+
 7pPlEe/FNZOAGVBDgiK59N3bl2Rn/Uyg9VGwFc8PU3VeY5VbS97ayBaLFcua2zwX
 88mtsZoq+fN3qsq8gfbb21qauTbEibyrk8lzjHoQl07TW82GFWG/GXSt9dQAm1sb
 w/M15WdaONKOzkYXGGCo0l6vdXx1+qqV/LZQcaWhfcoeU5YVVEmSRwCakCxxSWUE
 rvWtn0vljgH8kyb0gnUP6qLJC9clwMv58sl/mG65/D7zs9B74g16BkX0tU8vLSpr
 Ww1BpGtPwuFPp7+kEzWemtlulhDvow2xIX7bNIkAk1IIVpyCmU98AKpivR7G9Cvt
 4cZPSGRU+s2ou/hrtjryzvmwhGUG
 =KNWF
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild fixes from Masahiro Yamada:

 - fix a bashism of setlocalversion

 - do not use the too new --sort option of tar

* tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: substituting --sort in archive creation
  scripts: setlocalversion: fix a bashism
  kbuild: update comment about KBUILD_ALLDIRS
2019-10-20 12:36:57 -04:00
Linus Torvalds
188768f3c0 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull hrtimer fixlet from Thomas Gleixner:
 "A single commit annotating the lockcless access to timer->base with
  READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Annotate lockless access to timer->base
2019-10-20 06:25:12 -04:00
Linus Torvalds
589f1222e0 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stop-machine fix from Thomas Gleixner:
 "A single fix, amending stop machine with WRITE/READ_ONCE() to address
  the fallout of KCSAN"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stop_machine: Avoid potential race behaviour
2019-10-20 06:22:25 -04:00
Song Liu
aa5de305c9 kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register
Attaching uprobe to text section in THP splits the PMD mapped page table
into PTE mapped entries.  On uprobe detach, we would like to regroup PMD
mapped page table entry to regain performance benefit of THP.

However, the regroup is broken For perf_event based trace_uprobe.  This
is because perf_event based trace_uprobe calls uprobe_unregister twice
on close: first in TRACE_REG_PERF_CLOSE, then in
TRACE_REG_PERF_UNREGISTER.  The second call will split the PMD mapped
page table entry, which is not the desired behavior.

Fix this by only use FOLL_SPLIT_PMD for uprobe register case.

Add a WARN() to confirm uprobe unregister never work on huge pages, and
abort the operation when this WARN() triggers.

Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com
Fixes: 5a52c9df62 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:33 -04:00
Toke Høiland-Jørgensen
05679ca6fe xdp: Prevent overflow in devmap_hash cost calculation for 32-bit builds
Tetsuo pointed out that without an explicit cast, the cost calculation for
devmap_hash type maps could overflow on 32-bit builds. This adds the
missing cast.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191017105702.2807093-1-toke@redhat.com
2019-10-18 16:18:42 -07:00
YueHaibing
1f5343c0ae bpf: Fix build error without CONFIG_NET
If CONFIG_NET is n, building fails:

kernel/trace/bpf_trace.o: In function `raw_tp_prog_func_proto':
bpf_trace.c:(.text+0x1a34): undefined reference to `bpf_skb_output_proto'

Wrap it into a #ifdef to fix this.

Fixes: a7658e1a41 ("bpf: Check types of arguments passed into helpers")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191018090344.26936-1-yuehaibing@huawei.com
2019-10-18 20:57:07 +02:00
Alexei Starovoitov
c108e3c1bd bpf: Fix bpf_attr.attach_btf_id check
Only raw_tracepoint program type can have bpf_attr.attach_btf_id >= 0.
Make sure to reject other program types that accidentally set it to non-zero.

Fixes: ccfe29eb29 ("bpf: Add attach_btf_id attribute to program load")
Reported-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20191018060933.2950231-1-ast@kernel.org
2019-10-18 20:55:54 +02:00
Zhengjun Xing
9fa8c9c647 tracing: Fix "gfp_t" format for synthetic events
In the format of synthetic events, the "gfp_t" is shown as "signed:1",
but in fact the "gfp_t" is "unsigned", should be shown as "signed:0".

The issue can be reproduced by the following commands:

echo 'memlatency u64 lat; unsigned int order; gfp_t gfp_flags; int migratetype' > /sys/kernel/debug/tracing/synthetic_events
cat  /sys/kernel/debug/tracing/events/synthetic/memlatency/format

name: memlatency
ID: 2233
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:u64 lat;  offset:8;       size:8; signed:0;
        field:unsigned int order;       offset:16;      size:4; signed:0;
        field:gfp_t gfp_flags;  offset:24;      size:4; signed:1;
        field:int migratetype;  offset:32;      size:4; signed:1;

print fmt: "lat=%llu, order=%u, gfp_flags=%x, migratetype=%d", REC->lat, REC->order, REC->gfp_flags, REC->migratetype

Link: http://lkml.kernel.org/r/20191018012034.6404-1-zhengjun.xing@linux.intel.com

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-18 14:42:53 -04:00
Linus Torvalds
e59b76ff67 Power management fixes for 5.4-rc4
- Fix possible NULL pointer dereference in the ACPI processor
    scaling initialization code introduced by a recent cpufreq
    update (Rafael Wysocki).
 
  - Fix possible deadlock due to suspending cpufreq too late during
    system shutdown (Rafael Wysocki).
 
  - Make the PCI device system resume code path be more consistent
    with its PM-runtime counterpart to fix an issue with missing
    delay on transitions from D3cold to D0 during system resume from
    suspend-to-idle on some systems (Rafael Wysocki).
 
  - Drop Dell XPS13 9360 from the LPS0 Idle _DSM blacklist to make it
    use suspend-to-idle by default (Mario Limonciello).
 
  - Fix build warning in the core system suspend support code (Ben
    Dooks).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl2pgDcSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxsCQQAJ6L+CC5kEmjfkFeUvCR7xe37wEdeCYc
 NL13rDuKAKQDK6ayJtHp6LUPpSiotOzvjohFplLyn8dAZ7x8V0/KOOGeoL35GSlN
 u18AxkRXT79AzrW1SoDm2TnCCZYV0aW5RupGzPd4vso+E5Q5S3e9Ugk3PBxQfTYH
 YKOr7SpBBICma8ZgplR4tUTLb8brhR/NUAIOVyze52gUKh366kZbDD5tn8eHUxXq
 CBj4ryyV88aeQH2sM1sfvdIlDgFgHrO5GOzcDCCKrEtddqP/hSZyHU+OCH35Vu1v
 it96AEx58IEmd8I18tNZtu5rm1b89YkHeuoh/UF92zCYvk4RVg+Dw3AF0NZVjMQo
 vO7xyO93H48TClHcCwxIOhdI16bXidYfVR1JTeTINvaR7mNZtUddSCEFVKwCazbP
 FC55pVPeUx+HbC6TBQJ9gOSxO2yc2f+uLUXfnKgOpeFZpQuX/WhzuWqUlXgBF2QK
 FHOdfVvX1p1tE9U+3sMLyx1Sln+gqtd2P6GnGzewqIAl43LyJBC4l1s+nkE1b5fc
 7pa0rEqLu4NtGj5O+YQvp+luTQ9FcVL9K5tTuqV4V1Kt+/bzjfIvaEJpwHs44yrZ
 tP2/YCCiA84Hy8SqS4Txk+D14XVYuRguTMBzcmWj2POKFTbDsIhC9bAEZgfJZJNK
 /S2jf5CTnZtl
 =67IL
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These include a fix for a recent regression in the ACPI CPU
performance scaling code, a PCI device power management fix,
a system shutdown fix related to cpufreq, a removal of an ACPI
suspend-to-idle blacklist entry and a build warning fix.

Specifics:

   - Fix possible NULL pointer dereference in the ACPI processor scaling
     initialization code introduced by a recent cpufreq update (Rafael
     Wysocki).

   - Fix possible deadlock due to suspending cpufreq too late during
     system shutdown (Rafael Wysocki).

   - Make the PCI device system resume code path be more consistent with
     its PM-runtime counterpart to fix an issue with missing delay on
     transitions from D3cold to D0 during system resume from
     suspend-to-idle on some systems (Rafael Wysocki).

   - Drop Dell XPS13 9360 from the LPS0 Idle _DSM blacklist to make it
     use suspend-to-idle by default (Mario Limonciello).

   - Fix build warning in the core system suspend support code (Ben
     Dooks)"

* tag 'pm-5.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: Avoid NULL pointer dereferences at init time
  PCI: PM: Fix pci_power_up()
  PM: sleep: include <linux/pm_runtime.h> for pm_wq
  cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown
  ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist
2019-10-18 08:34:04 -07:00
Kefeng Wang
3da2e1fd46 trace: Use pr_warn instead of pr_warning
As said in commit f2c2cbcc35 ("powerpc: Use pr_warn instead of
pr_warning"), removing pr_warning so all logging messages use a
consistent <prefix>_warn style. Let's do it.

Link: http://lkml.kernel.org/r/20191018031850.48498-26-wangkefeng.wang@huawei.com
To: linux-kernel@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-10-18 15:01:57 +02:00
Kefeng Wang
fc65104c7c dma-debug: Use pr_warn instead of pr_warning
As said in commit f2c2cbcc35 ("powerpc: Use pr_warn instead of
pr_warning"), removing pr_warning so all logging messages use a
consistent <prefix>_warn style. Let's do it.

Link: http://lkml.kernel.org/r/20191018031850.48498-25-wangkefeng.wang@huawei.com
To: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-10-18 15:01:56 +02:00
Rafael J. Wysocki
b23eb5c74e Merge branches 'pm-cpufreq' and 'pm-sleep'
* pm-cpufreq:
  ACPI: processor: Avoid NULL pointer dereferences at init time
  cpufreq: Avoid cpufreq_suspend() deadlock on system shutdown

* pm-sleep:
  PM: sleep: include <linux/pm_runtime.h> for pm_wq
  ACPI: PM: Drop Dell XPS13 9360 from LPS0 Idle _DSM blacklist
2019-10-18 10:27:55 +02:00
Yunfeng Ye
d7e78706e4 perf/ring_buffer: Matching the memory allocate and free, in rb_alloc()
Currently perf_mmap_alloc_page() is used to allocate memory in
rb_alloc(), but using free_page() to free memory in the failure path.

It's better to use perf_mmap_free_page() instead.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.co>
Cc: <acme@kernel.org>
Cc: <mingo@redhat.com>
Cc: <mark.rutland@arm.com>
Cc: <namhyung@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/575c7e8c-90c7-4e3a-b41d-f894d8cdbd7f@huawei.com
2019-10-17 21:31:55 +02:00
Yunfeng Ye
8a9f91c51e perf/ring_buffer: Modify the parameter type of perf_mmap_free_page()
In perf_mmap_free_page(), the unsigned long type is converted to the
pointer type, but where the call is made, the pointer type is converted
to the unsigned long type. There is no need to do these operations.

Modify the parameter type of perf_mmap_free_page() to pointer type.

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <jolsa@redhat.co>
Cc: <acme@kernel.org>
Cc: <mingo@redhat.com>
Cc: <mark.rutland@arm.com>
Cc: <namhyung@kernel.org>
Cc: <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/e6ae3f0c-d04c-50f9-544a-aee3b30330cd@huawei.com
2019-10-17 21:31:55 +02:00
Joel Fernandes (Google)
da97e18458 perf_event: Add support for LSM and SELinux checks
In current mainline, the degree of access to perf_event_open(2) system
call depends on the perf_event_paranoid sysctl.  This has a number of
limitations:

1. The sysctl is only a single value. Many types of accesses are controlled
   based on the single value thus making the control very limited and
   coarse grained.
2. The sysctl is global, so if the sysctl is changed, then that means
   all processes get access to perf_event_open(2) opening the door to
   security issues.

This patch adds LSM and SELinux access checking which will be used in
Android to access perf_event_open(2) for the purposes of attaching BPF
programs to tracepoints, perf profiling and other operations from
userspace. These operations are intended for production systems.

5 new LSM hooks are added:
1. perf_event_open: This controls access during the perf_event_open(2)
   syscall itself. The hook is called from all the places that the
   perf_event_paranoid sysctl is checked to keep it consistent with the
   systctl. The hook gets passed a 'type' argument which controls CPU,
   kernel and tracepoint accesses (in this context, CPU, kernel and
   tracepoint have the same semantics as the perf_event_paranoid sysctl).
   Additionally, I added an 'open' type which is similar to
   perf_event_paranoid sysctl == 3 patch carried in Android and several other
   distros but was rejected in mainline [1] in 2016.

2. perf_event_alloc: This allocates a new security object for the event
   which stores the current SID within the event. It will be useful when
   the perf event's FD is passed through IPC to another process which may
   try to read the FD. Appropriate security checks will limit access.

3. perf_event_free: Called when the event is closed.

4. perf_event_read: Called from the read(2) and mmap(2) syscalls for the event.

5. perf_event_write: Called from the ioctl(2) syscalls for the event.

[1] https://lwn.net/Articles/696240/

Since Peter had suggest LSM hooks in 2016 [1], I am adding his
Suggested-by tag below.

To use this patch, we set the perf_event_paranoid sysctl to -1 and then
apply selinux checking as appropriate (default deny everything, and then
add policy rules to give access to domains that need it). In the future
we can remove the perf_event_paranoid sysctl altogether.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: rostedt@goodmis.org
Cc: Yonghong Song <yhs@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: jeffv@google.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: primiano@google.com
Cc: Song Liu <songliubraving@fb.com>
Cc: rsavitski@google.com
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Matthew Garrett <matthewgarrett@google.com>
Link: https://lkml.kernel.org/r/20191014170308.70668-1-joel@joelfernandes.org
2019-10-17 21:31:55 +02:00
Valentin Schneider
9ae7ab20b4 sched/topology: Don't set SD_BALANCE_WAKE on cpuset domain relax
As pointed out in commit

  182a85f8a1 ("sched: Disable wakeup balancing")

SD_BALANCE_WAKE is a tad too aggressive, and is usually left unset.

However, it turns out cpuset domain relaxation will unconditionally set it
on domains below the relaxation level. This made sense back when
SD_BALANCE_WAKE was set unconditionally, but it no longer is the case.

We can improve things slightly by noticing that set_domain_attribute() is
always called after sd_init(), so rather than setting flags we can rely on
whatever sd_init() is doing and only clear certain flags when above the
relaxation level.

While at it, slightly clean up the function and flip the relax level
check to be more human readable.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@kernel.org
Cc: vincent.guittot@linaro.org
Cc: juri.lelli@redhat.com
Cc: seto.hidetoshi@jp.fujitsu.com
Cc: qperret@google.com
Cc: Dietmar.Eggemann@arm.com
Cc: morten.rasmussen@arm.com
Link: https://lkml.kernel.org/r/20191014164408.32596-1-valentin.schneider@arm.com
2019-10-17 21:31:54 +02:00
Alexei Starovoitov
a7658e1a41 bpf: Check types of arguments passed into helpers
Introduce new helper that reuses existing skb perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct sk_buff *' as tracepoint argument or
can walk other kernel data structures to skb pointer.

In order to do that teach verifier to resolve true C types
of bpf helpers into in-kernel BTF ids.
The type of kernel pointer passed by raw tracepoint into bpf
program will be tracked by the verifier all the way until
it's passed into helper function.
For example:
kfree_skb() kernel function calls trace_kfree_skb(skb, loc);
bpf programs receives that skb pointer and may eventually
pass it into bpf_skb_output() bpf helper which in-kernel is
implemented via bpf_skb_event_output() kernel function.
Its first argument in the kernel is 'struct sk_buff *'.
The verifier makes sure that types match all the way.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-11-ast@kernel.org
2019-10-17 16:44:36 +02:00
Alexei Starovoitov
3dec541b2e bpf: Add support for BTF pointers to x86 JIT
Pointer to BTF object is a pointer to kernel object or NULL.
Such pointers can only be used by BPF_LDX instructions.
The verifier changed their opcode from LDX|MEM|size
to LDX|PROBE_MEM|size to make JITing easier.
The number of entries in extable is the number of BPF_LDX insns
that access kernel memory via "pointer to BTF type".
Only these load instructions can fault.
Since x86 extable is relative it has to be allocated in the same
memory region as JITed code.
Allocate it prior to last pass of JITing and let the last pass populate it.
Pointer to extable in bpf_prog_aux is necessary to make page fault
handling fast.
Page fault handling is done in two steps:
1. bpf_prog_kallsyms_find() finds BPF program that page faulted.
   It's done by walking rb tree.
2. then extable for given bpf program is binary searched.
This process is similar to how page faulting is done for kernel modules.
The exception handler skips over faulting x86 instruction and
initializes destination register with zero. This mimics exact
behavior of bpf_probe_read (when probe_kernel_read faults dest is zeroed).

JITs for other architectures can add support in similar way.
Until then they will reject unknown opcode and fallback to interpreter.

Since extable should be aligned and placed near JITed code
make bpf_jit_binary_alloc() return 4 byte aligned image offset,
so that extable aligning formula in bpf_int_jit_compile() doesn't need
to rely on internal implementation of bpf_jit_binary_alloc().
On x86 gcc defaults to 16-byte alignment for regular kernel functions
due to better performance. JITed code may be aligned to 16 in the future,
but it will use 4 in the meantime.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-10-ast@kernel.org
2019-10-17 16:44:36 +02:00
Alexei Starovoitov
2a02759ef5 bpf: Add support for BTF pointers to interpreter
Pointer to BTF object is a pointer to kernel object or NULL.
The memory access in the interpreter has to be done via probe_kernel_read
to avoid page faults.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-9-ast@kernel.org
2019-10-17 16:44:36 +02:00
Alexei Starovoitov
ac4414b5ca bpf: Attach raw_tp program with BTF via type name
BTF type id specified at program load time has all
necessary information to attach that program to raw tracepoint.
Use kernel type name to find raw tracepoint.

Add missing CHECK_ATTR() condition.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-8-ast@kernel.org
2019-10-17 16:44:35 +02:00
Alexei Starovoitov
9e15db6613 bpf: Implement accurate raw_tp context access via BTF
libbpf analyzes bpf C program, searches in-kernel BTF for given type name
and stores it into expected_attach_type.
The kernel verifier expects this btf_id to point to something like:
typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc);
which represents signature of raw_tracepoint "kfree_skb".

Then btf_ctx_access() matches ctx+0 access in bpf program with 'skb'
and 'ctx+8' access with 'loc' arguments of "kfree_skb" tracepoint.
In first case it passes btf_id of 'struct sk_buff *' back to the verifier core
and 'void *' in second case.

Then the verifier tracks PTR_TO_BTF_ID as any other pointer type.
Like PTR_TO_SOCKET points to 'struct bpf_sock',
PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on.
PTR_TO_BTF_ID points to in-kernel structs.
If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF
then PTR_TO_BTF_ID#1234 points to one of in kernel skbs.

When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)r1 + 32)
the btf_struct_access() checks which field of 'struct sk_buff' is
at offset 32. Checks that size of access matches type definition
of the field and continues to track the dereferenced type.
If that field was a pointer to 'struct net_device' the r2's type
will be PTR_TO_BTF_ID#456. Where 456 is btf_id of 'struct net_device'
in vmlinux's BTF.

Such verifier analysis prevents "cheating" in BPF C program.
The program cannot cast arbitrary pointer to 'struct sk_buff *'
and access it. C compiler would allow type cast, of course,
but the verifier will notice type mismatch based on BPF assembly
and in-kernel BTF.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
2019-10-17 16:44:35 +02:00
Alexei Starovoitov
ccfe29eb29 bpf: Add attach_btf_id attribute to program load
Add attach_btf_id attribute to prog_load command.
It's similar to existing expected_attach_type attribute which is
used in several cgroup based program types.
Unfortunately expected_attach_type is ignored for
tracing programs and cannot be reused for new purpose.
Hence introduce attach_btf_id to verify bpf programs against
given in-kernel BTF type id at load time.
It is strictly checked to be valid for raw_tp programs only.
In a later patches it will become:
btf_id == 0 semantics of existing raw_tp progs.
btd_id > 0 raw_tp with BTF and additional type safety.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-5-ast@kernel.org
2019-10-17 16:44:35 +02:00
Alexei Starovoitov
8580ac9404 bpf: Process in-kernel BTF
If in-kernel BTF exists parse it and prepare 'struct btf *btf_vmlinux'
for further use by the verifier.
In-kernel BTF is trusted just like kallsyms and other build artifacts
embedded into vmlinux.
Yet run this BTF image through BTF verifier to make sure
that it is valid and it wasn't mangled during the build.
If in-kernel BTF is incorrect it means either gcc or pahole or kernel
are buggy. In such case disallow loading BPF programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org
2019-10-17 16:44:35 +02:00
Christian Brauner
1e1d0f0b1a
pid: use pid_has_task() in pidfd_open()
Use the new pid_has_task() helper in pidfd_open(). This simplifies the
code and avoids taking rcu_read_{lock,unlock}() and leads to overall
nicer code.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-5-christian.brauner@ubuntu.com
2019-10-17 15:37:00 +02:00
Christian Brauner
1722c14a20
exit: use pid_has_task() in do_wait()
Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this check.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-4-christian.brauner@ubuntu.com
2019-10-17 15:36:58 +02:00
Christian Brauner
1d416a113f
pid: use pid_has_task() in __change_pid()
Replace hlist_empty() with the new pid_has_task() helper which is more
idiomatic, easier to grep for, and unifies how callers perform this
check.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-3-christian.brauner@ubuntu.com
2019-10-17 15:36:56 +02:00
Christian Brauner
3d6d8da48d
pidfd: check pid has attached task in fdinfo
Currently, when a task is dead we still print the pid it used to use in
the fdinfo files of its pidfds. This doesn't make much sense since the
pid may have already been reused. So verify that the task is still alive
by introducing the pid_has_task() helper which will be used by other
callers in follow-up patches.
If the task is not alive anymore, we will print -1. This allows us to
differentiate between a task not being present in a given pid namespace
- in which case we already print 0 - and a task having been reaped.

Note that this uses PIDTYPE_PID for the check. Technically, we could've
checked PIDTYPE_TGID since pidfds currently only refer to thread-group
leaders but if they won't anymore in the future then this check becomes
problematic without it being immediately obvious to non-experts imho. If
a thread is created via clone(CLONE_THREAD) than struct pid has a single
non-empty list pid->tasks[PIDTYPE_PID] and this pid can't be used as a
PIDTYPE_TGID meaning pid->tasks[PIDTYPE_TGID] will return NULL even
though the thread-group leader might still be very much alive. So
checking PIDTYPE_PID is fine and is easier to maintain should we ever
allow pidfds to refer to threads.

Cc: Jann Horn <jannh@google.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: linux-api@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20191017101832.5985-1-christian.brauner@ubuntu.com
2019-10-17 15:36:50 +02:00
Mark Rutland
b1fc583335 stop_machine: Avoid potential race behaviour
Both multi_cpu_stop() and set_state() access multi_stop_data::state
racily using plain accesses. These are subject to compiler
transformations which could break the intended behaviour of the code,
and this situation is detected by KCSAN on both arm64 and x86 (splats
below).

Improve matters by using READ_ONCE() and WRITE_ONCE() to ensure that the
compiler cannot elide, replay, or tear loads and stores.

In multi_cpu_stop() the two loads of multi_stop_data::state are expected to
be a consistent value, so snapshot the value into a temporary variable to
ensure this.

The state transitions are serialized by atomic manipulation of
multi_stop_data::num_threads, and other fields in multi_stop_data are not
modified while subject to concurrent reads.

KCSAN splat on arm64:

| BUG: KCSAN: data-race in multi_cpu_stop+0xa8/0x198 and set_state+0x80/0xb0
|
| write to 0xffff00001003bd00 of 4 bytes by task 24 on cpu 3:
|  set_state+0x80/0xb0
|  multi_cpu_stop+0x16c/0x198
|  cpu_stopper_thread+0x170/0x298
|  smpboot_thread_fn+0x40c/0x560
|  kthread+0x1a8/0x1b0
|  ret_from_fork+0x10/0x18
|
| read to 0xffff00001003bd00 of 4 bytes by task 14 on cpu 1:
|  multi_cpu_stop+0xa8/0x198
|  cpu_stopper_thread+0x170/0x298
|  smpboot_thread_fn+0x40c/0x560
|  kthread+0x1a8/0x1b0
|  ret_from_fork+0x10/0x18
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.3.0-00007-g67ab35a199f4-dirty #3
| Hardware name: linux,dummy-virt (DT)

KCSAN splat on x86:

| write to 0xffffb0bac0013e18 of 4 bytes by task 19 on cpu 2:
|  set_state kernel/stop_machine.c:170 [inline]
|  ack_state kernel/stop_machine.c:177 [inline]
|  multi_cpu_stop+0x1a4/0x220 kernel/stop_machine.c:227
|  cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
|  smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
|  kthread+0x1b5/0x200 kernel/kthread.c:255
|  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| read to 0xffffb0bac0013e18 of 4 bytes by task 44 on cpu 7:
|  multi_cpu_stop+0xb4/0x220 kernel/stop_machine.c:213
|  cpu_stopper_thread+0x19e/0x280 kernel/stop_machine.c:516
|  smpboot_thread_fn+0x1a8/0x300 kernel/smpboot.c:165
|  kthread+0x1b5/0x200 kernel/kthread.c:255
|  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:352
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 7 PID: 44 Comm: migration/7 Not tainted 5.3.0+ #1
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20191007104536.27276-1-mark.rutland@arm.com
2019-10-17 12:47:12 +02:00
Dmitry Goldin
700dea5a0b kheaders: substituting --sort in archive creation
The option --sort=ORDER was only introduced in tar 1.28 (2014), which
is rather new and might not be available in some setups.

This patch tries to replicate the previous behaviour as closely as
possible to fix the kheaders build for older environments. It does
not produce identical archives compared to the previous version due
to minor sorting differences but produces reproducible results itself
in my tests.

Reported-by: Andreas Schwab <schwab@suse.de>
Signed-off-by: Dmitry Goldin <dgoldin+lkml@protonmail.ch>
Tested-by: Andreas Schwab <schwab@suse.de>
Tested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-10-17 09:08:19 +09:00
Song Liu
eac9153f2b bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()
bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
deadlock on rq_lock():

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This can be reproduced by:
1. Start a multi-thread program that does parallel mmap() and malloc();
2. taskset the program to 2 CPUs;
3. Attach bpf program to trace_sched_switch and gather stackmap with
   build-id, e.g. with trace.py from bcc tools:
   trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch

A sample reproducer is attached at the end.

This could also trigger deadlock with other locks that are nested with
rq_lock.

Fix this by checking whether irqs are disabled. Since rq_lock and all
other nested locks are irq safe, it is safe to do up_read() when irqs are
not disable. If the irqs are disabled, postpone up_read() in irq_work.

Fixes: 615755a77b ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191014171223.357174-1-songliubraving@fb.com

Reproducer:
============================ 8< ============================

char *filename;

void *worker(void *p)
{
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
}

int main(int argc, char *argv[])
{
        void *ptr;
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
}
============================ 8< ============================
2019-10-16 10:37:52 -07:00
Ben Dooks
bc88f85c6c kthread: make __kthread_queue_delayed_work static
The __kthread_queue_delayed_work is not exported so
make it static, to avoid the following sparse warning:

  kernel/kthread.c:869:6: warning: symbol '__kthread_queue_delayed_work' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-16 09:20:58 -07:00
Linus Torvalds
02755af0f3 Merge branch 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:

 - Fix a parisc-specific fallout of Christoph's
   dma_set_mask_and_coherent() patches (Sven)

 - Fix a vmap memory leak in ioremap()/ioremap() (Helge)

 - Some minor cleanups and documentation updates (Nick, Helge)

* 'parisc-5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Remove 32-bit DMA enforcement from sba_iommu
  parisc: Fix vmap memory leak in ioremap()/iounmap()
  parisc: prefer __section from compiler_attributes.h
  parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define
  MAINTAINERS: Add hp_sdc drivers to parisc arch
2019-10-15 09:37:01 -07:00
Christian Kellner
15d42eb26b pidfd: add NSpid entries to fdinfo
Currently, the fdinfo file contains the Pid field which shows the
pid a given pidfd refers to in the pid namespace of the procfs
instance. If pid namespaces are configured, also show an NSpid field
for easy retrieval of the pid in all descendant pid namespaces. If
the pid namespace of the process is not a descendant of the pid
namespace of the procfs instance 0 will be shown as its first NSpid
entry and no other entries will be shown. Add a block comment to
pidfd_show_fdinfo with a detailed explanation of Pid and NSpid fields.

Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Christian Kellner <christian@kellner.me>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20191014162034.2185-1-ckellner@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-15 12:17:00 +02:00
Helge Deller
b67114db64 parisc: sysctl.c: Use CONFIG_PARISC instead of __hppa_ define
Signed-off-by: Helge Deller <deller@gmx.de>
2019-10-14 21:43:54 +02:00
David S. Miller
a98d62c3ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-10-14

The following pull-request contains BPF updates for your *net-next* tree.

12 days of development and
85 files changed, 1889 insertions(+), 1020 deletions(-)

The main changes are:

1) auto-generation of bpf_helper_defs.h, from Andrii.

2) split of bpf_helpers.h into bpf_{helpers, helper_defs, endian, tracing}.h
   and move into libbpf, from Andrii.

3) Track contents of read-only maps as scalars in the verifier, from Andrii.

4) small x86 JIT optimization, from Daniel.

5) cross compilation support, from Ivan.

6) bpf flow_dissector enhancements, from Jakub and Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14 12:17:21 -07:00
Eric Dumazet
ff229eee3d hrtimer: Annotate lockless access to timer->base
Followup to commit dd2261ed45 ("hrtimer: Protect lockless access
to timer->base")

lock_hrtimer_base() fetches timer->base without lock exclusion.

Compiler is allowed to read timer->base twice (even if considered dumb)
which could end up trying to lock migration_base and return
&migration_base.

  base = timer->base;
  if (likely(base != &migration_base)) {

       /* compiler reads timer->base again, and now (base == &migration_base)

       raw_spin_lock_irqsave(&base->cpu_base->lock, *flags);
       if (likely(base == timer->base))
            return base; /* == &migration_base ! */

Similarly the write sides must use WRITE_ONCE() to avoid store tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191008173204.180879-1-edumazet@google.com
2019-10-14 15:51:49 +02:00
Linus Torvalds
d4615e5a46 A few tracing fixes:
- Removed locked down from tracefs itself and moved it to the trace
    directory. Having the open functions there do the lockdown checks.
 
  - Fixed a few races with opening an instance file and the instance being
    deleted (Discovered during the locked down updates). Kept separate
    from the clean up code such that they can be backported to stable
    easier.
 
  - Cleaned up and consolidated the checks done when opening a trace
    file, as there were multiple checks that need to be done, and it
    did not make sense having them done in each open instance.
 
  - Fixed a regression in the record mcount code.
 
  - Small hw_lat detector tracer fixes.
 
  - A trace_pipe read fix due to not initializing trace_seq.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXaNhphQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quDIAP4v08ARNdIh+r+c4AOBm3xsOuE/d9GB
 I56ydnskm+x2JQD6Ap9ivXe9yDBIErFeHNtCoq7pM8YDI4YoYIB30N0GfwM=
 =7oAu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A few tracing fixes:

   - Remove lockdown from tracefs itself and moved it to the trace
     directory. Have the open functions there do the lockdown checks.

   - Fix a few races with opening an instance file and the instance
     being deleted (Discovered during the lockdown updates). Kept
     separate from the clean up code such that they can be backported to
     stable easier.

   - Clean up and consolidated the checks done when opening a trace
     file, as there were multiple checks that need to be done, and it
     did not make sense having them done in each open instance.

   - Fix a regression in the record mcount code.

   - Small hw_lat detector tracer fixes.

   - A trace_pipe read fix due to not initializing trace_seq"

* tag 'trace-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
  tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
  tracing/hwlat: Report total time spent in all NMIs during the sample
  recordmcount: Fix nop_mcount() function
  tracing: Do not create tracefs files if tracefs lockdown is in effect
  tracing: Add locked_down checks to the open calls of files created for tracefs
  tracing: Add tracing_check_open_get_tr()
  tracing: Have trace events system open call tracing_open_generic_tr()
  tracing: Get trace_array reference for available_tracers files
  ftrace: Get a reference counter for the trace_array on filter files
  tracefs: Revert ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
2019-10-13 14:47:10 -07:00
Petr Mladek
d303de1fcf tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
A customer reported the following softlockup:

[899688.160002] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [test.sh:16464]
[899688.160002] CPU: 0 PID: 16464 Comm: test.sh Not tainted 4.12.14-6.23-azure #1 SLE12-SP4
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] Kernel panic - not syncing: softlockup: hung tasks
[899688.160002] RIP: 0010:up_write+0x1a/0x30
[899688.160002] RSP: 0018:ffffa86784d4fde8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff12
[899688.160002] RAX: ffffffff970fea00 RBX: 0000000000000001 RCX: 0000000000000000
[899688.160002] RDX: ffffffff00000001 RSI: 0000000000000080 RDI: ffffffff970fea00
[899688.160002] RBP: ffffffffffffffff R08: ffffffffffffffff R09: 0000000000000000
[899688.160002] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b59014720d8
[899688.160002] R13: ffff8b59014720c0 R14: ffff8b5901471090 R15: ffff8b5901470000
[899688.160002]  tracing_read_pipe+0x336/0x3c0
[899688.160002]  __vfs_read+0x26/0x140
[899688.160002]  vfs_read+0x87/0x130
[899688.160002]  SyS_read+0x42/0x90
[899688.160002]  do_syscall_64+0x74/0x160

It caught the process in the middle of trace_access_unlock(). There is
no loop. So, it must be looping in the caller tracing_read_pipe()
via the "waitagain" label.

Crashdump analyze uncovered that iter->seq was completely zeroed
at this point, including iter->seq.seq.size. It means that
print_trace_line() was never able to print anything and
there was no forward progress.

The culprit seems to be in the code:

	/* reset all but tr, trace, and overruns */
	memset(&iter->seq, 0,
	       sizeof(struct trace_iterator) -
	       offsetof(struct trace_iterator, seq));

It was added by the commit 53d0aa7730 ("ftrace:
add logic to record overruns"). It was v2.6.27-rc1.
It was the time when iter->seq looked like:

     struct trace_seq {
	unsigned char		buffer[PAGE_SIZE];
	unsigned int		len;
     };

There was no "size" variable and zeroing was perfectly fine.

The solution is to reinitialize the structure after or without
zeroing.

Link: http://lkml.kernel.org/r/20191011142134.11997-1-pmladek@suse.com

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:49:34 -04:00
Srivatsa S. Bhat (VMware)
fc64e4ad80 tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
max_latency is intended to record the maximum ever observed hardware
latency, which may occur in either part of the loop (inner/outer). So
we need to also consider the outer-loop sample when updating
max_latency.

Link: http://lkml.kernel.org/r/157073345463.17189.18124025522664682811.stgit@srivatsa-ubuntu

Fixes: e7c15cd8a1 ("tracing: Added hardware latency tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:49:33 -04:00
Srivatsa S. Bhat (VMware)
98dc19c114 tracing/hwlat: Report total time spent in all NMIs during the sample
nmi_total_ts is supposed to record the total time spent in *all* NMIs
that occur on the given CPU during the (active portion of the)
sampling window. However, the code seems to be overwriting this
variable for each NMI, thereby only recording the time spent in the
most recent NMI. Fix it by accumulating the duration instead.

Link: http://lkml.kernel.org/r/157073343544.17189.13911783866738671133.stgit@srivatsa-ubuntu

Fixes: 7b2c862501 ("tracing: Add NMI tracing in hwlat detector")
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:49:33 -04:00
Steven Rostedt (VMware)
17911ff38a tracing: Add locked_down checks to the open calls of files created for tracefs
Added various checks on open tracefs calls to see if tracefs is in lockdown
mode, and if so, to return -EPERM.

Note, the event format files (which are basically standard on all machines)
as well as the enabled_functions file (which shows what is currently being
traced) are not lockde down. Perhaps they should be, but it seems counter
intuitive to lockdown information to help you know if the system has been
modified.

Link: http://lkml.kernel.org/r/CAHk-=wj7fGPKUspr579Cii-w_y60PtRaiDgKuxVtBAMK0VNNkA@mail.gmail.com

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:48:06 -04:00
Steven Rostedt (VMware)
8530dec63e tracing: Add tracing_check_open_get_tr()
Currently, most files in the tracefs directory test if tracing_disabled is
set. If so, it should return -ENODEV. The tracing_disabled is called when
tracing is found to be broken. Originally it was done in case the ring
buffer was found to be corrupted, and we wanted to prevent reading it from
crashing the kernel. But it's also called if a tracing selftest fails on
boot. It's a one way switch. That is, once it is triggered, tracing is
disabled until reboot.

As most tracefs files can also be used by instances in the tracefs
directory, they need to be carefully done. Each instance has a trace_array
associated to it, and when the instance is removed, the trace_array is
freed. But if an instance is opened with a reference to the trace_array,
then it requires looking up the trace_array to get its ref counter (as there
could be a race with it being deleted and the open itself). Once it is
found, a reference is added to prevent the instance from being removed (and
the trace_array associated with it freed).

Combine the two checks (tracing_disabled and trace_array_get()) into a
single helper function. This will also make it easier to add lockdown to
tracefs later.

Link: http://lkml.kernel.org/r/20191011135458.7399da44@gandalf.local.home

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:44:07 -04:00
Steven Rostedt (VMware)
aa07d71f1b tracing: Have trace events system open call tracing_open_generic_tr()
Instead of having the trace events system open call open code the taking of
the trace_array descriptor (with trace_array_get()) and then calling
trace_open_generic(), have it use the tracing_open_generic_tr() that does
the combination of the two. This requires making tracing_open_generic_tr()
global.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:43:00 -04:00
Steven Rostedt (VMware)
194c2c74f5 tracing: Get trace_array reference for available_tracers files
As instances may have different tracers available, we need to look at the
trace_array descriptor that shows the list of the available tracers for the
instance. But there's a race between opening the file and an admin
deleting the instance. The trace_array_get() needs to be called before
accessing the trace_array.

Cc: stable@vger.kernel.org
Fixes: 607e2ea167 ("tracing: Set up infrastructure to allow tracers for instances")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:40:50 -04:00
Steven Rostedt (VMware)
9ef16693af ftrace: Get a reference counter for the trace_array on filter files
The ftrace set_ftrace_filter and set_ftrace_notrace files are specific for
an instance now. They need to take a reference to the instance otherwise
there could be a race between accessing the files and deleting the instance.

It wasn't until the :mod: caching where these file operations started
referencing the trace_array directly.

Cc: stable@vger.kernel.org
Fixes: 673feb9d76 ("ftrace: Add :mod: caching infrastructure to trace_array")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-10-12 20:40:21 -04:00
Linus Torvalds
328fefadd9 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a guest-cputime accounting fix, and a cgroup bandwidth
  quota precision fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/vtime: Fix guest/system mis-accounting on task switch
  sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
2019-10-12 15:29:54 -07:00
Linus Torvalds
465a7e291f Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, but also a couple of updates for new Intel
  models (which are technically hw-enablement, but to users it's a fix
  to perf behavior on those new CPUs - hope this is fine), an AUX
  inheritance fix, event time-sharing fix, and a fix for lost non-perf
  NMI events on AMD systems"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  perf/x86/cstate: Add Tiger Lake CPU support
  perf/x86/msr: Add Tiger Lake CPU support
  perf/x86/intel: Add Tiger Lake CPU support
  perf/x86/cstate: Update C-state counters for Ice Lake
  perf/x86/msr: Add new CPU model numbers for Ice Lake
  perf/x86/cstate: Add Comet Lake CPU support
  perf/x86/msr: Add Comet Lake CPU support
  perf/x86/intel: Add Comet Lake CPU support
  perf/x86/amd: Change/fix NMI latency mitigation to use a timestamp
  perf/core: Fix corner case in perf_rotate_context()
  perf/core: Rework memory accounting in perf_mmap()
  perf/core: Fix inheritance of aux_output groups
  perf annotate: Don't return -1 for error when doing BPF disassembly
  perf annotate: Return appropriate error code for allocation failures
  perf annotate: Fix arch specific ->init() failure errors
  perf annotate: Propagate the symbol__annotate() error return
  perf annotate: Fix the signedness of failure returns
  perf annotate: Propagate perf_env__arch() error
  perf evsel: Fall back to global 'perf_env' in perf_evsel__env()
  perf tools: Propagate get_cpuid() error
  ...
2019-10-12 15:15:17 -07:00
Andrii Nakryiko
2dedd7d216 bpf: Fix cast to pointer from integer of different size warning
Fix "warning: cast to pointer from integer of different size" when
casting u64 addr to void *.

Fixes: a23740ec43 ("bpf: Track contents of read-only maps as scalars")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20191011172053.2980619-1-andriin@fb.com
2019-10-11 22:28:47 +02:00
Linus Torvalds
297cbcccc2 for-linus-20191010
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2f5MIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvscD/4v8E1s1rt6JwqM0Fa27UjRdhfGnc8ad8vs
 fD7rf3ZmLkoM1apVopMcAscUH726wU4qbxwEUDEntxxv2wJHuZdSZ64zhFJ17uis
 uJ2pF4MpK/m6DnHZu/SAU4t9aU+l6SBqX0tS1bFycecPgGRk46jrVX5tNJggt0Fy
 hqmx3ACWbkGiFDERT2AAQ69WHfmzeI9aUjx3jJY2eLnK7OjjEpyoEBs0j/AHl3ep
 kydhItU5NSFCv94X7vmZy/dvQ5hE4/1HTFfg79fOZcywQi1AN5DafKxiM2kgaSJ0
 jW58i+AFVtUPysNpVsxvjAgqGwDX/UJkOkggPd6V8/6LMfEvBKY4YNXlUEbqTN3Y
 pqn19/cgdKHaQpHKqwettcQujc71kry/yHsaudD+g2fi0efYi3d4qxIp9XA0TF03
 z6jzp8Hfo2SKbwapIFPa7Wqj86ZpbBxtROibCA17WKSNzn0UR3pJmEigo4l364ow
 nJpvZChLDHZXjovgzISmUnbR+O1yP0+ZnI9b7kgNp0UV4SI5ajf6f2T7667dcQs0
 J1GNt4QvqPza3R0z1SuoEi6tbc3GyMj7NZyIseNOXR/NtqXEWtiNvDIuZqs6Wn/T
 4GhaF0Mjqc17B3UEkdU1z09HL0JR40vUrGYE4lDxHhPWd0YngDGJJX2pZG2Y0WBp
 VQ20AzijzQ==
 =wZnt
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191010' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Fix wbt performance regression introduced with the blk-rq-qos
   refactoring (Harshad)

 - Fix io_uring fileset removal inadvertently killing the workqueue (me)

 - Fix io_uring typo in linked command nonblock submission (Pavel)

 - Remove spurious io_uring wakeups on request free (Pavel)

 - Fix null_blk zoned command error return (Keith)

 - Don't use freezable workqueues for backing_dev, also means we can
   revert a previous libata hack (Mika)

 - Fix nbd sysfs mutex dropped too soon at removal time (Xiubo)

* tag 'for-linus-20191010' of git://git.kernel.dk/linux-block:
  nbd: fix possible sysfs duplicate warning
  null_blk: Fix zoned command return code
  io_uring: only flush workqueues on fileset removal
  io_uring: remove wait loop spurious wakeups
  blk-wbt: fix performance regression in wbt scale_up/scale_down
  Revert "libata, freezer: avoid block device removal while system is frozen"
  bdi: Do not use freezable workqueue
  io_uring: fix reversed nonblock flag for link submission
2019-10-11 08:45:32 -07:00
Oleg Nesterov
937c6b27c7 cgroup: freezer: call cgroup_enter_frozen() with preemption disabled in ptrace_stop()
ptrace_stop() does preempt_enable_no_resched() to avoid the preemption,
but after that cgroup_enter_frozen() does spin_lock/unlock and this adds
another preemption point.

Reported-and-tested-by: Bruce Ashfield <bruce.ashfield@gmail.com>
Fixes: 76f969e894 ("cgroup: cgroup v2 freezer")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-11 08:39:57 -07:00
Andrii Nakryiko
a23740ec43 bpf: Track contents of read-only maps as scalars
Maps that are read-only both from BPF program side and user space side
have their contents constant, so verifier can track referenced values
precisely and use that knowledge for dead code elimination, branch
pruning, etc. This patch teaches BPF verifier how to do this.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191009201458.2679171-2-andriin@fb.com
2019-10-11 01:49:15 +02:00
Christian Brauner
fb3c5386b3 seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE
This allows the seccomp notifier to continue a syscall. A positive
discussion about this feature was triggered by a post to the
ksummit-discuss mailing list (cf. [3]) and took place during KSummit
(cf. [1]) and again at the containers/checkpoint-restore
micro-conference at Linux Plumbers.

Recently we landed seccomp support for SECCOMP_RET_USER_NOTIF (cf. [4])
which enables a process (watchee) to retrieve an fd for its seccomp
filter. This fd can then be handed to another (usually more privileged)
process (watcher). The watcher will then be able to receive seccomp
messages about the syscalls having been performed by the watchee.

This feature is heavily used in some userspace workloads. For example,
it is currently used to intercept mknod() syscalls in user namespaces
aka in containers.
The mknod() syscall can be easily filtered based on dev_t. This allows
us to only intercept a very specific subset of mknod() syscalls.
Furthermore, mknod() is not possible in user namespaces toto coelo and
so intercepting and denying syscalls that are not in the whitelist on
accident is not a big deal. The watchee won't notice a difference.

In contrast to mknod(), a lot of other syscall we intercept (e.g.
setxattr()) cannot be easily filtered like mknod() because they have
pointer arguments. Additionally, some of them might actually succeed in
user namespaces (e.g. setxattr() for all "user.*" xattrs). Since we
currently cannot tell seccomp to continue from a user notifier we are
stuck with performing all of the syscalls in lieu of the container. This
is a huge security liability since it is extremely difficult to
correctly assume all of the necessary privileges of the calling task
such that the syscall can be successfully emulated without escaping
other additional security restrictions (think missing CAP_MKNOD for
mknod(), or MS_NODEV on a filesystem etc.). This can be solved by
telling seccomp to resume the syscall.

One thing that came up in the discussion was the problem that another
thread could change the memory after userspace has decided to let the
syscall continue which is a well known TOCTOU with seccomp which is
present in other ways already.
The discussion showed that this feature is already very useful for any
syscall without pointer arguments. For any accidentally intercepted
non-pointer syscall it is safe to continue.
For syscalls with pointer arguments there is a race but for any cautious
userspace and the main usec cases the race doesn't matter. The notifier
is intended to be used in a scenario where a more privileged watcher
supervises the syscalls of lesser privileged watchee to allow it to get
around kernel-enforced limitations by performing the syscall for it
whenever deemed save by the watcher. Hence, if a user tricks the watcher
into allowing a syscall they will either get a deny based on
kernel-enforced restrictions later or they will have changed the
arguments in such a way that they manage to perform a syscall with
arguments that they would've been allowed to do anyway.
In general, it is good to point out again, that the notifier fd was not
intended to allow userspace to implement a security policy but rather to
work around kernel security mechanisms in cases where the watcher knows
that a given action is safe to perform.

/* References */
[1]: https://linuxplumbersconf.org/event/4/contributions/560
[2]: https://linuxplumbersconf.org/event/4/contributions/477
[3]: https://lore.kernel.org/r/20190719093538.dhyopljyr5ns33qx@brauner.io
[4]: commit 6a21cc50f0 ("seccomp: add a return code to trap to userspace")

Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
CC: Tyler Hicks <tyhicks@canonical.com>
Link: https://lore.kernel.org/r/20190920083007.11475-2-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2019-10-10 14:45:51 -07:00
Douglas Anderson
2277b49258 kdb: Fix stack crawling on 'running' CPUs that aren't the master
In kdb when you do 'btc' (back trace on CPU) it doesn't necessarily
give you the right info.  Specifically on many architectures
(including arm64, where I tested) you can't dump the stack of a
"running" process that isn't the process running on the current CPU.
This can be seen by this:

 echo SOFTLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
 # wait 2 seconds
 <sysrq>g

Here's what I see now on rk3399-gru-kevin.  I see the stack crawl for
the CPU that handled the sysrq but everything else just shows me stuck
in __switch_to() which is bogus:

======

[0]kdb> btc
btc: cpu status: Currently on cpu 0
Available cpus: 0, 1-3(I), 4, 5(I)
Stack traceback for pid 0
0xffffff801101a9c0        0        0  1    0   R  0xffffff801101b3b0 *swapper/0
Call trace:
 dump_backtrace+0x0/0x138
 ...
 kgdb_compiled_brk_fn+0x34/0x44
 ...
 sysrq_handle_dbg+0x34/0x5c
Stack traceback for pid 0
0xffffffc0f175a040        0        0  1    1   I  0xffffffc0f175aa30  swapper/1
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f65616c0
Stack traceback for pid 0
0xffffffc0f175d040        0        0  1    2   I  0xffffffc0f175da30  swapper/2
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f65806c0
Stack traceback for pid 0
0xffffffc0f175b040        0        0  1    3   I  0xffffffc0f175ba30  swapper/3
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f659f6c0
Stack traceback for pid 1474
0xffffffc0dde8b040     1474      727  1    4   R  0xffffffc0dde8ba30  bash
Call trace:
 __switch_to+0x1e4/0x240
 __schedule+0x464/0x618
 0xffffffc0dde8b040
Stack traceback for pid 0
0xffffffc0f17b0040        0        0  1    5   I  0xffffffc0f17b0a30  swapper/5
Call trace:
 __switch_to+0x1e4/0x240
 0xffffffc0f65dd6c0

===

The problem is that 'btc' eventually boils down to
  show_stack(task_struct, NULL);

...and show_stack() doesn't work for "running" CPUs because their
registers haven't been stashed.

On x86 things might work better (I haven't tested) because kdb has a
special case for x86 in kdb_show_stack() where it passes the stack
pointer to show_stack().  This wouldn't work on arm64 where the stack
crawling function seems needs the "fp" and "pc", not the "sp" which is
presumably why arm64's show_stack() function totally ignores the "sp"
parameter.

NOTE: we _can_ get a good stack dump for all the cpus if we manually
switch each one to the kdb master and do a back trace.  AKA:
  cpu 4
  bt
...will give the expected trace.  That's because now arm64's
dump_backtrace will now see that "tsk == current" and go through a
different path.

In this patch I fix the problems by catching a request to stack crawl
a task that's running on a CPU and then I ask that CPU to do the stack
crawl.

NOTE: this will (presumably) change what stack crawls are printed for
x86 machines.  Now kdb functions will show up in the stack crawl.
Presumably this is OK but if it's not we can go back and add a special
case for x86 again.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:28:48 +01:00
Douglas Anderson
55a7e23f46 kdb: Fix "btc <cpu>" crash if the CPU didn't round up
I noticed that when I did "btc <cpu>" and the CPU I passed in hadn't
rounded up that I'd crash.  I was going to copy the same fix from
commit 162bc7f5af ("kdb: Don't back trace on a cpu that didn't round
up") into the "not all the CPUs" case, but decided it'd be better to
clean things up a little bit.

This consolidates the two code paths.  It is _slightly_ wasteful in in
that the checks for "cpu" being too small or being offline isn't
really needed when we're iterating over all online CPUs, but that
really shouldn't hurt.  Better to have the same code path.

While at it, eliminate at least one slightly ugly (and totally
needless) recursive use of kdb_parse().

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:28:14 +01:00
Douglas Anderson
54af3e39ee kdb: Remove unused "argcount" param from kdb_bt1(); make btaprompt bool
The kdb_bt1() had a mysterious "argcount" parameter passed in (always
the number 5, by the way) and never used.  Presumably this is just old
cruft.  Remove it.  While at it, upgrade the btaprompt parameter to a
full fledged bool instead of an int.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:28:08 +01:00
Douglas Anderson
0f8b5b6d56 kgdb: Remove unused DCPU_SSTEP definition
From doing a 'git log --patch kernel/debug', it looks as if DCPU_SSTEP
has never been used.  Presumably it used to be used back when kgdb was
out of tree and nobody thought to delete the definition when the usage
went away.  Delete.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-10-10 16:27:52 +01:00
Ben Dooks
f49249d58a PM: sleep: include <linux/pm_runtime.h> for pm_wq
Include the <linux/runtime_pm.h> for the definition of
pm_wq to avoid the following warning:

kernel/power/main.c:890:25: warning: symbol 'pm_wq' was not declared. Should it be static?

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-10-10 11:11:56 +02:00
Qian Cai
5facae4f35 locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:

  b4adfe8e05 ("locking/lockdep: Remove unused argument in __lock_release")

@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:46:10 +02:00
Song Liu
7fa343b7fd perf/core: Fix corner case in perf_rotate_context()
In perf_rotate_context(), when the first cpu flexible event fail to
schedule, cpu_rotate is 1, while cpu_event is NULL. Since cpu_event is
NULL, perf_rotate_context will _NOT_ call cpu_ctx_sched_out(), thus
cpuctx->ctx.is_active will have EVENT_FLEXIBLE set. Then, the next
perf_event_sched_in() will skip all cpu flexible events because of the
EVENT_FLEXIBLE bit.

In the next call of perf_rotate_context(), cpu_rotate stays 1, and
cpu_event stays NULL, so this process repeats. The end result is, flexible
events on this cpu will not be scheduled (until another event being added
to the cpuctx).

Here is an easy repro of this issue. On Intel CPUs, where ref-cycles
could only use one counter, run one pinned event for ref-cycles, one
flexible event for ref-cycles, and one flexible event for cycles. The
flexible ref-cycles is never scheduled, which is expected. However,
because of this issue, the cycles event is never scheduled either.

 $ perf stat -e ref-cycles:D,ref-cycles,cycles -C 5 -I 1000

           time             counts unit events
    1.000152973         15,412,480      ref-cycles:D
    1.000152973      <not counted>      ref-cycles     (0.00%)
    1.000152973      <not counted>      cycles         (0.00%)
    2.000486957         18,263,120      ref-cycles:D
    2.000486957      <not counted>      ref-cycles     (0.00%)
    2.000486957      <not counted>      cycles         (0.00%)

To fix this, when the flexible_active list is empty, try rotate the
first event in the flexible_groups. Also, rename ctx_first_active() to
ctx_event_to_rotate(), which is more accurate.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 8d5bce0c37 ("perf/core: Optimize perf_rotate_context() event scheduling")
Link: https://lkml.kernel.org/r/20191008165949.920548-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:44:13 +02:00
Song Liu
d44248a413 perf/core: Rework memory accounting in perf_mmap()
perf_mmap() always increases user->locked_vm. As a result, "extra" could
grow bigger than "user_extra", which doesn't make sense. Here is an
example case:

(Note: Assume "user_lock_limit" is very small.)

  | # of perf_mmap calls |vma->vm_mm->pinned_vm|user->locked_vm|
  | 0                    | 0                   | 0             |
  | 1                    | user_extra          | user_extra    |
  | 2                    | 3 * user_extra      | 2 * user_extra|
  | 3                    | 6 * user_extra      | 3 * user_extra|
  | 4                    | 10 * user_extra     | 4 * user_extra|

Fix this by maintaining proper user_extra and extra.

Reviewed-By: Hechao Li <hechaol@fb.com>
Reported-by: Hechao Li <hechaol@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Jie Meng <jmeng@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190904214618.3795672-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:44:12 +02:00
Frederic Weisbecker
8d495477d6 sched/cputime: Spare a seqcount lock/unlock cycle on context switch
On context switch we are locking the vtime seqcount of the scheduling-out
task twice:

 * On vtime_task_switch_common(), when we flush the pending vtime through
   vtime_account_system()

 * On arch_vtime_task_switch() to reset the vtime state.

This is pointless as these actions can be performed without the need
to unlock/lock in the middle. The reason these steps are separated is to
consolidate a very small amount of common code between
CONFIG_VIRT_CPU_ACCOUNTING_GEN and CONFIG_VIRT_CPU_ACCOUNTING_NATIVE.

Performance in this fast path is definitely a priority over artificial
code factorization so split the task switch code between GEN and
NATIVE and mutualize the parts than can run under a single seqcount
locked block.

As a side effect, vtime_account_idle() becomes included in the seqcount
protection. This happens to be a welcome preparation in order to
properly support kcpustat under vtime in the future and fetch
CPUTIME_IDLE without race.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191003161745.28464-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:39:26 +02:00
Frederic Weisbecker
f83eeb1a01 sched/cputime: Rename vtime_account_system() to vtime_account_kernel()
vtime_account_system() decides if we need to account the time to the
system (__vtime_account_system()) or to the guest (vtime_account_guest()).

So this function is a misnomer as we are on a higher level than
"system". All we know when we call that function is that we are
accounting kernel cputime. Whether it belongs to guest or system time
is a lower level detail.

Rename this function to vtime_account_kernel(). This will clarify things
and avoid too many underscored vtime_account_system() versions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191003161745.28464-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:39:25 +02:00
Frederic Weisbecker
68e7a4d66b sched/vtime: Fix guest/system mis-accounting on task switch
vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:

	vtime_common_task_switch(prev)
		vtime_account_system(prev)

Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.

So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.

As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.

Fix this assumption and also turn vtime_guest_enter/exit() to use the
task passed in parameter as well to avoid future similar issues.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpengli@tencent.com>
Fixes: 2a42eb9594 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:38:03 +02:00
Xuewei Zhang
4929a4e6fa sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:

  normalized_cfs_quota() = [(quota_us << 20) / period_us]

If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.

See below example:

A userspace container manager (kubelet) does three operations:

 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
 2) Create a few children cgroups.
 3) Set quota to 1,000us and period to 10,000us on a child cgroup.

These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:

  new_quota: 1148437ns,   1148us
 new_period: 11484375ns, 11484us

And when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.

Scaling them by a factor of 2 will fix the problem.

Tested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Xuewei Zhang <xueweiz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Fixes: 2e8e192263 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:38:02 +02:00
Linus Torvalds
eda57a0e42 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "The usual shower of hotfixes.

  Chris's memcg patches aren't actually fixes - they're mature but a few
  niggling review issues were late to arrive.

  The ocfs2 fixes are quite old - those took some time to get reviewer
  attention.

  Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg,
  mm/slab-generic"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
  mm, sl[ou]b: improve memory accounting
  mm, memcg: make scan aggression always exclude protection
  mm, memcg: make memory.emin the baseline for utilisation determination
  mm, memcg: proportional memory.{low,min} reclaim
  mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()
  mm/page_alloc.c: fix a crash in free_pages_prepare()
  mm/z3fold.c: claim page in the beginning of free
  kernel/sysctl.c: do not override max_threads provided by userspace
  memcg: only record foreign writebacks with dirty pages when memcg is not disabled
  mm: fix -Wmissing-prototypes warnings
  writeback: fix use-after-free in finish_writeback_work()
  mm/memremap: drop unused SECTION_SIZE and SECTION_MASK
  panic: ensure preemption is disabled during panic()
  fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc()
  fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()
  fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()
  ocfs2: clear zero in unaligned direct IO
2019-10-07 16:04:19 -07:00
Michal Hocko
b0f53dbc4b kernel/sysctl.c: do not override max_threads provided by userspace
Partially revert 16db3d3f11 ("kernel/sysctl.c: threads-max observe
limits") because the patch is causing a regression to any workload which
needs to override the auto-tuning of the limit provided by kernel.

set_max_threads is implementing a boot time guesstimate to provide a
sensible limit of the concurrently running threads so that runaways will
not deplete all the memory.  This is a good thing in general but there
are workloads which might need to increase this limit for an application
to run (reportedly WebSpher MQ is affected) and that is simply not
possible after the mentioned change.  It is also very dubious to
override an admin decision by an estimation that doesn't have any direct
relation to correctness of the kernel operation.

Fix this by dropping set_max_threads from sysctl_max_threads so any
value is accepted as long as it fits into MAX_THREADS which is important
to check because allowing more threads could break internal robust futex
restriction.  While at it, do not use MIN_THREADS as the lower boundary
because it is also only a heuristic for automatic estimation and admin
might have a good reason to stop new threads to be created even when
below this limit.

This became more severe when we switched x86 from 4k to 8k kernel
stacks.  Starting since 6538b8ea88 ("x86_64: expand kernel stack to
16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned
value.

In the particular case

  3.12
  kernel.threads-max = 515561

  4.4
  kernel.threads-max = 200000

Neither of the two values is really insane on 32GB machine.

I am not sure we want/need to tune the max_thread value further.  If
anything the tuning should be removed altogether if proven not useful in
general.  But we definitely need a way to override this auto-tuning.

Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
Fixes: 16db3d3f11 ("kernel/sysctl.c: threads-max observe limits")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 15:47:19 -07:00
Will Deacon
20bb759a66 panic: ensure preemption is disabled during panic()
Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the
calling CPU in an infinite loop, but with interrupts and preemption
enabled.  From this state, userspace can continue to be scheduled,
despite the system being "dead" as far as the kernel is concerned.

This is easily reproducible on arm64 when booting with "nosmp" on the
command line; a couple of shell scripts print out a periodic "Ping"
message whilst another triggers a crash by writing to
/proc/sysrq-trigger:

  | sysrq: Trigger a crash
  | Kernel panic - not syncing: sysrq triggered crash
  | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1
  | Hardware name: linux,dummy-virt (DT)
  | Call trace:
  |  dump_backtrace+0x0/0x148
  |  show_stack+0x14/0x20
  |  dump_stack+0xa0/0xc4
  |  panic+0x140/0x32c
  |  sysrq_handle_reboot+0x0/0x20
  |  __handle_sysrq+0x124/0x190
  |  write_sysrq_trigger+0x64/0x88
  |  proc_reg_write+0x60/0xa8
  |  __vfs_write+0x18/0x40
  |  vfs_write+0xa4/0x1b8
  |  ksys_write+0x64/0xf0
  |  __arm64_sys_write+0x14/0x20
  |  el0_svc_common.constprop.0+0xb0/0x168
  |  el0_svc_handler+0x28/0x78
  |  el0_svc+0x8/0xc
  | Kernel Offset: disabled
  | CPU features: 0x0002,24002004
  | Memory Limit: none
  | ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
  |  Ping 2!
  |  Ping 1!
  |  Ping 1!
  |  Ping 2!

The issue can also be triggered on x86 kernels if CONFIG_SMP=n,
otherwise local interrupts are disabled in 'smp_send_stop()'.

Disable preemption in 'panic()' before re-enabling interrupts.

Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org
Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone
Signed-off-by: Will Deacon <will@kernel.org>
Reported-by: Xogium <contact@xogium.me>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 15:47:19 -07:00
Alexander Shishkin
f733c6b508 perf/core: Fix inheritance of aux_output groups
Commit:

  ab43762ef0 ("perf: Allow normal events to output AUX data")

forgets to configure aux_output relation in the inherited groups, which
results in child PEBS events forever failing to schedule.

Fix this by setting up the AUX output link in the inheritance path.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191004125729.32397-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-07 16:50:42 +02:00
Michal Koutný
9a3284fad4 cgroup: Optimize single thread migration
There are reports of users who use thread migrations between cgroups and
they report performance drop after d59cfc09c3 ("sched, cgroup: replace
signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
pronounced on machines with more CPUs.

The migration is affected by forking noise happening in the background,
after the mentioned commit a migrating thread must wait for all
(forking) processes on the system, not only of its threadgroup.

There are several places that need to synchronize with migration:
	a) do_exit,
	b) de_thread,
	c) copy_process,
	d) cgroup_update_dfl_csses,
	e) parallel migration (cgroup_{proc,thread}s_write).

In the case of self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in case of single thread
migration and the executing thread cannot exec(2) or exit(2) while it is
writing into cgroup.threads. In case of do_exit because of signal
delivery, we either exit before the migration or finish the migration
(of not yet PF_EXITING thread) and die afterwards.

This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.

This change improves migration dependent workload performance similar
to per-signal_struct state.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-07 07:11:53 -07:00
Michal Koutný
e7c7b1d85d cgroup: Update comments about task exit path
We no longer take cgroup_mutex in cgroup_exit and the exiting tasks are
not moved to init_css_set, reflect that in several comments to prevent
confusion.

Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-07 07:11:53 -07:00
Miaohe Lin
61e867fde2 cgroup: short-circuit current_cgns_cgroup_from_root() on the default hierarchy
Like commit 13d82fb77a ("cgroup: short-circuit cset_cgroup_from_root() on
the default hierarchy"), short-circuit current_cgns_cgroup_from_root() on
the default hierarchy.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-07 07:01:09 -07:00
Linus Torvalds
7cdb85df60 dma-mapping regression fix for 5.4-rc2
- revert an incorret hunk from a patch that caused problems
    on various arm boards (Andrey Smirnov)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl2aFRoLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMBeBAAuTsOh1amMUdsAJN67PJcHP8JkOlR21cjLVaKkvWh
 l5XnXITtlNvyzXH67jZuQL15+rQ/kTOkmSc5bIDZW7+sTW2Rwnq6bIQOZpBYKlol
 U/UTBtk26SliKoinlJekKWAA6o32PJU2oLOsTmqoCqH5k0aeKdNHAFPSw4fU3jbW
 U4Sv0uc6MI+PM9OM3H/T60qQPvziOkeDp4wAZZ5AO/kUbNgzUrbGatNk26QqgNbs
 NsAVQ3X/sgUAwXMtivo9nFUd2fuEIf9GueGVohGiW+2znWQ8AxY76/FgxzXICmMi
 S0YLqPrdlzzZ0K7k8enPvJo2hd0qh3yFtWyGx9fUt+EBXepp/hMTIRAEVUHpiiSg
 PDTU74TVtXwSYvIQA6jR1bwh9+aMyeDWDFzUwFQh34mahAsZsBKhNLAcpN2uNGv7
 XLL3Lqi58eIhaSaqxM4ASIsBS+FIiQiOdqq4eLVx+x6wxjNDTyZ+ynbWdNs8+SYh
 MIyjY3wibMwaXtFUBV6LgYtwBF/1pgFcu9jWz02HT7Od0c+Et04ihcXISH+w9fpB
 O5WFjo0Oag2HoNm1ODOlLu5DY9CSQftrv4zl0yTQgb1vFB3fPdtv43wIQ8SkVhVu
 kwuF4kgIAyRRoe7HCPK/FJjKiYCo6Y+3WZ/4X7ktddCpxjaVYfclv8pMotirCQPU
 SSo=
 =kS6W
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping regression fix from Christoph Hellwig:
 "Revert an incorret hunk from a patch that caused problems on various
  arm boards (Andrey Smirnov)"

* tag 'dma-mapping-5.4-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix false positive warnings in dma_common_free_remap()
2019-10-06 11:10:15 -07:00
Mika Westerberg
0e48f51cbb Revert "libata, freezer: avoid block device removal while system is frozen"
This reverts commit 85fbd722ad.

The commit was added as a quick band-aid for a hang that happened when a
block device was removed during system suspend. Now that bdi_wq is not
freezable anymore the hang should not be possible and we can get rid of
this hack by reverting it.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-06 09:11:37 -06:00
Linus Torvalds
2d00aee21a Kbuild fixes for v5.4
- remove unneeded ar-option and KBUILD_ARFLAGS
 
  - remove long-deprecated SUBDIRS
 
  - fix modpost to suppress false-positive warnings for UML builds
 
  - fix namespace.pl to handle relative paths to ${objtree}, ${srctree}
 
  - make setlocalversion work for /bin/sh
 
  - make header archive reproducible
 
  - fix some Makefiles and documents
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl2YPUEeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGVu4P/3Qv7Ov3/R4BlgYb
 +LaKupCY/ADE5bRAv/N5AAy37+TJmTLQswN2/JwHflYvTeWd4kZjquFpJkFNwMsk
 Qlb79NQvyM9+NlFfSFjap8HBNb0J8A+92aKmrHmh1sQqJJs6JPZ0MOGoAXmgsJaN
 SPLvhqophKpmYu7Oa0x2aC2kq+1DnCQyMLTOuVCdrtF0tF8w0hiowDz5GOmOi1U6
 VK2ECfzjenFkfbqZOUVBPVfPR9hMpmVBdKdFLwD/iTKVkShZcWmdbxk/ADbemyet
 2njehRF2HGp7opbwM4AxIeIubCqYSkThUpLJarKWk/8W87gksH6pCR8yIq1nOwkO
 l+/GY2YdvkBdDCkSKpLiQxtJaqnZb8Yv1ZPvCfGF09Ba8tFtwX+HSecSkLFHGyJv
 K9FD0XSOFBkQesZWdpIr/xeLwwiuSH80QACrub1Z5Q4OCURaBkKwrO/eDG1/2Xle
 YKGZO2va2VVkeo5bisOZ2vfISwZrtiaGakQ8vTdq/5RO59/JvQjsGB8KbccaKXAu
 Ozk8vVqkwTmLP6gzIEd2Wr/ICNGuAVc0EELT7lj07hcd6rzsCxPWVXqTFsCkGBJe
 587i1jeH1z9oyrHUcP6dhR3joIuOUuUJk1uR7YZq4L4POSvrJnvzMFkSv6tBKL2p
 Uq9qD7mpt/9zl3PART7HK9KYfTGJ
 =fSXc
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - remove unneeded ar-option and KBUILD_ARFLAGS

 - remove long-deprecated SUBDIRS

 - fix modpost to suppress false-positive warnings for UML builds

 - fix namespace.pl to handle relative paths to ${objtree}, ${srctree}

 - make setlocalversion work for /bin/sh

 - make header archive reproducible

 - fix some Makefiles and documents

* tag 'kbuild-fixes-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kheaders: make headers archive reproducible
  kbuild: update compile-test header list for v5.4-rc2
  kbuild: two minor updates for Documentation/kbuild/modules.rst
  scripts/setlocalversion: clear local variable to make it work for sh
  namespace: fix namespace.pl script to support relative paths
  video/logo: do not generate unneeded logo C files
  video/logo: remove unneeded *.o pattern from clean-files
  integrity: remove pointless subdir-$(CONFIG_...)
  integrity: remove unneeded, broken attempt to add -fshort-wchar
  modpost: fix static EXPORT_SYMBOL warnings for UML build
  kbuild: correct formatting of header in kbuild module docs
  kbuild: remove SUBDIRS support
  kbuild: remove ar-option and KBUILD_ARFLAGS
2019-10-05 12:56:59 -07:00
Wolfgang M. Reimer
67d64918a1 locking: locktorture: Do not include rwlock.h directly
Including rwlock.h directly will cause kernel builds to fail
if CONFIG_PREEMPT_RT is defined. The correct header file
(rwlock_rt.h OR rwlock.h) will be included by spinlock.h which
is included by locktorture.c anyway.

Remove the include of linux/rwlock.h.

Signed-off-by: Wolfgang M. Reimer <linuxball@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:50:24 -07:00
Paul E. McKenney
fbbd5e358c rcutorture: Make in-kernel-loop testing more brutal
The rcu_torture_fwd_prog_nr() tests the ability of RCU to tolerate
in-kernel busy loops.  It invokes rcu_torture_fwd_prog_cond_resched()
within its delay loop, which, in PREEMPT && NO_HZ_FULL kernels results
in the occasional direct call to schedule().  Now, this direct call to
schedule() is appropriate for call_rcu() flood testing, in which either
the kernel should restrain itself or userspace transitions will supply
the needed restraint.  But in pure in-kernel loops, the occasional
cond_resched() should do the job.

This commit therefore makes rcu_torture_fwd_prog_nr() use cond_resched()
instead of rcu_torture_fwd_prog_cond_resched() in order to increase the
brutality of this aspect of rcutorture testing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:50:18 -07:00
Paul E. McKenney
8b5ddf8b99 rcutorture: Separate warnings for each failure type
Currently, each of six different types of failure triggers a
single WARN_ON_ONCE(), and it is then necessary to stare at the
rcu_torture_stats(), Reader Pipe, and Reader Batch lines looking for
inappropriately non-zero values.  This can be annoying and error-prone,
so this commit provides a separate WARN_ON_ONCE() for each of the
six error conditions and adds short comments to each to ease error
identification.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:50:03 -07:00
Ethan Hansen
b3ffb206dd rcu: Remove unused variable rcu_perf_writer_state
The variable rcu_perf_writer_state is declared and initialized,
but is never actually referenced. Remove it to clean code.

Signed-off-by: Ethan Hansen <1ethanhansen@gmail.com>
[ paulmck: Also removed unused macros assigned to that variable. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:49:36 -07:00
Chuhong Yuan
c5d3c8ca22 locktorture: Replace strncmp() with str_has_prefix()
The strncmp() function is error-prone because it is easy to get the
length wrong, especially if the string is subject to change, especially
given the need to account for the terminating nul byte.  This commit
therefore substitutes the newly introduced str_has_prefix(), which
does not require a separately specified length.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:48:38 -07:00
Ethan Hansen
ac5f636130 rcu: Remove unused function rcutorture_record_progress()
The function rcutorture_record_progress() is declared in rcu.h, but is
never used.  This commit therefore removes rcutorture_record_progress()
to clean code.

Signed-off-by: Ethan Hansen <1ethanhansen@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 11:48:13 -07:00
Paul E. McKenney
79ba7ff5a9 rcutorture: Emulate dyntick aspect of userspace nohz_full sojourn
During an actual call_rcu() flood, there would be frequent trips to
userspace (in-kernel call_rcu() floods must be otherwise housebroken).
Userspace execution on nohz_full CPUs implies an RCU dyntick idle/not-idle
transition pair, so this commit adds emulation of that pair.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:05 -07:00
Paul E. McKenney
96926686de rcu: Make CPU-hotplug removal operations enable tick
CPU-hotplug removal operations run the multi_cpu_stop() function, which
relies on the scheduler to gain control from whatever is running on the
various online CPUs, including any nohz_full CPUs running long loops in
kernel-mode code.  Lack of the scheduler-clock interrupt on such CPUs
can delay multi_cpu_stop() for several minutes and can also result in
RCU CPU stall warnings.  This commit therefore causes CPU-hotplug removal
operations to enable the scheduler-clock interrupt on all online CPUs.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
[ paulmck: Apply simplifications suggested by Frederic Weisbecker. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:05 -07:00
Paul E. McKenney
366237e7b0 stop_machine: Provide RCU quiescent state in multi_cpu_stop()
When multi_cpu_stop() loops waiting for other tasks, it can trigger an RCU
CPU stall warning.  This can be misleading because what is instead needed
is information on whatever task is blocking multi_cpu_stop().  This commit
therefore inserts an RCU quiescent state into the multi_cpu_stop()
function's waitloop.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:05 -07:00
Paul E. McKenney
d38e6dc6ed rcutorture: Force on tick for readers and callback flooders
Readers and callback flooders in the rcutorture stress-test suite run for
extended time periods by design.  They do take pains to relinquish the
CPU from time to time, but in some cases this relies on the scheduler
being active, which in turn relies on the scheduler-clock interrupt
firing from time to time.

This commit therefore forces scheduling-clock interrupts within
these loops.  While in the area, this commit also prevents
rcu_torture_reader()'s occasional timed sleeps from delaying shutdown.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:04 -07:00
Paul E. McKenney
6a949b7af8 rcu: Force on tick when invoking lots of callbacks
Callback invocation can run for a significant time period, and within
CONFIG_NO_HZ_FULL=y kernels, this period will be devoid of scheduler-clock
interrupts.  In-kernel execution without such interrupts can cause all
manner of malfunction, with RCU CPU stall warnings being but one result.

This commit therefore forces scheduling-clock interrupts on whenever more
than a few RCU callbacks are invoked.  Because offloaded callback invocation
can be preempted, this forcing is withdrawn on each context switch.  This
in turn requires that the loop invoking RCU callbacks reiterate the forcing
periodically.

[ paulmck: Apply Joel Fernandes TICK_DEP_MASK_RCU->TICK_DEP_BIT_RCU fix. ]
[ paulmck: Remove NO_HZ_FULL check per Frederic Weisbecker feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:04 -07:00
Paul E. McKenney
ae9e557b5b time: Export tick start/stop functions for rcutorture
It turns out that rcutorture needs to ensure that the scheduling-clock
interrupt is enabled in CONFIG_NO_HZ_FULL kernels before starting on
CPU-bound in-kernel processing.  This commit therefore exports
tick_nohz_dep_set_task(), tick_nohz_dep_clear_task(), and
tick_nohz_full_setup() to GPL kernel modules.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:46:03 -07:00
Frederic Weisbecker
01b4c39901 nohz: Add TICK_DEP_BIT_RCU
If a nohz_full CPU is looping in the kernel, the scheduling-clock tick
might nevertheless remain disabled.  In !PREEMPT kernels, this can
prevent RCU's attempts to enlist the aid of that CPU's executions of
cond_resched(), which can in turn result in an arbitrarily delayed grace
period and thus an OOM.  RCU therefore needs a way to enable a holdout
nohz_full CPU's scheduler-clock interrupt.

This commit therefore provides a new TICK_DEP_BIT_RCU value which RCU can
pass to tick_dep_set_cpu() and friends to force on the scheduler-clock
interrupt for a specified CPU or task.  In some cases, rcutorture needs
to turn on the scheduler-clock tick, so this commit also exports the
relevant symbols to GPL-licensed modules.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2019-10-05 10:45:16 -07:00
Andrey Smirnov
2cf2aa6a69 dma-mapping: fix false positivse warnings in dma_common_free_remap()
Commit 5cf4537975 ("dma-mapping: introduce a dma_common_find_pages
helper") changed invalid input check in dma_common_free_remap() from:

    if (!area || !area->flags != VM_DMA_COHERENT)

to

    if (!area || !area->flags != VM_DMA_COHERENT || !area->pages)

which seem to produce false positives for memory obtained via
dma_common_contiguous_remap()

This triggers the following warning message when doing "reboot" on ZII
VF610 Dev Board Rev B:

WARNING: CPU: 0 PID: 1 at kernel/dma/remap.c:112 dma_common_free_remap+0x88/0x8c
trying to free invalid coherent area: 9ef82980
Modules linked in:
CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 5.3.0-rc6-next-20190820 #119
Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
Backtrace:
[<8010d1ec>] (dump_backtrace) from [<8010d588>] (show_stack+0x20/0x24)
 r7:8015ed78 r6:00000009 r5:00000000 r4:9f4d9b14
[<8010d568>] (show_stack) from [<8077e3f0>] (dump_stack+0x24/0x28)
[<8077e3cc>] (dump_stack) from [<801197a0>] (__warn.part.3+0xcc/0xe4)
[<801196d4>] (__warn.part.3) from [<80119830>] (warn_slowpath_fmt+0x78/0x94)
 r6:00000070 r5:808e540c r4:81c03048
[<801197bc>] (warn_slowpath_fmt) from [<8015ed78>] (dma_common_free_remap+0x88/0x8c)
 r3:9ef82980 r2:808e53e0
 r7:00001000 r6:a0b1e000 r5:a0b1e000 r4:00001000
[<8015ecf0>] (dma_common_free_remap) from [<8010fa9c>] (remap_allocator_free+0x60/0x68)
 r5:81c03048 r4:9f4d9b78
[<8010fa3c>] (remap_allocator_free) from [<801100d0>] (__arm_dma_free.constprop.3+0xf8/0x148)
 r5:81c03048 r4:9ef82900
[<8010ffd8>] (__arm_dma_free.constprop.3) from [<80110144>] (arm_dma_free+0x24/0x2c)
 r5:9f563410 r4:80110120
[<80110120>] (arm_dma_free) from [<8015d80c>] (dma_free_attrs+0xa0/0xdc)
[<8015d76c>] (dma_free_attrs) from [<8020f3e4>] (dma_pool_destroy+0xc0/0x154)
 r8:9efa8860 r7:808f02f0 r6:808f02d0 r5:9ef82880 r4:9ef82780
[<8020f324>] (dma_pool_destroy) from [<805525d0>] (ehci_mem_cleanup+0x6c/0x150)
 r7:9f563410 r6:9efa8810 r5:00000000 r4:9efd0148
[<80552564>] (ehci_mem_cleanup) from [<80558e0c>] (ehci_stop+0xac/0xc0)
 r5:9efd0148 r4:9efd0000
[<80558d60>] (ehci_stop) from [<8053c4bc>] (usb_remove_hcd+0xf4/0x1b0)
 r7:9f563410 r6:9efd0074 r5:81c03048 r4:9efd0000
[<8053c3c8>] (usb_remove_hcd) from [<8056361c>] (host_stop+0x48/0xb8)
 r7:9f563410 r6:9efd0000 r5:9f5f4040 r4:9f5f5040
[<805635d4>] (host_stop) from [<80563d0c>] (ci_hdrc_host_destroy+0x34/0x38)
 r7:9f563410 r6:9f5f5040 r5:9efa8800 r4:9f5f4040
[<80563cd8>] (ci_hdrc_host_destroy) from [<8055ef18>] (ci_hdrc_remove+0x50/0x10c)
[<8055eec8>] (ci_hdrc_remove) from [<804a2ed8>] (platform_drv_remove+0x34/0x4c)
 r7:9f563410 r6:81c4f99c r5:9efa8810 r4:9efa8810
[<804a2ea4>] (platform_drv_remove) from [<804a18a8>] (device_release_driver_internal+0xec/0x19c)
 r5:00000000 r4:9efa8810
[<804a17bc>] (device_release_driver_internal) from [<804a1978>] (device_release_driver+0x20/0x24)
 r7:9f563410 r6:81c41ed0 r5:9efa8810 r4:9f4a1dac
[<804a1958>] (device_release_driver) from [<804a01b8>] (bus_remove_device+0xdc/0x108)
[<804a00dc>] (bus_remove_device) from [<8049c204>] (device_del+0x150/0x36c)
 r7:9f563410 r6:81c03048 r5:9efa8854 r4:9efa8810
[<8049c0b4>] (device_del) from [<804a3368>] (platform_device_del.part.2+0x20/0x84)
 r10:9f563414 r9:809177e0 r8:81cb07dc r7:81c78320 r6:9f563454 r5:9efa8800
 r4:9efa8800
[<804a3348>] (platform_device_del.part.2) from [<804a3420>] (platform_device_unregister+0x28/0x34)
 r5:9f563400 r4:9efa8800
[<804a33f8>] (platform_device_unregister) from [<8055dce0>] (ci_hdrc_remove_device+0x1c/0x30)
 r5:9f563400 r4:00000001
[<8055dcc4>] (ci_hdrc_remove_device) from [<805652ac>] (ci_hdrc_imx_remove+0x38/0x118)
 r7:81c78320 r6:9f563454 r5:9f563410 r4:9f541010
[<8056538c>] (ci_hdrc_imx_shutdown) from [<804a2970>] (platform_drv_shutdown+0x2c/0x30)
[<804a2944>] (platform_drv_shutdown) from [<8049e4fc>] (device_shutdown+0x158/0x1f0)
[<8049e3a4>] (device_shutdown) from [<8013ac80>] (kernel_restart_prepare+0x44/0x48)
 r10:00000058 r9:9f4d8000 r8:fee1dead r7:379ce700 r6:81c0b280 r5:81c03048
 r4:00000000
[<8013ac3c>] (kernel_restart_prepare) from [<8013ad14>] (kernel_restart+0x1c/0x60)
[<8013acf8>] (kernel_restart) from [<8013af84>] (__do_sys_reboot+0xe0/0x1d8)
 r5:81c03048 r4:00000000
[<8013aea4>] (__do_sys_reboot) from [<8013b0ec>] (sys_reboot+0x18/0x1c)
 r8:80101204 r7:00000058 r6:00000000 r5:00000000 r4:00000000
[<8013b0d4>] (sys_reboot) from [<80101000>] (ret_fast_syscall+0x0/0x54)
Exception stack(0x9f4d9fa8 to 0x9f4d9ff0)
9fa0:                   00000000 00000000 fee1dead 28121969 01234567 379ce700
9fc0: 00000000 00000000 00000000 00000058 00000000 00000000 00000000 00016d04
9fe0: 00028e0c 7ec87c64 000135ec 76c1f410

Restore original invalid input check in dma_common_free_remap() to
avoid this problem.

Fixes: 5cf4537975 ("dma-mapping: introduce a dma_common_find_pages helper")
Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com>
[hch: just revert the offending hunk instead of creating a new helper]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-10-05 10:24:17 +02:00
Dmitry Goldin
86cdd2fdc4 kheaders: make headers archive reproducible
In commit 43d8ce9d65 ("Provide in-kernel headers to make
extending kernel easier") a new mechanism was introduced, for kernels
>=5.2, which embeds the kernel headers in the kernel image or a module
and exposes them in procfs for use by userland tools.

The archive containing the header files has nondeterminism caused by
header files metadata. This patch normalizes the metadata and utilizes
KBUILD_BUILD_TIMESTAMP if provided and otherwise falls back to the
default behaviour.

In commit f7b101d330 ("kheaders: Move from proc to sysfs") it was
modified to use sysfs and the script for generation of the archive was
renamed to what is being patched.

Signed-off-by: Dmitry Goldin <dgoldin+lkml@protonmail.ch>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-10-05 15:29:49 +09:00
Linus Torvalds
e524d16e7e copy-struct-from-user-v5.4-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXZZIgQAKCRCRxhvAZXjc
 orNOAP98B2nmoxvq8d5Z6PhoyTBC5NIUuJ5h2YMwcX/hAaj5uQEA58NTKtPmOPDR
 2ffUFFerGZ2+brlHgACa0ZKdH27TjAA=
 =QryD
 -----END PGP SIGNATURE-----

Merge tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull copy_struct_from_user() helper from Christian Brauner:
 "This contains the copy_struct_from_user() helper which got split out
  from the openat2() patchset. It is a generic interface designed to
  copy a struct from userspace.

  The helper will be especially useful for structs versioned by size of
  which we have quite a few. This allows for backwards compatibility,
  i.e. an extended struct can be passed to an older kernel, or a legacy
  struct can be passed to a newer kernel. For the first case (extended
  struct, older kernel) the new fields in an extended struct can be set
  to zero and the struct safely passed to an older kernel.

  The most obvious benefit is that this helper lets us get rid of
  duplicate code present in at least sched_setattr(), perf_event_open(),
  and clone3(). More importantly it will also help to ensure that users
  implementing versioning-by-size end up with the same core semantics.

  This point is especially crucial since we have at least one case where
  versioning-by-size is used but with slighly different semantics:
  sched_setattr(), perf_event_open(), and clone3() all do do similar
  checks to copy_struct_from_user() while rt_sigprocmask(2) always
  rejects differently-sized struct arguments.

  With this pull request we also switch over sched_setattr(),
  perf_event_open(), and clone3() to use the new helper"

* tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  usercopy: Add parentheses around assignment in test_copy_struct_from_user
  perf_event_open: switch to copy_struct_from_user()
  sched_setattr: switch to copy_struct_from_user()
  clone3: switch to copy_struct_from_user()
  lib: introduce copy_struct_from_user() helper
2019-10-04 10:36:31 -07:00
Tejun Heo
e66b39af00 workqueue: Fix pwq ref leak in rescuer_thread()
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration.  Unfortunately, it doesn't check whether the pwq is
already on the mayday list and unconditionally gets the ref and moves
it onto the list.  This doesn't corrupt the list but creates an
additional reference to the pwq.  It got queued twice but will only be
removed once.

This leak later can trigger pwq refcnt warning on workqueue
destruction and prevent freeing of the workqueue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org # v3.19+
2019-10-04 10:23:11 -07:00
Tejun Heo
c29eb85386 workqueue: more destroy_workqueue() fixes
destroy_workqueue() warnings still, at a lower frequency, trigger
spuriously.  The problem seems to be in-flight operations which
haven't reached put_pwq() yet.

* Make sanity check grab all the related locks so that it's
  synchronized against operations which puts pwq at the end.

* Always print out the offending pwq.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Williams, Gerald S" <gerald.s.williams@intel.com>
2019-10-04 10:23:01 -07:00
Linus Torvalds
af0622f6ae for-linus-20191003
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXZZKNgAKCRCRxhvAZXjc
 otfIAPsHUZn+Wfa/8uftNDJ6RLDXDsq6l8xiQTkz+k4YdnDj2AD/aIPjrM950jrS
 W7+8R7CSSQOLmIif6R+S0A1fyFoVlQA=
 =HVz0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191003' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull clone3/pidfd fixes from Christian Brauner:
 "This contains a couple of fixes:

   - Fix pidfd selftest compilation (Shuah Kahn)

     Due to a false linking instruction in the Makefile compilation for
     the pidfd selftests would fail on some systems.

   - Fix compilation for glibc on RISC-V systems (Seth Forshee)

     In some scenarios linux/uapi/linux/sched.h is included where
     __ASSEMBLY__ is defined causing a build failure because struct
     clone_args was not guarded by an #ifndef __ASSEMBLY__.

   - Add missing clone3() and struct clone_args kernel-doc (Christian Brauner)

     clone3() and struct clone_args were missing kernel-docs. (The goal
     is to use kernel-doc for any function or type where it's worth it.)
     For struct clone_args this also contains a comment about the fact
     that it's versioned by size"

* tag 'for-linus-20191003' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  sched: add kernel-doc for struct clone_args
  fork: add kernel-doc for clone3
  selftests: pidfd: Fix undefined reference to pthread_create()
  sched: Add __ASSEMBLY__ guards around struct clone_args
2019-10-04 10:18:56 -07:00
Christian Brauner
501bd0166e
fork: add kernel-doc for clone3
Add kernel-doc for the clone3() syscall.

Link: https://lore.kernel.org/r/20191001114701.24661-2-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-03 21:18:06 +02:00
Kees Cook
245d73698e audit: Report suspicious O_CREAT usage
This renames the very specific audit_log_link_denied() to
audit_log_path_denied() and adds the AUDIT_* type as an argument. This
allows for the creation of the new AUDIT_ANOM_CREAT that can be used to
report the fifo/regular file creation restrictions that were introduced
in commit 30aba6656f ("namei: allow restricted O_CREAT of FIFOs and
regular files").

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-10-03 13:59:29 -04:00
Linus Torvalds
5021b9182e Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix a broadcast-timer handling race that can result in spuriously and
  indefinitely delayed hrtimers and even RCU stalls if the system is
  otherwise quiet"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: broadcast-hrtimer: Fix a race in bc_set_next
2019-10-02 15:54:19 -07:00
Peter Zijlstra
73956fc07d membarrier: Fix RCU locking bug caused by faulty merge
The following commit:

  227a4aadc7 ("sched/membarrier: Fix p->mm->membarrier_state racy load")

got fat fingered by me when merging it with other patches. It meant to move
the RCU section out of the for loop but ended up doing it partially, leaving
a superfluous rcu_read_lock() inside, causing havok.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-tip-commits@vger.kernel.org
Fixes: 227a4aadc7 ("sched/membarrier: Fix p->mm->membarrier_state racy load")
Link: https://lkml.kernel.org/r/20191001085033.GP4519@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-01 21:27:50 +02:00
Aleksa Sarai
c2ba8f41ad perf_event_open: switch to copy_struct_from_user()
Switch perf_event_open() syscall from it's own copying
struct perf_event_attr from userspace to the new dedicated
copy_struct_from_user() helper.

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-5-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-01 15:45:22 +02:00
Aleksa Sarai
dff3a85fec sched_setattr: switch to copy_struct_from_user()
Switch sched_setattr() syscall from it's own copying struct sched_attr
from userspace to the new dedicated copy_struct_from_user() helper.

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.

[1]: commit 1251201c0d ("sched/core: Fix uclamp ABI bug, clean up and
     robustify sched_read_attr() ABI logic and code")

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-4-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-01 15:45:17 +02:00
Aleksa Sarai
f14c234b4b clone3: switch to copy_struct_from_user()
Switch clone3() syscall from it's own copying struct clone_args from
userspace to the new dedicated copy_struct_from_user() helper.

The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Additionally, explicitly
define CLONE_ARGS_SIZE_VER0 to match the other users of the
struct-extension pattern.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-3-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-10-01 15:45:10 +02:00
Iurii Zaikin
2cb80dbbba kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
KUnit tests for initialized data behavior of proc_dointvec that is
explicitly checked in the code. Includes basic parsing tests including
int min/max overflow.

Signed-off-by: Iurii Zaikin <yzaikin@google.com>
Signed-off-by: Brendan Higgins <brendanhiggins@google.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2019-09-30 17:35:01 -06:00
Linus Torvalds
cf4f493b10 A few more tracing fixes:
- Fixed a buffer overflow by checking nr_args correctly in probes
 
  - Fixed a warning that is reported by clang
 
  - Fixed a possible memory leak in error path of filter processing
 
  - Fixed the selftest that checks for failures, but wasn't failing
 
  - Minor clean up on call site output of a memory trace event
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXZEP5hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhrSAQDlws8rY/vJN4tKL1YaBTRyS5OW+1B+
 LPLOxm9PBuzt0wEArVunv7iMgvRzp5spbmCqmD8Is2vSf+45KSrb10WU2wo=
 =L37R
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A few more tracing fixes:

   - Fix a buffer overflow by checking nr_args correctly in probes

   - Fix a warning that is reported by clang

   - Fix a possible memory leak in error path of filter processing

   - Fix the selftest that checks for failures, but wasn't failing

   - Minor clean up on call site output of a memory trace event"

* tag 'trace-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  selftests/ftrace: Fix same probe error test
  mm, tracing: Print symbol name for call_site in trace events
  tracing: Have error path in predicate_parse() free its allocated memory
  tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macro
  tracing/probe: Fix to check the difference of nr_args before adding probe
2019-09-30 09:29:53 -07:00
Linus Torvalds
02dc96ef6c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Sanity check URB networking device parameters to avoid divide by
    zero, from Oliver Neukum.

 2) Disable global multicast filter in NCSI, otherwise LLDP and IPV6
    don't work properly. Longer term this needs a better fix tho. From
    Vijay Khemka.

 3) Small fixes to selftests (use ping when ping6 is not present, etc.)
    from David Ahern.

 4) Bring back rt_uses_gateway member of struct rtable, it's semantics
    were not well understood and trying to remove it broke things. From
    David Ahern.

 5) Move usbnet snaity checking, ignore endpoints with invalid
    wMaxPacketSize. From Bjørn Mork.

 6) Missing Kconfig deps for sja1105 driver, from Mao Wenan.

 7) Various small fixes to the mlx5 DR steering code, from Alaa Hleihel,
    Alex Vesker, and Yevgeny Kliteynik

 8) Missing CAP_NET_RAW checks in various places, from Ori Nimron.

 9) Fix crash when removing sch_cbs entry while offloading is enabled,
    from Vinicius Costa Gomes.

10) Signedness bug fixes, generally in looking at the result given by
    of_get_phy_mode() and friends. From Dan Crapenter.

11) Disable preemption around BPF_PROG_RUN() calls, from Eric Dumazet.

12) Don't create VRF ipv6 rules if ipv6 is disabled, from David Ahern.

13) Fix quantization code in tcp_bbr, from Kevin Yang.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (127 commits)
  net: tap: clean up an indentation issue
  nfp: abm: fix memory leak in nfp_abm_u32_knode_replace
  tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state
  sk_buff: drop all skb extensions on free and skb scrubbing
  tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
  mlxsw: spectrum_flower: Fail in case user specifies multiple mirror actions
  Documentation: Clarify trap's description
  mlxsw: spectrum: Clear VLAN filters during port initialization
  net: ena: clean up indentation issue
  NFC: st95hf: clean up indentation issue
  net: phy: micrel: add Asym Pause workaround for KSZ9021
  net: socionext: ave: Avoid using netdev_err() before calling register_netdev()
  ptp: correctly disable flags on old ioctls
  lib: dimlib: fix help text typos
  net: dsa: microchip: Always set regmap stride to 1
  nfp: flower: fix memory leak in nfp_flower_spawn_vnic_reprs
  nfp: flower: prevent memory leak in nfp_flower_spawn_phy_reprs
  net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
  vrf: Do not attempt to create IPv6 mcast rule if IPv6 is disabled
  net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
  ...
2019-09-28 17:47:33 -07:00
Navid Emamdoost
96c5c6e6a5 tracing: Have error path in predicate_parse() free its allocated memory
In predicate_parse, there is an error path that is not going to
out_free instead it returns directly which leads to a memory leak.

Link: http://lkml.kernel.org/r/20190920225800.3870-1-navid.emamdoost@gmail.com

Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28 17:13:39 -04:00
Nathan Chancellor
968e517093 tracing: Fix clang -Wint-in-bool-context warnings in IF_ASSIGN macro
After r372664 in clang, the IF_ASSIGN macro causes a couple hundred
warnings along the lines of:

kernel/trace/trace_output.c:1331:2: warning: converting the enum
constant to a boolean [-Wint-in-bool-context]
kernel/trace/trace.h:409:3: note: expanded from macro
'trace_assign_type'
                IF_ASSIGN(var, ent, struct ftrace_graph_ret_entry,
                ^
kernel/trace/trace.h:371:14: note: expanded from macro 'IF_ASSIGN'
                WARN_ON(id && (entry)->type != id);     \
                           ^
264 warnings generated.

This warning can catch issues with constructs like:

    if (state == A || B)

where the developer really meant:

    if (state == A || state == B)

This is currently the only occurrence of the warning in the kernel
tree across defconfig, allyesconfig, allmodconfig for arm32, arm64,
and x86_64. Add the implicit '!= 0' to the WARN_ON statement to fix
the warnings and find potential issues in the future.

Link: 28b38c277a
Link: https://github.com/ClangBuiltLinux/linux/issues/686
Link: http://lkml.kernel.org/r/20190926162258.466321-1-natechancellor@gmail.com

Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28 17:13:39 -04:00
Masami Hiramatsu
d2aea95a1a tracing/probe: Fix to check the difference of nr_args before adding probe
Steven reported that a test triggered:

==================================================================
 BUG: KASAN: slab-out-of-bounds in trace_kprobe_create+0xa9e/0xe40
 Read of size 8 at addr ffff8880c4f25a48 by task ftracetest/4798

 CPU: 2 PID: 4798 Comm: ftracetest Not tainted 5.3.0-rc6-test+ #30
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 Call Trace:
  dump_stack+0x7c/0xc0
  ? trace_kprobe_create+0xa9e/0xe40
  print_address_description+0x6c/0x332
  ? trace_kprobe_create+0xa9e/0xe40
  ? trace_kprobe_create+0xa9e/0xe40
  __kasan_report.cold.6+0x1a/0x3b
  ? trace_kprobe_create+0xa9e/0xe40
  kasan_report+0xe/0x12
  trace_kprobe_create+0xa9e/0xe40
  ? print_kprobe_event+0x280/0x280
  ? match_held_lock+0x1b/0x240
  ? find_held_lock+0xac/0xd0
  ? fs_reclaim_release.part.112+0x5/0x20
  ? lock_downgrade+0x350/0x350
  ? kasan_unpoison_shadow+0x30/0x40
  ? __kasan_kmalloc.constprop.6+0xc1/0xd0
  ? trace_kprobe_create+0xe40/0xe40
  ? trace_kprobe_create+0xe40/0xe40
  create_or_delete_trace_kprobe+0x2e/0x60
  trace_run_command+0xc3/0xe0
  ? trace_panic_handler+0x20/0x20
  ? kasan_unpoison_shadow+0x30/0x40
  trace_parse_run_command+0xdc/0x163
  vfs_write+0xe1/0x240
  ksys_write+0xba/0x150
  ? __ia32_sys_read+0x50/0x50
  ? tracer_hardirqs_on+0x61/0x180
  ? trace_hardirqs_off_caller+0x43/0x110
  ? mark_held_locks+0x29/0xa0
  ? do_syscall_64+0x14/0x260
  do_syscall_64+0x68/0x260

Fix to check the difference of nr_args before adding probe
on existing probes. This also may set the error log index
bigger than the number of command parameters. In that case
it sets the error position is next to the last parameter.

Link: http://lkml.kernel.org/r/156966474783.3478.13217501608215769150.stgit@devnote2

Fixes: ca89bc071d ("tracing/kprobe: Add multi-probe per event support")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-28 17:07:53 -04:00
Linus Torvalds
9c5efe9ae7 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Apply a number of membarrier related fixes and cleanups, which fixes
   a use-after-free race in the membarrier code

 - Introduce proper RCU protection for tasks on the runqueue - to get
   rid of the subtle task_rcu_dereference() interface that was easy to
   get wrong

 - Misc fixes, but also an EAS speedup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Avoid redundant EAS calculation
  sched/core: Remove double update_max_interval() call on CPU startup
  sched/core: Fix preempt_schedule() interrupt return comment
  sched/fair: Fix -Wunused-but-set-variable warnings
  sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
  sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
  sched/membarrier: Skip IPIs when mm->mm_users == 1
  selftests, sched/membarrier: Add multi-threaded test
  sched/membarrier: Fix p->mm->membarrier_state racy load
  sched/membarrier: Call sync_core only before usermode for same mm
  sched/membarrier: Remove redundant check
  sched/membarrier: Fix private expedited registration check
  tasks, sched/core: RCUify the assignment of rq->curr
  tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
  tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
  tasks: Add a count of task RCU users
  sched/core: Convert vcpu_is_preempted() from macro to an inline function
  sched/fair: Remove unused cfs_rq_clock_task() function
2019-09-28 12:39:07 -07:00
Linus Torvalds
aefcf2f4b5 Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
 "This is the latest iteration of the kernel lockdown patchset, from
  Matthew Garrett, David Howells and others.

  From the original description:

    This patchset introduces an optional kernel lockdown feature,
    intended to strengthen the boundary between UID 0 and the kernel.
    When enabled, various pieces of kernel functionality are restricted.
    Applications that rely on low-level access to either hardware or the
    kernel may cease working as a result - therefore this should not be
    enabled without appropriate evaluation beforehand.

    The majority of mainstream distributions have been carrying variants
    of this patchset for many years now, so there's value in providing a
    doesn't meet every distribution requirement, but gets us much closer
    to not requiring external patches.

  There are two major changes since this was last proposed for mainline:

   - Separating lockdown from EFI secure boot. Background discussion is
     covered here: https://lwn.net/Articles/751061/

   -  Implementation as an LSM, with a default stackable lockdown LSM
      module. This allows the lockdown feature to be policy-driven,
      rather than encoding an implicit policy within the mechanism.

  The new locked_down LSM hook is provided to allow LSMs to make a
  policy decision around whether kernel functionality that would allow
  tampering with or examining the runtime state of the kernel should be
  permitted.

  The included lockdown LSM provides an implementation with a simple
  policy intended for general purpose use. This policy provides a coarse
  level of granularity, controllable via the kernel command line:

    lockdown={integrity|confidentiality}

  Enable the kernel lockdown feature. If set to integrity, kernel features
  that allow userland to modify the running kernel are disabled. If set to
  confidentiality, kernel features that allow userland to extract
  confidential information from the kernel are also disabled.

  This may also be controlled via /sys/kernel/security/lockdown and
  overriden by kernel configuration.

  New or existing LSMs may implement finer-grained controls of the
  lockdown features. Refer to the lockdown_reason documentation in
  include/linux/security.h for details.

  The lockdown feature has had signficant design feedback and review
  across many subsystems. This code has been in linux-next for some
  weeks, with a few fixes applied along the way.

  Stephen Rothwell noted that commit 9d1f8be5cf ("bpf: Restrict bpf
  when kernel lockdown is in confidentiality mode") is missing a
  Signed-off-by from its author. Matthew responded that he is providing
  this under category (c) of the DCO"

* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
  kexec: Fix file verification on S390
  security: constify some arrays in lockdown LSM
  lockdown: Print current->comm in restriction messages
  efi: Restrict efivar_ssdt_load when the kernel is locked down
  tracefs: Restrict tracefs when the kernel is locked down
  debugfs: Restrict debugfs when the kernel is locked down
  kexec: Allow kexec_file() with appropriate IMA policy when locked down
  lockdown: Lock down perf when in confidentiality mode
  bpf: Restrict bpf when kernel lockdown is in confidentiality mode
  lockdown: Lock down tracing and perf kprobes when in confidentiality mode
  lockdown: Lock down /proc/kcore
  x86/mmiotrace: Lock down the testmmiotrace module
  lockdown: Lock down module params that specify hardware parameters (eg. ioport)
  lockdown: Lock down TIOCSSERIAL
  lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
  acpi: Disable ACPI table override if the kernel is locked down
  acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
  ACPI: Limit access to custom_method when the kernel is locked down
  x86/msr: Restrict MSR access when the kernel is locked down
  x86: Lock down IO port access when the kernel is locked down
  ...
2019-09-28 08:14:15 -07:00
Linus Torvalds
f1f2f614d5 Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
 "The major feature in this time is IMA support for measuring and
  appraising appended file signatures. In addition are a couple of bug
  fixes and code cleanup to use struct_size().

  In addition to the PE/COFF and IMA xattr signatures, the kexec kernel
  image may be signed with an appended signature, using the same
  scripts/sign-file tool that is used to sign kernel modules.

  Similarly, the initramfs may contain an appended signature.

  This contained a lot of refactoring of the existing appended signature
  verification code, so that IMA could retain the existing framework of
  calculating the file hash once, storing it in the IMA measurement list
  and extending the TPM, verifying the file's integrity based on a file
  hash or signature (eg. xattrs), and adding an audit record containing
  the file hash, all based on policy. (The IMA support for appended
  signatures patch set was posted and reviewed 11 times.)

  The support for appended signature paves the way for adding other
  signature verification methods, such as fs-verity, based on a single
  system-wide policy. The file hash used for verifying the signature and
  the signature, itself, can be included in the IMA measurement list"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  ima: ima_api: Use struct_size() in kzalloc()
  ima: use struct_size() in kzalloc()
  sefltest/ima: support appended signatures (modsig)
  ima: Fix use after free in ima_read_modsig()
  MODSIGN: make new include file self contained
  ima: fix freeing ongoing ahash_request
  ima: always return negative code for error
  ima: Store the measurement again when appraising a modsig
  ima: Define ima-modsig template
  ima: Collect modsig
  ima: Implement support for module-style appended signatures
  ima: Factor xattr_verify() out of ima_appraise_measurement()
  ima: Add modsig appraise_type option for module-style appended signatures
  integrity: Select CONFIG_KEYS instead of depending on it
  PKCS#7: Introduce pkcs7_get_digest()
  PKCS#7: Refactor verify_pkcs7_signature()
  MODSIGN: Export module signature definitions
  ima: initialize the "template" field with the default template
2019-09-27 19:37:27 -07:00
Linus Torvalds
8bbe0dec38 x86 KVM changes:
* The usual accuracy improvements for nested virtualization
 * The usual round of code cleanups from Sean
 * Added back optimizations that were prematurely removed in 5.2
   (the bare minimum needed to fix the regression was in 5.3-rc8,
   here comes the rest)
 * Support for UMWAIT/UMONITOR/TPAUSE
 * Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM
 * Tell Windows guests if SMT is disabled on the host
 * More accurate detection of vmexit cost
 * Revert a pvqspinlock pessimization
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdjfaKAAoJEL/70l94x66D8MAH/2thJnM47tYtMTFA4GBFugeH
 mAx8OApWFBo8apOip+8ElFLPQ8FQdZCzr9ti8H4JkuzKxgsxCs1iqEg5pHEKxSTi
 K9kLOZwoFtwgy3XmxC0PIZ9lT2Wx74ruh1HF+QG/YsjKH636UPv2VpmulsTNbm62
 2ryzOb3TlGT/cjf+gv9l6IYIxZa2Ff19PF4i//H8u4YRBj358/jr99CK01iE0M9r
 4NhEKiQZywzREWtKxymGOM6HEbwbWcIa+loYjj2htq8epep6f9Y1zQ0Jcn5+nPA0
 cn1T2gGJAJ0OUahKLwNbz8pzrFDkW+eoQgqCBJZ4RT9Uf8WCESfl14p+/vRkAMg=
 =tk5S
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "x86 KVM changes:

   - The usual accuracy improvements for nested virtualization

   - The usual round of code cleanups from Sean

   - Added back optimizations that were prematurely removed in 5.2 (the
     bare minimum needed to fix the regression was in 5.3-rc8, here
     comes the rest)

   - Support for UMWAIT/UMONITOR/TPAUSE

   - Direct L2->L0 TLB flushing when L0 is Hyper-V and L1 is KVM

   - Tell Windows guests if SMT is disabled on the host

   - More accurate detection of vmexit cost

   - Revert a pvqspinlock pessimization"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (56 commits)
  KVM: nVMX: cleanup and fix host 64-bit mode checks
  KVM: vmx: fix build warnings in hv_enable_direct_tlbflush() on i386
  KVM: x86: Don't check kvm_rebooting in __kvm_handle_fault_on_reboot()
  KVM: x86: Drop ____kvm_handle_fault_on_reboot()
  KVM: VMX: Add error handling to VMREAD helper
  KVM: VMX: Optimize VMX instruction error and fault handling
  KVM: x86: Check kvm_rebooting in kvm_spurious_fault()
  KVM: selftests: fix ucall on x86
  Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"
  kvm: nvmx: limit atomic switch MSRs
  kvm: svm: Intercept RDPRU
  kvm: x86: Add "significant index" flag to a few CPUID leaves
  KVM: x86/mmu: Skip invalid pages during zapping iff root_count is zero
  KVM: x86/mmu: Explicitly track only a single invalid mmu generation
  KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call"
  KVM: x86/mmu: Revert "Revert "KVM: MMU: reclaim the zapped-obsolete page first""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: collapse TLB flushes when zap all pages""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages""
  KVM: x86/mmu: Revert "Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints""
  ...
2019-09-27 12:44:26 -07:00
Balasubramani Vivekanandan
b9023b91dd tick: broadcast-hrtimer: Fix a race in bc_set_next
When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on other core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could have also returned. But still there is a small time window where
the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
without doing anything, but the next_event of the tick broadcast clock
device is already set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer callback
and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started which happens when the racing callback returns
HRTIMER_NORESTART. The hrtimer might never recover if all further requests
from the CPUs to subscribe to tick broadcast have timeout greater than the
next_event of tick broadcast clock device. This leads to cascading of
failures and finally noticed as rcu stall warnings

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

The comment that hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. Also it is
safe to start the hrtimer from the enter/exit idle code while the broadcast
handler is active. The enter/exit idle code and the broadcast handler are
synchronized using tick_broadcast_lock. So there is no need for the
existing try to cancel logic. All this can be removed which will eliminate
the race condition as well.

Fixes: 5d1638acb9 ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190926135101.12102-2-balasubramani_vivekanandan@mentor.com
2019-09-27 14:45:55 +02:00
Allan Zhang
768fb61fcc bpf: Fix bpf_event_output re-entry issue
BPF_PROG_TYPE_SOCK_OPS program can reenter bpf_event_output because it
can be called from atomic and non-atomic contexts since we don't have
bpf_prog_active to prevent it happen.

This patch enables 3 levels of nesting to support normal, irq and nmi
context.

We can easily reproduce the issue by running netperf crr mode with 100
flows and 10 threads from netperf client side.

Here is the whole stack dump:

[  515.228898] WARNING: CPU: 20 PID: 14686 at kernel/trace/bpf_trace.c:549 bpf_event_output+0x1f9/0x220
[  515.228903] CPU: 20 PID: 14686 Comm: tcp_crr Tainted: G        W        4.15.0-smp-fixpanic #44
[  515.228904] Hardware name: Intel TBG,ICH10/Ikaria_QC_1b, BIOS 1.22.0 06/04/2018
[  515.228905] RIP: 0010:bpf_event_output+0x1f9/0x220
[  515.228906] RSP: 0018:ffff9a57ffc03938 EFLAGS: 00010246
[  515.228907] RAX: 0000000000000012 RBX: 0000000000000001 RCX: 0000000000000000
[  515.228907] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffffffff836b0f80
[  515.228908] RBP: ffff9a57ffc039c8 R08: 0000000000000004 R09: 0000000000000012
[  515.228908] R10: ffff9a57ffc1de40 R11: 0000000000000000 R12: 0000000000000002
[  515.228909] R13: ffff9a57e13bae00 R14: 00000000ffffffff R15: ffff9a57ffc1e2c0
[  515.228910] FS:  00007f5a3e6ec700(0000) GS:ffff9a57ffc00000(0000) knlGS:0000000000000000
[  515.228910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  515.228911] CR2: 0000537082664fff CR3: 000000061fed6002 CR4: 00000000000226f0
[  515.228911] Call Trace:
[  515.228913]  <IRQ>
[  515.228919]  [<ffffffff82c6c6cb>] bpf_sockopt_event_output+0x3b/0x50
[  515.228923]  [<ffffffff8265daee>] ? bpf_ktime_get_ns+0xe/0x10
[  515.228927]  [<ffffffff8266fda5>] ? __cgroup_bpf_run_filter_sock_ops+0x85/0x100
[  515.228930]  [<ffffffff82cf90a5>] ? tcp_init_transfer+0x125/0x150
[  515.228933]  [<ffffffff82cf9159>] ? tcp_finish_connect+0x89/0x110
[  515.228936]  [<ffffffff82cf98e4>] ? tcp_rcv_state_process+0x704/0x1010
[  515.228939]  [<ffffffff82c6e263>] ? sk_filter_trim_cap+0x53/0x2a0
[  515.228942]  [<ffffffff82d90d1f>] ? tcp_v6_inbound_md5_hash+0x6f/0x1d0
[  515.228945]  [<ffffffff82d92160>] ? tcp_v6_do_rcv+0x1c0/0x460
[  515.228947]  [<ffffffff82d93558>] ? tcp_v6_rcv+0x9f8/0xb30
[  515.228951]  [<ffffffff82d737c0>] ? ip6_route_input+0x190/0x220
[  515.228955]  [<ffffffff82d5f7ad>] ? ip6_protocol_deliver_rcu+0x6d/0x450
[  515.228958]  [<ffffffff82d60246>] ? ip6_rcv_finish+0xb6/0x170
[  515.228961]  [<ffffffff82d5fb90>] ? ip6_protocol_deliver_rcu+0x450/0x450
[  515.228963]  [<ffffffff82d60361>] ? ipv6_rcv+0x61/0xe0
[  515.228966]  [<ffffffff82d60190>] ? ipv6_list_rcv+0x330/0x330
[  515.228969]  [<ffffffff82c4976b>] ? __netif_receive_skb_one_core+0x5b/0xa0
[  515.228972]  [<ffffffff82c497d1>] ? __netif_receive_skb+0x21/0x70
[  515.228975]  [<ffffffff82c4a8d2>] ? process_backlog+0xb2/0x150
[  515.228978]  [<ffffffff82c4aadf>] ? net_rx_action+0x16f/0x410
[  515.228982]  [<ffffffff830000dd>] ? __do_softirq+0xdd/0x305
[  515.228986]  [<ffffffff8252cfdc>] ? irq_exit+0x9c/0xb0
[  515.228989]  [<ffffffff82e02de5>] ? smp_call_function_single_interrupt+0x65/0x120
[  515.228991]  [<ffffffff82e020e1>] ? call_function_single_interrupt+0x81/0x90
[  515.228992]  </IRQ>
[  515.228996]  [<ffffffff82a11ff0>] ? io_serial_in+0x20/0x20
[  515.229000]  [<ffffffff8259c040>] ? console_unlock+0x230/0x490
[  515.229003]  [<ffffffff8259cbaa>] ? vprintk_emit+0x26a/0x2a0
[  515.229006]  [<ffffffff8259cbff>] ? vprintk_default+0x1f/0x30
[  515.229008]  [<ffffffff8259d9f5>] ? vprintk_func+0x35/0x70
[  515.229011]  [<ffffffff8259d4bb>] ? printk+0x50/0x66
[  515.229013]  [<ffffffff82637637>] ? bpf_event_output+0xb7/0x220
[  515.229016]  [<ffffffff82c6c6cb>] ? bpf_sockopt_event_output+0x3b/0x50
[  515.229019]  [<ffffffff8265daee>] ? bpf_ktime_get_ns+0xe/0x10
[  515.229023]  [<ffffffff82c29e87>] ? release_sock+0x97/0xb0
[  515.229026]  [<ffffffff82ce9d6a>] ? tcp_recvmsg+0x31a/0xda0
[  515.229029]  [<ffffffff8266fda5>] ? __cgroup_bpf_run_filter_sock_ops+0x85/0x100
[  515.229032]  [<ffffffff82ce77c1>] ? tcp_set_state+0x191/0x1b0
[  515.229035]  [<ffffffff82ced10e>] ? tcp_disconnect+0x2e/0x600
[  515.229038]  [<ffffffff82cecbbb>] ? tcp_close+0x3eb/0x460
[  515.229040]  [<ffffffff82d21082>] ? inet_release+0x42/0x70
[  515.229043]  [<ffffffff82d58809>] ? inet6_release+0x39/0x50
[  515.229046]  [<ffffffff82c1f32d>] ? __sock_release+0x4d/0xd0
[  515.229049]  [<ffffffff82c1f3e5>] ? sock_close+0x15/0x20
[  515.229052]  [<ffffffff8273b517>] ? __fput+0xe7/0x1f0
[  515.229055]  [<ffffffff8273b66e>] ? ____fput+0xe/0x10
[  515.229058]  [<ffffffff82547bf2>] ? task_work_run+0x82/0xb0
[  515.229061]  [<ffffffff824086df>] ? exit_to_usermode_loop+0x7e/0x11f
[  515.229064]  [<ffffffff82408171>] ? do_syscall_64+0x111/0x130
[  515.229067]  [<ffffffff82e0007c>] ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: a5a3a828cd ("bpf: add perf event notificaton support for sock_ops")
Signed-off-by: Allan Zhang <allanzhang@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20190925234312.94063-2-allanzhang@google.com
2019-09-27 11:24:29 +02:00
Linus Torvalds
da05b5ea12 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix a timer expiry bug that would cause spurious delay of timers"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Read jiffies once when forwarding base clk
2019-09-26 15:53:17 -07:00
Linus Torvalds
a7b7b772bb Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more perf updates from Ingo Molnar:
 "The only kernel change is comment typo fixes.

  The rest is mostly tooling fixes, but also new vendor event additions
  and updates, a bigger libperf/libtraceevent library and a header files
  reorganization that came in a bit late"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (108 commits)
  perf unwind: Fix libunwind build failure on i386 systems
  perf parser: Remove needless include directives
  perf build: Add detection of java-11-openjdk-devel package
  perf jvmti: Include JVMTI support for s390
  perf vendor events: Remove P8 HW events which are not supported
  perf evlist: Fix access of freed id arrays
  perf stat: Fix free memory access / memory leaks in metrics
  perf tools: Replace needless mmap.h with what is needed, event.h
  perf evsel: Move config terms to a separate header
  perf evlist: Remove unused perf_evlist__fprintf() method
  perf evsel: Introduce evsel_fprintf.h
  perf evsel: Remove need for symbol_conf in evsel_fprintf.c
  perf copyfile: Move copyfile routines to separate files
  libperf: Add perf_evlist__poll() function
  libperf: Add perf_evlist__add_pollfd() function
  libperf: Add perf_evlist__alloc_pollfd() function
  libperf: Add libperf_init() call to the tests
  libperf: Merge libperf_set_print() into libperf_init()
  libperf: Add libperf dependency for tests targets
  libperf: Use sys/types.h to get ssize_t, not unistd.h
  ...
2019-09-26 15:38:07 -07:00
Linus Torvalds
7897c04ad0 Srikar Dronamraju fixed a bug in the newmulti probe code.
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXYvAlBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlK6APsECr49j3ew/tRCnzkq0Y09w0TLYeHL
 ax6aAVO1fHX0TgEAhCBwkWh8ZcoxGbu1CDOkQjJAqfTFFSu38Klv1P+3PQg=
 =oSuX
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Srikar Dronamraju fixed a bug in the newmulti probe code"

* tag 'trace-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/probe: Fix same probe event argument matching
2019-09-26 13:07:38 -07:00
Colin Ian King
e3439af4a3 bpf: Clean up indentation issue in BTF kflag processing
There is a statement that is indented one level too deeply, remove
the extraneous tab.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20190925093835.19515-1-colin.king@canonical.com
2019-09-26 17:09:18 +02:00
Kees Cook
2da1ead4d5 bug: consolidate __WARN_FLAGS usage
Instead of having separate tests for __WARN_FLAGS, merge the two #ifdef
blocks and replace the synonym WANT_WARN_ON_SLOWPATH macro.

Link: http://lkml.kernel.org/r/20190819234111.9019-7-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:41 -07:00
Kees Cook
d38aba49a9 bug: lift "cut here" out of __warn()
In preparation for cleaning up "cut here", move the "cut here" logic up
out of __warn() and into callers that pass non-NULL args.  For anyone
looking closely, there are two callers that pass NULL args: one already
explicitly prints "cut here".  The remaining case is covered by how a WARN
is built, which will be cleaned up in the next patch.

Link: http://lkml.kernel.org/r/20190819234111.9019-5-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Kees Cook
f2f84b05e0 bug: consolidate warn_slowpath_fmt() usage
Instead of having a separate helper for no printk output, just consolidate
the logic into warn_slowpath_fmt().

Link: http://lkml.kernel.org/r/20190819234111.9019-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Kees Cook
ee8711336c bug: refactor away warn_slowpath_fmt_taint()
Patch series "Clean up WARN() "cut here" handling", v2.

Christophe Leroy noticed that the fix for missing "cut here" in the WARN()
case was adding explicit printk() calls instead of teaching the exception
handler to add it.  This refactors the bug/warn infrastructure to pass
this information as a new BUGFLAG.

Longer details repeated from the last patch in the series:

bug: move WARN_ON() "cut here" into exception handler

The original cleanup of "cut here" missed the WARN_ON() case (that does
not have a printk message), which was fixed recently by adding an explicit
printk of "cut here".  This had the downside of adding a printk() to every
WARN_ON() caller, which reduces the utility of using an instruction
exception to streamline the resulting code.  By making this a new BUGFLAG,
all of these can be removed and "cut here" can be handled by the exception
handler.

This was very pronounced on PowerPC, but the effect can be seen on x86 as
well.  The resulting text size of a defconfig build shows some small
savings from this patch:

   text    data     bss     dec     hex filename
19691167        5134320 1646664 26472151        193eed7 vmlinux.before
19676362        5134260 1663048 26473670        193f4c6 vmlinux.after

This change also opens the door for creating something like BUG_MSG(),
where a custom printk() before issuing BUG(), without confusing the "cut
here" line.

This patch (of 7):

There's no reason to have specialized helpers for passing the warn taint
down to __warn().  Consolidate and refactor helper macros, removing
__WARN_printf() and warn_slowpath_fmt_taint().

Link: http://lkml.kernel.org/r/20190819234111.9019-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Drew Davenport <ddavenport@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Douglas Anderson
7d92bda271 kgdb: don't use a notifier to enter kgdb at panic; call directly
Right now kgdb/kdb hooks up to debug panics by registering for the panic
notifier.  This works OK except that it means that kgdb/kdb gets called
_after_ the CPUs in the system are taken offline.  That means that if
anything important was happening on those CPUs (like something that might
have contributed to the panic) you can't debug them.

Specifically I ran into a case where I got a panic because a task was
"blocked for more than 120 seconds" which was detected on CPU 2.  I nicely
got shown stack traces in the kernel log for all CPUs including CPU 0,
which was running 'PID: 111 Comm: kworker/0:1H' and was in the middle of
__mmc_switch().

I then ended up at the kdb prompt where switched over to kgdb to try to
look at local variables of the process on CPU 0.  I found that I couldn't.
Digging more, I found that I had no info on any tasks running on CPUs
other than CPU 2 and that asking kdb for help showed me "Error: no saved
data for this cpu".  This was because all the CPUs were offline.

Let's move the entry of kdb/kgdb to a direct call from panic() and stop
using the generic notifier.  Putting a direct call in allows us to order
things more properly and it also doesn't seem like we're breaking any
abstractions by calling into the debugger from the panic function.

Daniel said:

: This patch changes the way kdump and kgdb interact with each other.
: However it would seem rather odd to have both tools simultaneously armed
: and, even if they were, the user still has the option to use panic_timeout
: to force a kdump to happen.  Thus I think the change of order is
: acceptable.

Link: http://lkml.kernel.org/r/20190703170354.217312-1-dianders@chromium.org
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Feng Tang <feng.tang@intel.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Tetsuo Handa
7c3a6aedcd kexec: bail out upon SIGKILL when allocating memory.
syzbot found that a thread can stall for minutes inside kexec_load() after
that thread was killed by SIGKILL [1].  It turned out that the reproducer
was trying to allocate 2408MB of memory using kimage_alloc_page() from
kimage_load_normal_segment().  Let's check for SIGKILL before doing memory
allocation.

[1] https://syzkaller.appspot.com/bug?id=a0e3436829698d5824231251fad9d8e998f94f5e

Link: http://lkml.kernel.org/r/993c9185-d324-2640-d061-bed2dd18b1f7@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Sai Praneeth Prakhya
8495f7e673 fork: improve error message for corrupted page tables
When a user process exits, the kernel cleans up the mm_struct of the user
process and during cleanup, check_mm() checks the page tables of the user
process for corruption (E.g: unexpected page flags set/cleared).  For
corrupted page tables, the error message printed by check_mm() isn't very
clear as it prints the loop index instead of page table type (E.g:
Resident file mapping pages vs Resident shared memory pages).  The loop
index in check_mm() is used to index rss_stat[] which represents
individual memory type stats.  Hence, instead of printing index, print
memory type, thereby improving error message.

Without patch:
--------------
[  204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467)
[  204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2
[  204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5
[  204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480

With patch:
-----------
[   69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467)
[   69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2
[   69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5
[   69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480

Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
that it matches the other print statement.

Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.com
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Valdis Kletnieks
0f74914071 kernel/elfcore.c: include proper prototypes
When building with W=1, gcc properly complains that there's no prototypes:

  CC      kernel/elfcore.o
kernel/elfcore.c:7:17: warning: no previous prototype for 'elf_core_extra_phdrs' [-Wmissing-prototypes]
    7 | Elf_Half __weak elf_core_extra_phdrs(void)
      |                 ^~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:12:12: warning: no previous prototype for 'elf_core_write_extra_phdrs' [-Wmissing-prototypes]
   12 | int __weak elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:17:12: warning: no previous prototype for 'elf_core_write_extra_data' [-Wmissing-prototypes]
   17 | int __weak elf_core_write_extra_data(struct coredump_params *cprm)
      |            ^~~~~~~~~~~~~~~~~~~~~~~~~
kernel/elfcore.c:22:15: warning: no previous prototype for 'elf_core_extra_data_size' [-Wmissing-prototypes]
   22 | size_t __weak elf_core_extra_data_size(void)
      |               ^~~~~~~~~~~~~~~~~~~~~~~~

Provide the include file so gcc is happy, and we don't have potential code drift

Link: http://lkml.kernel.org/r/29875.1565224705@turing-police
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:39 -07:00
Jonathan Lemon
fcd30ae066 bpf/xskmap: Return ERR_PTR for failure case instead of NULL.
When kzalloc() failed, NULL was returned to the caller, which
tested the pointer with IS_ERR(), which didn't match, so the
pointer was used later, resulting in a NULL dereference.

Return ERR_PTR(-ENOMEM) instead of NULL.

Reported-by: syzbot+491c1b7565ba9069ecae@syzkaller.appspotmail.com
Fixes: 0402acd683 ("xsk: remove AF_XDP socket from map when the socket is released")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-25 22:14:16 +02:00
Quentin Perret
4892f51ad5 sched/fair: Avoid redundant EAS calculation
The EAS wake-up path computes the system energy for several CPU
candidates: the CPU with maximum spare capacity in each performance
domain, and the prev_cpu. However, if prev_cpu also happens to be the
CPU with maximum spare capacity in its performance domain, the energy
calculation is still done twice, unnecessarily.

Add a condition to filter out this corner case before doing the energy
calculation.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Quentin Perret <qperret@qperret.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: qais.yousef@arm.com
Cc: rjw@rjwysocki.net
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: eb92692b25 ("sched/fair: Speed-up energy-aware wake-ups")
Link: https://lkml.kernel.org/r/20190920094115.GA11503@qperret.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:32 +02:00
Valentin Schneider
9fc41acc89 sched/core: Remove double update_max_interval() call on CPU startup
update_max_interval() is called in both CPUHP_AP_SCHED_STARTING's startup
and teardown callbacks, but it turns out it's also called at the end of
the startup callback of CPUHP_AP_ACTIVE (which is further down the
startup sequence).

There's no point in repeating this interval update in the startup sequence
since the CPU will remain online until it goes down the teardown path.

Remove the redundant call in sched_cpu_activate() (CPUHP_AP_ACTIVE).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190923093017.11755-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:32 +02:00
Valentin Schneider
a49b4f4012 sched/core: Fix preempt_schedule() interrupt return comment
preempt_schedule_irq() is the one that should be called on return from
interrupt, clean up the comment to avoid any ambiguity.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-riscv@lists.infradead.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Link: https://lkml.kernel.org/r/20190923143620.29334-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:32 +02:00
Qian Cai
763a9ec06c sched/fair: Fix -Wunused-but-set-variable warnings
Commit:

   de53fd7aed ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

introduced a few compilation warnings:

  kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
  kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
  kernel/sched/fair.c: In function 'start_cfs_bandwidth':
  kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pauld@redhat.com
Fixes: de53fd7aed ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
KeMeng Shi
714e501e16 sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
An oops can be triggered in the scheduler when running qemu on arm64:

 Unable to handle kernel paging request at virtual address ffff000008effe40
 Internal error: Oops: 96000007 [#1] SMP
 Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
 pstate: 20000085 (nzCv daIf -PAN -UAO)
 pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
 lr : move_queued_task.isra.21+0x124/0x298
 ...
 Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to
migrage the process if process is not currently running on any one of the
CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if
CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects
check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if
corresponding bit is coincidentally set. As a consequence, kernel will
access an invalid rq address associate with the invalid CPU in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

The reproduce the crash:

  1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
  sched_setaffinity.

  2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
  and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.

  3) Oops appears if the invalid CPU is set in memory after tested cpumask.

Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
Mathieu Desnoyers
c172e0a3e8 sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
Remove the IPI fallback code from membarrier to deal with very
infrequent cpumask memory allocation failure. Use GFP_KERNEL rather
than GFP_NOWAIT, and relax the blocking guarantees for the expedited
membarrier system call commands, allowing it to block if waiting for
memory to be made available.

In addition, now -ENOMEM can be returned to user-space if the cpumask
memory allocation fails.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-8-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
Mathieu Desnoyers
c6d68c1c4a sched/membarrier: Skip IPIs when mm->mm_users == 1
If there is only a single mm_user for the mm, the private expedited
membarrier command can skip the IPIs, because only a single thread
is using the mm.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-7-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:31 +02:00
Mathieu Desnoyers
227a4aadc7 sched/membarrier: Fix p->mm->membarrier_state racy load
The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Mathieu Desnoyers
09554009c0 sched/membarrier: Remove redundant check
Checking that the number of threads is 1 is redundant with checking
mm_users == 1.

No change in functionality intended.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-3-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Mathieu Desnoyers
fc0d77387c sched/membarrier: Fix private expedited registration check
Fix a logic flaw in the way membarrier_register_private_expedited()
handles ready state checks for private expedited sync core and private
expedited registrations.

If a private expedited membarrier registration is first performed, and
then a private expedited sync_core registration is performed, the ready
state check will skip the second registration when it really should not.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Eric W. Biederman
5311a98fef tasks, sched/core: RCUify the assignment of rq->curr
The current task on the runqueue is currently read with rcu_dereference().

To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
to be paired with rcu_assign_pointer() of rq->curr.  Which provides the
memory barrier necessary to order assignments to the task_struct
and the assignment to rq->curr.

Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been show that additional barriers in that code
will reduce the performance of the scheduler.  So I will attempt to
describe below why you can effectively have ordinary RCU semantics
without any additional barriers.

The assignment of rq->curr in init_idle is a slow path called once
per cpu and that can use rcu_assign_pointer() without any concerns.

As I write this there are effectively two users of rcu_dereference() on
rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
that only looks at "->mm" after the rcu_dereference().  Then there is
task_numa_compare() in kernel/sched/fair.c.  My best reading of the
code shows that task_numa_compare only access: "->flags",
"->cpus_ptr", "->numa_group", "->numa_faults[]",
"->total_numa_faults", and "->se.cfs_rq".

The code in __schedule() essentially does:
	rq_lock(...);
	smp_mb__after_spinlock();

	next = pick_next_task(...);
	rq->curr = next;

	context_switch(prev, next);

At the start of the function the rq_lock/smp_mb__after_spinlock
pair provides a full memory barrier.  Further there is a full memory barrier
in context_switch().

This means that any task that has already run and modified itself (the
common case) has already seen two memory barriers before __schedule()
runs and begins executing.  A task that modifies itself then sees a
third full memory barrier pair with the rq_lock();

For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers present from the taking and release the
pi_lock and the rq_lock as the processes is enqueued as well as the
full memory barrier at the start of __schedule() assuming __schedule()
happens on the same cpu.

This means that by the time we reach the assignment of rq->curr
except for values on the task struct modified in pick_next_task
the code has the same guarantees as if it used rcu_assign_pointer().

Reading through all of the implementations of pick_next_task it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", "->dl".  These fields are the sched_entity structures
of the varies schedulers.

Further "->se.cfs_rq" is only changed in cgroup attach/move operations
initialized by userspace.

Unless I have missed something this means that in practice that the
users of "rcu_dereference(rq->curr)" get normal RCU semantics of
rcu_dereference() for the fields the care about, despite the
assignment of rq->curr in __schedule() ot using rcu_assign_pointer.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman
154abafc68 tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
Remove work arounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular now that there tasks exiting the runqueue exprience
a RCU grace period none of the work performed by task_rcu_dereference()
excpet the rcu_dereference() is necessary so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience a RCU grace period after it
leaves the run queueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf86 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman
0ff7b2cfba tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
In the ordinary case today the RCU grace period for a task_struct is
triggered when another process wait's for it's zombine and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task as been removed from the runqueue.

Unfortunaty in some cases such as self reaping tasks it can be shown
that release_task() will be called starting the grace period for
task_struct long before the task leaves the runqueue.

Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that the there is a RCU lifetime after the task
leaves the runqueue.

Besides the change in the start of the RCU grace period for the
task_struct this change may cause perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never show happen.  So
I don't see any problem with delaying it.

The function trace_sched_process_free is a trace point and thus
visible to user space.  Occassionally userspace has the strangest
dependencies so this has a miniscule chance of causing a regression.
This change only changes the timing of when the tracepoint is called.
The change in timing arguably gives userspace a more accurate picture
of what is going on.  So I don't expect there to be a regression.

In the case where a task self reaps we are pretty much guaranteed that
the RCU grace period is delayed.  So we should get quite a bit of
coverage in of this worst case for the change in a normal threaded
workload.  So I expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Eric W. Biederman
3fbd7ee285 tasks: Add a count of task RCU users
Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg have the count overlap rcu_head so that no
additional space in task_struct is required.

Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:29 +02:00
Srikar Dronamraju
f8d7ab2bde tracing/probe: Fix same probe event argument matching
Commit fe60b0ce8e ("tracing/probe: Reject exactly same probe event")
tries to reject a event which matches an already existing probe.

However it currently continues to match arguments and rejects adding a
probe even when the arguments don't match. Fix this by only rejecting a
probe if and only if all the arguments match.

Link: http://lkml.kernel.org/r/20190924114906.14038-1-srikar@linux.vnet.ibm.com

Fixes: fe60b0ce8e ("tracing/probe: Reject exactly same probe event")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-25 06:34:06 -04:00
Wanpeng Li
89340d0935 Revert "locking/pvqspinlock: Don't wait if vCPU is preempted"
This patch reverts commit 75437bb304 (locking/pvqspinlock: Don't
wait if vCPU is preempted).  A large performance regression was caused
by this commit.  on over-subscription scenarios.

The test was run on a Xeon Skylake box, 2 sockets, 40 cores, 80 threads,
with three VMs of 80 vCPUs each.  The score of ebizzy -M is reduced from
13000-14000 records/s to 1700-1800 records/s:

          Host                Guest                score

vanilla w/o kvm optimizations     upstream    1700-1800 records/s
vanilla w/o kvm optimizations     revert      13000-14000 records/s
vanilla w/ kvm optimizations      upstream    4500-5000 records/s
vanilla w/ kvm optimizations      revert      14000-15500 records/s

Exit from aggressive wait-early mechanism can result in premature yield
and extra scheduling latency.

Actually, only 6% of wait_early events are caused by vcpu_is_preempted()
being true.  However, when one vCPU voluntarily releases its vCPU, all
the subsequently waiters in the queue will do the same and the cascading
effect leads to bad performance.

kvm optimizations:
[1] commit d73eb57b80 (KVM: Boost vCPUs that are delivering interrupts)
[2] commit 266e85a5ec (KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption)

Tested-by: loobinliu@tencent.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: loobinliu@tencent.com
Cc: stable@vger.kernel.org
Fixes: 75437bb304 (locking/pvqspinlock: Don't wait if vCPU is preempted)
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-25 10:22:37 +02:00
Linus Torvalds
9c9fa97a8e Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few hot fixes

 - ocfs2 updates

 - almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
   cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
   sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
   oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
   zsmalloc)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  mm/zsmalloc.c: fix a -Wunused-function warning
  zswap: do not map same object twice
  zswap: use movable memory if zpool support allocate movable memory
  zpool: add malloc_support_movable to zpool_driver
  shmem: fix obsolete comment in shmem_getpage_gfp()
  mm/madvise: reduce code duplication in error handling paths
  mm: mmap: increase sockets maximum memory size pgoff for 32bits
  mm/mmap.c: refine find_vma_prev() with rb_last()
  riscv: make mmap allocation top-down by default
  mips: use generic mmap top-down layout and brk randomization
  mips: replace arch specific way to determine 32bit task with generic version
  mips: adjust brk randomization offset to fit generic version
  mips: use STACK_TOP when computing mmap base address
  mips: properly account for stack randomization and stack guard gap
  arm: use generic mmap top-down layout and brk randomization
  arm: use STACK_TOP when computing mmap base address
  arm: properly account for stack randomization and stack guard gap
  arm64, mm: make randomization selected by generic topdown mmap layout
  arm64, mm: move generic mmap layout functions to mm
  arm64: consider stack randomization for mmap base only when necessary
  ...
2019-09-24 16:10:23 -07:00
Alexandre Ghiti
67f3977f80 arm64, mm: move generic mmap layout functions to mm
arm64 handles top-down mmap layout in a way that can be easily reused by
other architectures, so make it available in mm.  It then introduces a new
config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other
architectures to benefit from those functions.  Note that this new config
depends on MMU being enabled, if selected without MMU support, a warning
will be thrown.

Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
f385cb85a4 uprobe: collapse THP pmd after removing all uprobes
After all uprobes are removed from the huge page (with PTE pgtable), it is
possible to collapse the pmd and benefit from THP again.  This patch does
the collapse by calling collapse_pte_mapped_thp().

Link: http://lkml.kernel.org/r/20190815164525.1848545-7-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
5a52c9df62 uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT
Use the newly added FOLL_SPLIT_PMD in uprobe.  This preserves the huge
page when the uprobe is enabled.  When the uprobe is disabled, newer
instances of the same application could still benefit from huge page.

For the next step, we will enable khugepaged to regroup the pmd, so that
existing instances of the application could also benefit from huge page
after the uprobe is disabled.

Link: http://lkml.kernel.org/r/20190815164525.1848545-5-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu
fb4fb04ff4 uprobe: use original page when all uprobes are removed
Currently, uprobe swaps the target page with a anonymous page in both
install_breakpoint() and remove_breakpoint().  When all uprobes on a page
are removed, the given mm is still using an anonymous page (not the
original page).

This patch allows uprobe to use original page when possible (all uprobes
on the page are already removed, and the original page is in page cache
and uptodate).

As suggested by Oleg, we unmap the old_page and let the original page
fault in.

Link: http://lkml.kernel.org/r/20190815164525.1848545-3-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
David Hildenbrand
00ff9a91bd mm/memory_hotplug.c: use PFN_UP / PFN_DOWN in walk_system_ram_range()
Patch series "mm/memory_hotplug: online_pages() cleanups", v2.

Some cleanups (+ one fix for a special case) in the context of
online_pages().

This patch (of 5):

This makes it clearer that we will never call func() with duplicate PFNs
in case we have multiple sub-page memory resources.  All unaligned parts
of PFNs are completely discarded.

Link: http://lkml.kernel.org/r/20190814154109.3448-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Arun KS <arunks@codeaurora.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Nicholas Piggin
13224794cb mm: remove quicklist page table caches
Patch series "mm: remove quicklist page table caches".

A while ago Nicholas proposed to remove quicklist page table caches [1].

I've rebased his patch on the curren upstream and switched ia64 and sh to
use generic versions of PTE allocation.

[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

This patch (of 3):

Remove page table allocator "quicklists".  These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.

The numbers in the initial commit look interesting but probably don't
apply anymore.  If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.

Also it might be better to instead make more general improvements to page
allocator if this is still so slow.

Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Linus Torvalds
0b36c9eed2 Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more mount API conversions from Al Viro:
 "Assorted conversions of options parsing to new API.

  gfs2 is probably the most serious one here; the rest is trivial stuff.

  Other things in what used to be #work.mount are going to wait for the
  next cycle (and preferably go via git trees of the filesystems
  involved)"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2: Convert gfs2 to fs_context
  vfs: Convert spufs to use the new mount API
  vfs: Convert hypfs to use the new mount API
  hypfs: Fix error number left in struct pointer member
  vfs: Convert functionfs to use the new mount API
  vfs: Convert bpf to use the new mount API
2019-09-24 12:33:34 -07:00
Vitaly Kuznetsov
e1572f1d08 cpu/SMT: create and export cpu_smt_possible()
KVM needs to know if SMT is theoretically possible, this means it is
supported and not forcefully disabled ('nosmt=force'). Create and
export cpu_smt_possible() answering this question.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-24 13:37:28 +02:00
Linus Torvalds
9f7582d15f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching fix from Jiri Kosina:
 "Error handling fix in livepatching module notifier, from Miroslav
  Benes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Nullify obj->mod in klp_module_coming()'s error path
2019-09-23 12:21:03 -07:00
Linus Torvalds
e070355664 Modules updates for v5.4
Summary of modules changes for the 5.4 merge window:
 
 - Introduce exported symbol namespaces.
 
   This new feature allows subsystem maintainers to partition and
   categorize their exported symbols into explicit namespaces. Module
   authors are now required to import the namespaces they need.
 
   Some of the main motivations of this feature include: allowing kernel
   developers to better manage the export surface, allow subsystem
   maintainers to explicitly state that usage of some exported symbols
   should only be limited to certain users (think: inter-module or
   inter-driver symbols, debugging symbols, etc), as well as more easily
   limiting the availability of namespaced symbols to other parts of the
   kernel. With the module import requirement, it is also easier to spot
   the misuse of exported symbols during patch review. Two new macros are
   introduced: EXPORT_SYMBOL_NS() and EXPORT_SYMBOL_NS_GPL(). The API is
   thoroughly documented in Documentation/kbuild/namespaces.rst.
 
 - Some small code and kbuild cleanups here and there.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJdh3n8AAoJEMBFfjjOO8Fy94kP+QHZF39QDvLbxAzEYAETAS+o
 CFu6wix/DrAwFkTU/kX1eAsAwDBEz0xkMciR4BsLX3sIafUVERxtDXVAui/dA1+6
 zfw2c3ObyVwPEk6aUPFprgkj+08gxujsJFlYTsQQUhtRbmxg6R7hD6t6ANxiHaY2
 AQe5TzOWXoIa2hHO+7rPMqf8l6qiFCaL0s3v5jrmBXa5mHmc4PVy95h1J6xQVw2u
 b+SlvKeylHv+OtCtvthkAJS3hfS35J/1TNb/RNYIvh60IfEguEuFsGuQ9JiSSAZS
 pv1cJ+I5d4v8Y/md1rZpdjTJL9gCrq/UUC67+UkejCOn0C+7XM2eR4Bu/jWvdMSn
 ZQDHcPhFSIfmP7FaKomPogaBbw1sI1FvM5930pPJzHnyO9+cefBXe7rWaaB+y0At
 GAxOtmk1dKh01BT7YO/C0oVuX87csWd74NHypVsbs0TgQo5jBFdZRheyDrq5YB+s
 tVK+5H0nqQrCcfo/TvhcsZlgITTGtgTPenaW99/i7qNa9mRUtxC/VkE+aob6HNRF
 1iBxxopOTxGN8akyKOVumtkuTQH3EJfouZee//pWbXLzyDmScg/k67vuao8kxbyq
 NA1piFAGJAHFsHATxrbvNOq6jZ5bfUT8pwSTs83JppuR++8Hxk7zaShS3/LvsvHt
 6ist/epOwTZ7oiNQ04nj
 =72Uy
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:
 "The main bulk of this pull request introduces a new exported symbol
  namespaces feature. The number of exported symbols is increasingly
  growing with each release (we're at about 31k exports as of 5.3-rc7)
  and we currently have no way of visualizing how these symbols are
  "clustered" or making sense of this huge export surface.

  Namespacing exported symbols allows kernel developers to more
  explicitly partition and categorize exported symbols, as well as more
  easily limiting the availability of namespaced symbols to other parts
  of the kernel. For starters, we have introduced the USB_STORAGE
  namespace to demonstrate the API's usage. I have briefly summarized
  the feature and its main motivations in the tag below.

  Summary:

   - Introduce exported symbol namespaces.

     This new feature allows subsystem maintainers to partition and
     categorize their exported symbols into explicit namespaces. Module
     authors are now required to import the namespaces they need.

     Some of the main motivations of this feature include: allowing
     kernel developers to better manage the export surface, allow
     subsystem maintainers to explicitly state that usage of some
     exported symbols should only be limited to certain users (think:
     inter-module or inter-driver symbols, debugging symbols, etc), as
     well as more easily limiting the availability of namespaced symbols
     to other parts of the kernel.

     With the module import requirement, it is also easier to spot the
     misuse of exported symbols during patch review.

     Two new macros are introduced: EXPORT_SYMBOL_NS() and
     EXPORT_SYMBOL_NS_GPL(). The API is thoroughly documented in
     Documentation/kbuild/namespaces.rst.

   - Some small code and kbuild cleanups here and there"

* tag 'modules-for-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: Remove leftover '#undef' from export header
  module: remove unneeded casts in cmp_name()
  module: move CONFIG_UNUSED_SYMBOLS to the sub-menu of MODULES
  module: remove redundant 'depends on MODULES'
  module: Fix link failure due to invalid relocation on namespace offset
  usb-storage: export symbols in USB_STORAGE namespace
  usb-storage: remove single-use define for debugging
  docs: Add documentation for Symbol Namespaces
  scripts: Coccinelle script for namespace dependencies.
  modpost: add support for generating namespace dependencies
  export: allow definition default namespaces in Makefiles or sources
  module: add config option MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS
  modpost: add support for symbol namespaces
  module: add support for symbol namespaces.
  export: explicitly align struct kernel_symbol
  module: support reading multiple values per modinfo tag
2019-09-22 10:34:46 -07:00
Linus Torvalds
9dca3432ee This pull request contains the following changes for UML:
- virtio support
 - Fixes for our new time travel mode
 - Various improvements to make lockdep and kasan work better
 - SPDX header updates
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl2Fx9kWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wa2kD/9UJ5JOe6yBeMfPO5Vv8vpJRc10
 0gS8qDbzfutrWddq1wUvEaNCIQY4NOf4tsqjauYHpTUA/0AWwruz++iyI9u3XWEQ
 0b+ZMhKXkws3UgPwWIxrgLr0106wz6Xuz6d36nqpAc6F4MJhC3LqUCC9yEp3hxMd
 pSF65ueQXp7NKfOAqqKU1m3FnfmyBTpsL5PpA6OEZn//kt/Qz5PhIjHpC3JwIBQb
 z0OUhE/6mmWb66wtqHIx4Zd2ybLLnsfby24q+1e8J2B+gcORxhubvgCIGY+PU98o
 EW3N4aMevUdgG9MJbnlZUgWeZ1bsByail2z8aFElRKefT2xkEnjxfQZgKahI6LnO
 jzLm9pk3RjTiZxvYkEbgRAjBkZD514M6FvOlyrHtLxMDfWE6/z71VKDqFjEyeIHQ
 QpDjwEjdJTxVHr4Ol+VnZe1lE5zXLNuCFT5qdPQBqyr8g151T7jwYXnGK2SqGo2D
 UQ6/KnaN+pgM7BaqcNtwciKk3Xjng0BDLfdZs7z8F3bzv53rg2mpQt5iPm+nWFPa
 aNt4B3FKXv3+YnjuSbi5NlvKKK9alRcvZTOk8jFjwOVmFJXlvMCzegZnuTxtqU+j
 XpwmUlsT6aMV7vPZN2ta7y1bjOijzZIjL0O7rP4Obxwfp3dTGGYX/T6vW8F2o9V6
 evyx/KSD6nqlY1bvwQ==
 =oxpp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Richard Weinberger:

 - virtio support

 - fixes for our new time travel mode

 - various improvements to make lockdep and kasan work better

 - SPDX header updates

* tag 'for-linus-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (25 commits)
  um: irq: Fix LAST_IRQ usage in init_IRQ()
  um: Add SPDX headers for files in arch/um/include
  um: Add SPDX headers for files in arch/um/os-Linux
  um: Add SPDX headers to files in arch/um/kernel/
  um: Add SPDX headers for files in arch/um/drivers
  um: virtio: Implement VHOST_USER_PROTOCOL_F_REPLY_ACK
  um: virtio: Implement VHOST_USER_PROTOCOL_F_SLAVE_REQ
  um: drivers: Add virtio vhost-user driver
  um: Use real DMA barriers
  um: Don't use generic barrier.h
  um: time-travel: Restrict time update in IRQ handler
  um: time-travel: Fix periodic timers
  um: Enable CONFIG_CONSTRUCTORS
  um: Place (soft)irq text with macros
  um: Fix VDSO compiler warning
  um: Implement TRACE_IRQFLAGS_SUPPORT
  um: Remove misleading #define ARCh_IRQ_ENABLED
  um: Avoid using uninitialized regs
  um: Remove sig_info[SIGALRM]
  um: Error handling fixes in vector drivers
  ...
2019-09-21 11:07:02 -07:00
Linus Torvalds
84da111de0 hmm related patches for 5.4
This is more cleanup and consolidation of the hmm APIs and the very
 strongly related mmu_notifier interfaces. Many places across the tree
 using these interfaces are touched in the process. Beyond that a cleanup
 to the page walker API and a few memremap related changes round out the
 series:
 
 - General improvement of hmm_range_fault() and related APIs, more
   documentation, bug fixes from testing, API simplification &
   consolidation, and unused API removal
 
 - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and
   make them internal kconfig selects
 
 - Hoist a lot of code related to mmu notifier attachment out of drivers by
   using a refcount get/put attachment idiom and remove the convoluted
   mmu_notifier_unregister_no_release() and related APIs.
 
 - General API improvement for the migrate_vma API and revision of its only
   user in nouveau
 
 - Annotate mmu_notifiers with lockdep and sleeping region debugging
 
 Two series unrelated to HMM or mmu_notifiers came along due to
 dependencies:
 
 - Allow pagemap's memremap_pages family of APIs to work without providing
   a struct device
 
 - Make walk_page_range() and related use a constant structure for function
   pointers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl1/nnkACgkQOG33FX4g
 mxqaRg//c6FqowV1pQlLutvAOAgMdpzfZ9eaaDKngy9RVQxz+k/MmJrdRH/p/mMA
 Pq93A1XfwtraGKErHegFXGEDk4XhOustVAVFwvjyXO41dTUdoFVUkti6ftbrl/rS
 6CT+X90jlvrwdRY7QBeuo7lxx7z8Qkqbk1O1kc1IOracjKfNJS+y6LTamy6weM3g
 tIMHI65PkxpRzN36DV9uCN5dMwFzJ73DWHp1b0acnDIigkl6u5zp6orAJVWRjyQX
 nmEd3/IOvdxaubAoAvboNS5CyVb4yS9xshWWMbH6AulKJv3Glca1Aa7QuSpBoN8v
 wy4c9+umzqRgzgUJUe1xwN9P49oBNhJpgBSu8MUlgBA4IOc3rDl/Tw0b5KCFVfkH
 yHkp8n6MP8VsRrzXTC6Kx0vdjIkAO8SUeylVJczAcVSyHIo6/JUJCVDeFLSTVymh
 EGWJ7zX2iRhUbssJ6/izQTTQyCH3YIyZ5QtqByWuX2U7ZrfkqS3/EnBW1Q+j+gPF
 Z2yW8iT6k0iENw6s8psE9czexuywa/Lttz94IyNlOQ8rJTiQqB9wLaAvg9hvUk7a
 kuspL+JGIZkrL3ouCeO/VA6xnaP+Q7nR8geWBRb8zKGHmtWrb5Gwmt6t+vTnCC2l
 olIDebrnnxwfBQhEJ5219W+M1pBpjiTpqK/UdBd92A4+sOOhOD0=
 =FRGg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull hmm updates from Jason Gunthorpe:
 "This is more cleanup and consolidation of the hmm APIs and the very
  strongly related mmu_notifier interfaces. Many places across the tree
  using these interfaces are touched in the process. Beyond that a
  cleanup to the page walker API and a few memremap related changes
  round out the series:

   - General improvement of hmm_range_fault() and related APIs, more
     documentation, bug fixes from testing, API simplification &
     consolidation, and unused API removal

   - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
     and make them internal kconfig selects

   - Hoist a lot of code related to mmu notifier attachment out of
     drivers by using a refcount get/put attachment idiom and remove the
     convoluted mmu_notifier_unregister_no_release() and related APIs.

   - General API improvement for the migrate_vma API and revision of its
     only user in nouveau

   - Annotate mmu_notifiers with lockdep and sleeping region debugging

  Two series unrelated to HMM or mmu_notifiers came along due to
  dependencies:

   - Allow pagemap's memremap_pages family of APIs to work without
     providing a struct device

   - Make walk_page_range() and related use a constant structure for
     function pointers"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
  libnvdimm: Enable unit test infrastructure compile checks
  mm, notifier: Catch sleeping/blocking for !blockable
  kernel.h: Add non_block_start/end()
  drm/radeon: guard against calling an unpaired radeon_mn_unregister()
  csky: add missing brackets in a macro for tlb.h
  pagewalk: use lockdep_assert_held for locking validation
  pagewalk: separate function pointers from iterator data
  mm: split out a new pagewalk.h header from mm.h
  mm/mmu_notifiers: annotate with might_sleep()
  mm/mmu_notifiers: prime lockdep
  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
  mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
  mm/hmm: hmm_range_fault() infinite loop
  mm/hmm: hmm_range_fault() NULL pointer bug
  mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
  mm/mmu_notifiers: remove unregister_no_release
  RDMA/odp: remove ib_ucontext from ib_umem
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  ...
2019-09-21 10:07:42 -07:00
Linus Torvalds
56c1e83434 Printk changes for 5.4
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl1/akAACgkQUqAMR0iA
 lPLK+Q//fDKalQodEQU2qwP2eLaVwCn5lKu0ab29JwQgA21i5f1AUibPBLvQoHYX
 CyclUSzylz0AO37V1nSWwmDs80TY0nhbgQUURglKQQpkOyZiU/7FtQnVLm2qUUgp
 Jd6hjbFEGQUdfaY21MXzCY5XgYWIPQ7I0exFBd17cbj4ad9mjs28SSLwtK+ogSxL
 S9pfZdB9DWJBeA+d0N9cA6Q4V+u2w4JPcaoYcsPcxbyf3oxxQJZF6FCkaJjXRG+d
 NDzRKfB5+vBJqLFs8/wjiokuGl7YSJwBEe+q02JrSMEBfJb/9+9Uh8NUy6yIFCdL
 DmZaK6zasb9IW6vGmpfVO2fHoSwXYF1Le8YtEzD37+HZRYFnv6+AvpsOgr7hGPpR
 7Ci6U6yEckMxwtKxiPyf+isUV9R8fPWpWnCdz/jVIVTlSP/o3tA8RG8S+cOd66pK
 /TF2DZoMXdwcpB+MgCeCusM/f8jNkAYQPh74SHZQGcY8gdxu/hkv0URU1IVEkdku
 o/+Y2HvgPh2Pd3fqOQU1QKSUCQ0OcVyAt4fg59YpOQTDr6TMZ6xy/TEuFRzzAe8e
 LNLxsGrN1Kqn3JGTOAZgPnbFc5N1eP9/GoGh8+nQaNJe202QjbTUIKEOioOXu5VR
 aFGPvD5aEdlXj+Ms/Igv8qc5/ogXAg4afkqC8Hkk8XaUXHqt3ZI=
 =Ui4A
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Fix off-by-one error when calculating messages that might fit into
   kmsg buffer. It causes occasional omitting of the last message.

 - Add missing pointer check in %pD format modifier handling.

 - Some clean up

* tag 'printk-for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  ABI: Update dev-kmsg documentation to match current kernel behaviour
  printk: Replace strncmp() with str_has_prefix()
  lib/test_printf: Remove obvious comments from %pd and %pD tests
  lib/test_printf: Add test of null/invalid pointer dereference for dentry
  vsprintf: Prevent crash when dereferencing invalid pointers for %pD
  printk: Do not lose last line in kmsg buffer dump
2019-09-21 09:34:29 -07:00
Tejun Heo
30ae2fc0a7 workqueue: Minor follow-ups to the rescuer destruction change
* Now that wq->rescuer may be cleared while rescuer is still there,
  switch show_pwq() debug printout to test worker->rescue_wq to
  identify rescuers intead of testing wq->rescuer.

* Update comment on ->rescuer locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
2019-09-20 14:09:14 -07:00
Tejun Heo
8efe1223d7 workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Qian Cai <cai@lca.pw>
Fixes: def98c84b6 ("workqueue: Fix spurious sanity check failures in destroy_workqueue()")
2019-09-20 13:39:57 -07:00
Roy Ben Shlomo
9f014e3a66 perf/core: Fix several typos in comments
Fix typos in a few functions' documentation comments.

Signed-off-by: Roy Ben Shlomo <royb@sentinelone.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: royb@sentinelone.com
Link: http://lore.kernel.org/lkml/20190920171254.31373-1-royb@sentinelone.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2019-09-20 16:05:20 -03:00
Linus Torvalds
45824fc0da powerpc updates for 5.4
- Initial support for running on a system with an Ultravisor, which is software
    that runs below the hypervisor and protects guests against some attacks by
    the hypervisor.
 
  - Support for building the kernel to run as a "Secure Virtual Machine", ie. as
    a guest capable of running on a system with an Ultravisor.
 
  - Some changes to our DMA code on bare metal, to allow devices with medium
    sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space.
 
  - Support for firmware assisted crash dumps on bare metal (powernv).
 
  - Two series fixing bugs in and refactoring our PCI EEH code.
 
  - A large series refactoring our exception entry code to use gas macros, both
    to make it more readable and also enable some future optimisations.
 
 As well as many cleanups and other minor features & fixups.
 
 Thanks to:
   Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh
   Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin
   Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy,
   Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens,
   David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar,
   Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari
   Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras,
   Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
   Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor,
   Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram
   Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj,
   Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung
   Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl2EtEcTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPfsD/9uXyBXn3anI/H08+mk74k5gCsmMQpn
 D442CD/ByogZcccp23yBTlhawtCE03hcHnCLygn0Xgd8a4YvHts/RGHUe3fPHqlG
 bEyZ7jsLVz5ebNZQP7r4eGs2pSzCajwJy2N9HJ/C1ojf15rrfRxoVJtnyhE2wXpm
 DL+6o2K+nUCB3gTQ1Inr3DnWzoGOOUfNTOea2u+J+yfHwGRqOBYpevwqiwy5eelK
 aRjUJCqMTvrzra49MeFwjo0Nt3/Y8UNcwA+JlGdeR8bRuWhFrYmyBRiZEKPaujNO
 5EAfghBBlB0KQCqvF/tRM/c0OftHqK59AMobP9T7u9oOaBXeF/FpZX/iXjzNDPsN
 j9Oo2tKLTu/YVEXqBFuREGP+znANr1Wo4CFyOG8SbvYz0HFjR6XbtRJsS+0e8GWl
 kqX5/ZhYz3lBnKSNe9jgWOrh/J0KCSFigBTEWJT3xsn4YE8x8kK2l9KPqAIldWEP
 sKb2UjGS7v0NKq+NvShH88Q9AeQUEIjTcg/9aDDQDe6FaRQ7KiF8bUxSdwSPi+Fn
 j0lnF6i+1ATWZKuCr85veVi7C5qoe/+MqalnmP7MxULyzgXLLxUgN0SzEYO6QofK
 LQK/VaH2XVr5+M5YAb7K4/NX5gbM3s1bKrCiUy4EyHNvgG7gricYdbz6HgAjKpR7
 oP0rHfgmVYvF1g==
 =WlW+
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This is a bit late, partly due to me travelling, and partly due to a
  power outage knocking out some of my test systems *while* I was
  travelling.

   - Initial support for running on a system with an Ultravisor, which
     is software that runs below the hypervisor and protects guests
     against some attacks by the hypervisor.

   - Support for building the kernel to run as a "Secure Virtual
     Machine", ie. as a guest capable of running on a system with an
     Ultravisor.

   - Some changes to our DMA code on bare metal, to allow devices with
     medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
     DMA space.

   - Support for firmware assisted crash dumps on bare metal (powernv).

   - Two series fixing bugs in and refactoring our PCI EEH code.

   - A large series refactoring our exception entry code to use gas
     macros, both to make it more readable and also enable some future
     optimisations.

  As well as many cleanups and other minor features & fixups.

  Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
  Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
  Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
  JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
  Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
  Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
  Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
  Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
  Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
  Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
  Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
  O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
  Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
  Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
  Lendacky, Vasant Hegde"

* tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
  powerpc/mm/mce: Keep irqs disabled during lockless page table walk
  powerpc: Use ftrace_graph_ret_addr() when unwinding
  powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
  ftrace: Look up the address of return_to_handler() using helpers
  powerpc: dump kernel log before carrying out fadump or kdump
  docs: powerpc: Add missing documentation reference
  powerpc/xmon: Fix output of XIVE IPI
  powerpc/xmon: Improve output of XIVE interrupts
  powerpc/mm/radix: remove useless kernel messages
  powerpc/fadump: support holes in kernel boot memory area
  powerpc/fadump: remove RMA_START and RMA_END macros
  powerpc/fadump: update documentation about option to release opalcore
  powerpc/fadump: consider f/w load area
  powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
  powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
  powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
  powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
  powerpc/fadump: improve how crashed kernel's memory is reserved
  powerpc/fadump: consider reserved ranges while releasing memory
  powerpc/fadump: make crash memory ranges array allocation generic
  ...
2019-09-20 11:48:06 -07:00
Linus Torvalds
45979a956b Tracing updates:
- Addition of multiprobes to kprobe and uprobe events
    Allows for more than one probe attached to the same location
 
  - Addition of adding immediates to probe parameters
 
  - Clean up of the recordmcount.c code. This brings us closer
    to merging recordmcount into objtool, and reuse code.
 
  - Other small clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXYQoqhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlIxAP9VVABbpuvOYqxKuFgyP62ituSXPLkL
 gZv4I5Zse4b6/gD/eksFXY/OHo7jp6aQiHvxotUkAiFFE9iHzi0JscdMJgo=
 =WqrT
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Addition of multiprobes to kprobe and uprobe events (allows for more
   than one probe attached to the same location)

 - Addition of adding immediates to probe parameters

 - Clean up of the recordmcount.c code. This brings us closer to merging
   recordmcount into objtool, and reuse code.

 - Other small clean ups

* tag 'trace-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
  selftests/ftrace: Update kprobe event error testcase
  tracing/probe: Reject exactly same probe event
  tracing/probe: Fix to allow user to enable events on unloaded modules
  selftests/ftrace: Select an existing function in kprobe_eventname test
  tracing/kprobe: Fix NULL pointer access in trace_porbe_unlink()
  tracing: Make sure variable reference alias has correct var_ref_idx
  tracing: Be more clever when dumping hex in __print_hex()
  ftrace: Simplify ftrace hash lookup code in clear_func_from_hash()
  tracing: Add "gfp_t" support in synthetic_events
  tracing: Rename tracing_reset() to tracing_reset_cpu()
  tracing: Document the stack trace algorithm in the comments
  tracing/arm64: Have max stack tracer handle the case of return address after data
  recordmcount: Clarify what cleanup() does
  recordmcount: Remove redundant cleanup() calls
  recordmcount: Kernel style formatting
  recordmcount: Kernel style function signature formatting
  recordmcount: Rewrite error/success handling
  selftests/ftrace: Add syntax error test for multiprobe
  selftests/ftrace: Add syntax error test for immediates
  selftests/ftrace: Add a testcase for kprobe multiprobe event
  ...
2019-09-20 11:19:48 -07:00
Linus Torvalds
3207598ab0 kgdb patches for 5.4-rc1
It has been a quiet dev cycle for kgdb. There has been some good stuff
 for kdb on the mailing list but unfortunately the patches caused a
 couple of problems with the kdb pager so I had to drop those and they
 will have to wait for next time!
 
 That just leaves us with just a couple of very tiny clean ups for now:
 
  * Fix a broken comment
  * Use str_has_prefix() for the grep "pipe" in kdb
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl2Do/AACgkQfOMlXTn3
 iKF8Lw//cpxBlzZlBpkWDS7Uiuvsrzy8M7vZndYNsrnY89kjYOm0JynIT+8CZFXk
 tLKWyL3OZxUTWUvouTSeqyJAbLJYOB3wBPCMy7aVg7UAaoeEvRzN/xipM21sa4Hm
 1eVP51EMuNTaLzn4jgZV5Ad+ruSnj8Ua3TxK5qc2wYIuJvULnVzsSQxKcQopn2tk
 LqIEwo0wvPVkb865z3D0KAOxqkftJHa4RdwNAyDy6xCQH/RlHR06NshKEN1g3j0I
 0yKd/LBW7Z5yBMFpf1EOv3euF0URu9cmDP4o+n5Tyx71UbZ++NqiPFAAV2zWZq42
 3sCjgLgPm+JIT4aMFEFlJn05xAwJEs3bpglGf1BTN2lFyIZZKD+lBpaAIA7AJtfi
 oc/pJtmy9qAaACVe+l6s4q9YDZjcz9YZ1KZJuAUaVSc/L7AaZRC64CmzoLealZ1x
 bKHvJ0FEKbwVC/JSF7ILKsb0qzyiQFAjuNfnqAX0KKrxOC42IukQy90eDC5JeWhd
 yYAaUjtjeIKVn6wZYyki8dwmK2LPlAbathAQxNDYdYLyteTOeTyuZS7X8V0aLMDN
 VfbOylp9SasZX80vHjhI+gZv523ljDj/Ec0o1kHaswpAELU+oVbdp91vuSvi4lPn
 MmrNS2o3QZMlGMvLupSqmp+nQHZFmhoTWbxa84gYptPxwk2vipI=
 =oIVA
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "It has been a quiet dev cycle for kgdb. There has been some good stuff
  for kdb on the mailing list but unfortunately the patches caused a
  couple of problems with the kdb pager so I had to drop those and they
  will have to wait for next time!

  That just leaves us with just a couple of very tiny clean ups for now:

   - Fix a broken comment

   - Use str_has_prefix() for the grep "pipe" in kdb"

* tag 'kgdb-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb: fix comment regarding static function
  kdb: Replace strncmp with str_has_prefix
2019-09-20 10:31:19 -07:00
Linus Torvalds
d7b0827f28 Kbuild updates for v5.4
- add modpost warn exported symbols marked as 'static' because 'static'
    and EXPORT_SYMBOL is an odd combination
 
  - break the build early if gold linker is used
 
  - optimize the Bison rule to produce .c and .h files by a single
    pattern rule
 
  - handle PREEMPT_RT in the module vermagic and UTS_VERSION
 
  - warn CONFIG options leaked to the user-space except existing ones
 
  - make single targets work properly
 
  - rebuild modules when module linker scripts are updated
 
  - split the module final link stage into scripts/Makefile.modfinal
 
  - fix the missed error code in merge_config.sh
 
  - improve the error message displayed on the attempt of the O= build
    in unclean source tree
 
  - remove 'clean-dirs' syntax
 
  - disable -Wimplicit-fallthrough warning for Clang
 
  - add CONFIG_CC_OPTIMIZE_FOR_SIZE_O3 for ARC
 
  - remove ARCH_{CPP,A,C}FLAGS variables
 
  - add $(BASH) to run bash scripts
 
  - change *CFLAGS_<basetarget>.o to take the relative path to $(obj)
    instead of the basename
 
  - stop suppressing Clang's -Wunused-function warnings when W=1
 
  - fix linux/export.h to avoid genksyms calculating CRC of trimmed
    exported symbols
 
  - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl1+OnoeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGoKEQAKcid9lDacMe5KWT
 4Ic93hANMFKZ9Qy8WoxivnOr1a93NcloZ0Bhka96QUt7hYUkLmDCs99eMbxKuMfP
 m/ViHepojOBPzq+VtAGWOiIyPMCA7XDrTPph4wcPDKeOURTreK1PZ20fxDoAR4to
 +qaqKZJGdRcNf2DpJN1yIosz8Wj0Sa2LQrRi9jgUHi3bzgvLfL7P9WM2xyZMggAc
 GaSktCEFL0UzMFlMpYyDrKh2EV6ryOnN8+bVAKbmWP89tuU3njutycKdWOoL+bsj
 tH2kjFThxQyIcZGNHS1VzNunYAFE2q5nj2q47O1EDN6sjTYUoRn5cHwPam6x3Kly
 NH88xDEtJ7sUUc9GZEIXADWWD0f08QIhAH5x+jxFg3529lNgyrNHRSQ2XceYNAnG
 i/GnMJ0EhODOFKusXw7sNlWFKtukep+8/pwnvfTXWQu6plEm5EQ3a3RL5SESubVo
 mHzXsQDFCE0x/UrsJxEAww+3YO3pQEelfVi74W9z0cckpbRF8FuUq/69ltOT15l4
 X+gCz80lXMWBKw/kNoR4GQoAJo3KboMEociawwoj72HXEHTPLJnCdUOsAf3n+opj
 xuz/UPZ4WYSgKdnbmmDbJ+1POA1NqtARZZXpMVyKVVCOiLafbJkLQYwLKEpE2mOO
 TP9igzP1i3/jPWec8cJ6Fa8UwuGh
 =VGqV
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - add modpost warn exported symbols marked as 'static' because 'static'
   and EXPORT_SYMBOL is an odd combination

 - break the build early if gold linker is used

 - optimize the Bison rule to produce .c and .h files by a single
   pattern rule

 - handle PREEMPT_RT in the module vermagic and UTS_VERSION

 - warn CONFIG options leaked to the user-space except existing ones

 - make single targets work properly

 - rebuild modules when module linker scripts are updated

 - split the module final link stage into scripts/Makefile.modfinal

 - fix the missed error code in merge_config.sh

 - improve the error message displayed on the attempt of the O= build in
   unclean source tree

 - remove 'clean-dirs' syntax

 - disable -Wimplicit-fallthrough warning for Clang

 - add CONFIG_CC_OPTIMIZE_FOR_SIZE_O3 for ARC

 - remove ARCH_{CPP,A,C}FLAGS variables

 - add $(BASH) to run bash scripts

 - change *CFLAGS_<basetarget>.o to take the relative path to $(obj)
   instead of the basename

 - stop suppressing Clang's -Wunused-function warnings when W=1

 - fix linux/export.h to avoid genksyms calculating CRC of trimmed
   exported symbols

 - misc cleanups

* tag 'kbuild-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (63 commits)
  genksyms: convert to SPDX License Identifier for lex.l and parse.y
  modpost: use __section in the output to *.mod.c
  modpost: use MODULE_INFO() for __module_depends
  export.h, genksyms: do not make genksyms calculate CRC of trimmed symbols
  export.h: remove defined(__KERNEL__), which is no longer needed
  kbuild: allow Clang to find unused static inline functions for W=1 build
  kbuild: rename KBUILD_ENABLE_EXTRA_GCC_CHECKS to KBUILD_EXTRA_WARN
  kbuild: refactor scripts/Makefile.extrawarn
  merge_config.sh: ignore unwanted grep errors
  kbuild: change *FLAGS_<basetarget>.o to take the path relative to $(obj)
  modpost: add NOFAIL to strndup
  modpost: add guid_t type definition
  kbuild: add $(BASH) to run scripts with bash-extension
  kbuild: remove ARCH_{CPP,A,C}FLAGS
  kbuild,arc: add CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3 for ARC
  kbuild: Do not enable -Wimplicit-fallthrough for clang for now
  kbuild: clean up subdir-ymn calculation in Makefile.clean
  kbuild: remove unneeded '+' marker from cmd_clean
  kbuild: remove clean-dirs syntax
  kbuild: check clean srctree even earlier
  ...
2019-09-20 08:36:47 -07:00
Linus Torvalds
671df18953 dma-mapping updates for 5.4:
- add dma-mapping and block layer helpers to take care of IOMMU
    merging for mmc plus subsequent fixups (Yoshihiro Shimoda)
  - rework handling of the pgprot bits for remapping (me)
  - take care of the dma direct infrastructure for swiotlb-xen (me)
  - improve the dma noncoherent remapping infrastructure (me)
  - better defaults for ->mmap, ->get_sgtable and ->get_required_mask (me)
  - cleanup mmaping of coherent DMA allocations (me)
  - various misc cleanups (Andy Shevchenko, me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl2CSucLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPfrhAAgXZA/EdFPvkkCoDrmgtf3XkudX9gajeCd9g4NZy6
 ZBQElTVvm4S0sQj7IXgALnMumDMbbTibW5SQLX5GwQDe+XXBpZ8ajpAnJAXc8a5T
 qaFQ4SInr4CgBZf9nZKDkbSBZ1Tu3AQm1c0QI8riRCkrVTuX4L06xpCef4Yh4mgO
 rwWEjIioYpQiKZMmu98riXh3ZNfFG3mVJRhKt8B6XJbBgnUnjDOPYGgaUwp6CU20
 tFBKL2GaaV0vdLJ5wYhIGXT4DJ8tp9T5n3IYGZv1Ux889RaZEHlCrMxzelYeDbCT
 KhZbhcSECGnddsh73t/UX7/KhytuqnfKa9n+Xo6AWuA47xO4c36quOOcTk9M0vE5
 TfGDmewgL6WIv4lzokpRn5EkfDhyL33j8eYJrJ8e0ldcOhSQIFk4ciXnf2stWi6O
 JrlzzzSid+zXxu48iTfoPdnMr7psTpiMvvRvKfEeMp2FX9Fg6EdMzJYLTEl+COHB
 0WwNacZmY3P01+b5EZXEgqKEZevIIdmPKbyM9rPtTjz8BjBwkABHTpN3fWbVBf7/
 Ax6OPYyW40xp1fnJuzn89m3pdOxn88FpDdOaeLz892Zd+Qpnro1ayulnFspVtqGM
 mGbzA9whILvXNRpWBSQrvr2IjqMRjbBxX3BVACl3MMpOChgkpp5iANNfSDjCftSF
 Zu8=
 =/wGv
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - add dma-mapping and block layer helpers to take care of IOMMU merging
   for mmc plus subsequent fixups (Yoshihiro Shimoda)

 - rework handling of the pgprot bits for remapping (me)

 - take care of the dma direct infrastructure for swiotlb-xen (me)

 - improve the dma noncoherent remapping infrastructure (me)

 - better defaults for ->mmap, ->get_sgtable and ->get_required_mask
   (me)

 - cleanup mmaping of coherent DMA allocations (me)

 - various misc cleanups (Andy Shevchenko, me)

* tag 'dma-mapping-5.4' of git://git.infradead.org/users/hch/dma-mapping: (41 commits)
  mmc: renesas_sdhi_internal_dmac: Add MMC_CAP2_MERGE_CAPABLE
  mmc: queue: Fix bigger segments usage
  arm64: use asm-generic/dma-mapping.h
  swiotlb-xen: merge xen_unmap_single into xen_swiotlb_unmap_page
  swiotlb-xen: simplify cache maintainance
  swiotlb-xen: use the same foreign page check everywhere
  swiotlb-xen: remove xen_swiotlb_dma_mmap and xen_swiotlb_dma_get_sgtable
  xen: remove the exports for xen_{create,destroy}_contiguous_region
  xen/arm: remove xen_dma_ops
  xen/arm: simplify dma_cache_maint
  xen/arm: use dev_is_dma_coherent
  xen/arm: consolidate page-coherent.h
  xen/arm: use dma-noncoherent.h calls for xen-swiotlb cache maintainance
  arm: remove wrappers for the generic dma remap helpers
  dma-mapping: introduce a dma_common_find_pages helper
  dma-mapping: always use VM_DMA_COHERENT for generic DMA remap
  vmalloc: lift the arm flag for coherent mappings to common code
  dma-mapping: provide a better default ->get_required_mask
  dma-mapping: remove the dma_declare_coherent_memory export
  remoteproc: don't allow modular build
  ...
2019-09-19 13:27:23 -07:00
Li RongQing
e430d802d6 timer: Read jiffies once when forwarding base clk
The timer delayed for more than 3 seconds warning was triggered during
testing.

  Workqueue: events_unbound sched_tick_remote
  RIP: 0010:sched_tick_remote+0xee/0x100
  ...
  Call Trace:
   process_one_work+0x18c/0x3a0
   worker_thread+0x30/0x380
   kthread+0x113/0x130
   ret_from_fork+0x22/0x40

The reason is that the code in collect_expired_timers() uses jiffies
unprotected:

    if (next_event > jiffies)
        base->clk = jiffies;

As the compiler is allowed to reload the value base->clk can advance
between the check and the store and in the worst case advance farther than
next event. That causes the timer expiry to be delayed until the wheel
pointer wraps around.

Convert the code to use READ_ONCE()

Fixes: 236968383c ("timers: Optimize collect_expired_timers() for NOHZ")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng@baidu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1568894687-14499-1-git-send-email-lirongqing@baidu.com
2019-09-19 17:50:11 +02:00
Masami Hiramatsu
fe60b0ce8e tracing/probe: Reject exactly same probe event
Reject exactly same probe events as existing probes.

Multiprobe allows user to define multiple probes on same
event. If user appends a probe which exactly same definition
(same probe address and same arguments) on existing event,
the event will record same probe information twice.
That can be confusing users, so reject it.

Link: http://lkml.kernel.org/r/156879694602.31056.5533024778165036763.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-19 11:09:16 -04:00
Masami Hiramatsu
44d00dc7ce tracing/probe: Fix to allow user to enable events on unloaded modules
Fix to allow user to enable probe events on unloaded modules.

This operations was allowed before commit 60d53e2c3b ("tracing/probe:
Split trace_event related data from trace_probe"), because if users
need to probe module init functions, they have to enable those probe
events before loading module.

Link: http://lkml.kernel.org/r/156879693733.31056.9331322616994665167.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-19 09:55:42 -04:00
Alexei Starovoitov
9eea984979 bpf: fix BTF verification of enums
vmlinux BTF has enums that are 8 byte and 1 byte in size.
2 byte enum is a valid construct as well.
Fix BTF enum verification to accept those sizes.

Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-19 14:22:44 +02:00
David Howells
d2935de7e4 vfs: Convert bpf to use the new mount API
Convert the bpf filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexei Starovoitov <ast@kernel.org>
cc: Daniel Borkmann <daniel@iogearbox.net>
cc: Martin KaFai Lau <kafai@fb.com>
cc: Song Liu <songliubraving@fb.com>
cc: Yonghong Song <yhs@fb.com>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-18 22:35:31 -04:00
Tejun Heo
def98c84b6 workqueue: Fix spurious sanity check failures in destroy_workqueue()
Before actually destrying a workqueue, destroy_workqueue() checks
whether it's actually idle.  If it isn't, it prints out a bunch of
warning messages and leaves the workqueue dangling.  It unfortunately
has a couple issues.

* Mayday list queueing increments pwq's refcnts which gets detected as
  busy and fails the sanity checks.  However, because mayday list
  queueing is asynchronous, this condition can happen without any
  actual work items left in the workqueue.

* Sanity check failure leaves the sysfs interface behind too which can
  lead to init failure of newer instances of the workqueue.

This patch fixes the above two by

* If a workqueue has a rescuer, disable and kill the rescuer before
  sanity checks.  Disabling and killing is guaranteed to flush the
  existing mayday list.

* Remove sysfs interface before sanity checks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Marcin Pawlowski <mpawlowski@fb.com>
Reported-by: "Williams, Gerald S" <gerald.s.williams@intel.com>
Cc: stable@vger.kernel.org
2019-09-18 18:45:23 -07:00
Linus Torvalds
81160dda9a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.

 2) Use bio_vec in the networking instead of custom skb_frag_t, from
    Matthew Wilcox.

 3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.

 4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.

 5) Support all variants of 5750X bnxt_en chips, from Michael Chan.

 6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
    Buslov.

 7) Add TCP syn cookies bpf helper, from Petar Penkov.

 8) Add 'nettest' to selftests and use it, from David Ahern.

 9) Add extack support to drop_monitor, add packet alert mode and
    support for HW drops, from Ido Schimmel.

10) Add VLAN offload to stmmac, from Jose Abreu.

11) Lots of devm_platform_ioremap_resource() conversions, from
    YueHaibing.

12) Add IONIC driver, from Shannon Nelson.

13) Several kTLS cleanups, from Jakub Kicinski.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
  mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
  mlxsw: spectrum: Register CPU port with devlink
  mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
  net: ena: fix incorrect update of intr_delay_resolution
  net: ena: fix retrieval of nonadaptive interrupt moderation intervals
  net: ena: fix update of interrupt moderation register
  net: ena: remove all old adaptive rx interrupt moderation code from ena_com
  net: ena: remove ena_restore_ethtool_params() and relevant fields
  net: ena: remove old adaptive interrupt moderation code from ena_netdev
  net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
  net: ena: enable the interrupt_moderation in driver_supported_features
  net: ena: reimplement set/get_coalesce()
  net: ena: switch to dim algorithm for rx adaptive interrupt moderation
  net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
  net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
  ethtool: implement Energy Detect Powerdown support via phy-tunable
  xen-netfront: do not assume sk_buff_head list is empty in error handling
  s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
  net: ena: don't wake up tx queue when down
  drop_monitor: Better sanitize notified packets
  ...
2019-09-18 12:34:53 -07:00
Linus Torvalds
8b53c76533 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add the ability to abort a skcipher walk.

  Algorithms:
   - Fix XTS to actually do the stealing.
   - Add library helpers for AES and DES for single-block users.
   - Add library helpers for SHA256.
   - Add new DES key verification helper.
   - Add surrounding bits for ESSIV generator.
   - Add accelerations for aegis128.
   - Add test vectors for lzo-rle.

  Drivers:
   - Add i.MX8MQ support to caam.
   - Add gcm/ccm/cfb/ofb aes support in inside-secure.
   - Add ofb/cfb aes support in media-tek.
   - Add HiSilicon ZIP accelerator support.

  Others:
   - Fix potential race condition in padata.
   - Use unbound workqueues in padata"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (311 commits)
  crypto: caam - Cast to long first before pointer conversion
  crypto: ccree - enable CTS support in AES-XTS
  crypto: inside-secure - Probe transform record cache RAM sizes
  crypto: inside-secure - Base RD fetchcount on actual RD FIFO size
  crypto: inside-secure - Base CD fetchcount on actual CD FIFO size
  crypto: inside-secure - Enable extended algorithms on newer HW
  crypto: inside-secure: Corrected configuration of EIP96_TOKEN_CTRL
  crypto: inside-secure - Add EIP97/EIP197 and endianness detection
  padata: remove cpu_index from the parallel_queue
  padata: unbind parallel jobs from specific CPUs
  padata: use separate workqueues for parallel and serial work
  padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
  crypto: pcrypt - remove padata cpumask notifier
  padata: make padata_do_parallel find alternate callback CPU
  workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
  workqueue: unconfine alloc/apply/free_workqueue_attrs()
  padata: allocate workqueue internally
  arm64: dts: imx8mq: Add CAAM node
  random: Use wait_event_freezable() in add_hwgenerator_randomness()
  crypto: ux500 - Fix COMPILE_TEST warnings
  ...
2019-09-18 12:11:14 -07:00
Naveen N. Rao
a3db31ff6c ftrace: Look up the address of return_to_handler() using helpers
This ensures that we use the right address on architectures that use
function descriptors.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/8f6f14d192a994008ac370ce14036bbe67224c7d.1567707399.git.naveen.n.rao@linux.vnet.ibm.com
2019-09-18 12:24:47 +10:00
Linus Torvalds
77dcfe2b9e Power management updates for 5.4-rc1
- Rework the main suspend-to-idle control flow to avoid repeating
    "noirq" device resume and suspend operations in case of spurious
    wakeups from the ACPI EC and decouple the ACPI EC wakeups support
    from the LPS0 _DSM support (Rafael Wysocki).
 
  - Extend the wakeup sources framework to expose wakeup sources as
    device objects in sysfs (Tri Vo, Stephen Boyd).
 
  - Expose system suspend statistics in sysfs (Kalesh Singh).
 
  - Introduce a new haltpoll cpuidle driver and a new matching
    governor for virtualized guests wanting to do guest-side polling
    in the idle loop (Marcelo Tosatti, Joao Martins, Wanpeng Li,
    Stephen Rothwell).
 
  - Fix the menu and teo cpuidle governors to allow the scheduler tick
    to be stopped if PM QoS is used to limit the CPU idle state exit
    latency in some cases (Rafael Wysocki).
 
  - Increase the resolution of the play_idle() argument to microseconds
    for more fine-grained injection of CPU idle cycles (Daniel Lezcano).
 
  - Switch over some users of cpuidle notifiers to the new QoS-based
    frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY
    policy notifier events (Viresh Kumar).
 
  - Add new cpufreq driver based on nvmem for sun50i (Yangtao Li).
 
  - Add support for MT8183 and MT8516 to the mediatek cpufreq driver
    (Andrew-sh.Cheng, Fabien Parent).
 
  - Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson
    Huang).
 
  - Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz).
 
  - Update the qcom cpufreq driver (among other things, to make it
    easier to extend and to use kryo cpufreq for other nvmem-based
    SoCs) and add qcs404 support to it  (Niklas Cassel, Douglas
    RAILLARD, Sibi Sankar, Sricharan R).
 
  - Fix assorted issues and make assorted minor improvements in the
    cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli,
    Gustavo Silva, Hariprasad Kelam).
 
  - Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd
    Bergmann).
 
  - Add new Exynos PPMU events to devfreq events and extend that
    mechanism (Lukasz Luba).
 
  - Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny).
 
  - Improve devfreq documentation and governor code, fix spelling
    typos in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard
    Crestez, MyungJoo Ham, Gaël PORTAY).
 
  - Add regulators enable and disable to the OPP (operating performance
    points) framework (Kamil Konieczny).
 
  - Update the OPP framework to support multiple opp-suspend properties
    (Anson Huang).
 
  - Fix assorted issues and make assorted minor improvements in the OPP
    code (Niklas Cassel, Viresh Kumar, Yue Hu).
 
  - Clean up the generic power domains (genpd) framework (Ulf Hansson).
 
  - Clean up assorted pieces of power management code and documentation
    (Akinobu Mita, Amit Kucheria, Chuhong Yuan).
 
  - Update the pm-graph tool to version 5.5 including multiple fixes
    and improvements (Todd Brandt).
 
  - Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven,
    Sébastien Szymanski).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl2ArZ4SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxgfYQAK80hs43vWQDmp7XKrN4pQe8+qYULAGO
 fBfrFl+NG9y/cnuqnt3NtA8MoyNsMMkMLkpkEDMfSbYqqH5ehEzX5+uGJWiWx8+Y
 oH5KU8MH7Tj/utYaalGzDt0AHfHZDIGC0NCUNQJVtE/4mOANFabwsCwscp4MrD5Q
 WjFN8U4BrsmWgJdZ/U9QIWcDZ0I+1etCF+rZG2yxSv31FMq2Zk/Qm4YyobqCvQFl
 TR9rxl08wqUmIYIz5cDjt/3AKH7NLLDqOTstbCL7cmufM5XPFc1yox69xc89UrIa
 4AMgmDp7SMwFG/gdUPof0WQNmx7qxmiRAPleAOYBOZW/8jPNZk2y+RhM5NeF72m7
 AFqYiuxqatkSb4IsT8fLzH9IUZOdYr8uSmoMQECw+MHdApaKFjFV8Lb/qx5+AwkD
 y7pwys8dZSamAjAf62eUzJDWcEwkNrujIisGrIXrVHb7ISbweskMOmdAYn9p4KgP
 dfRzpJBJ45IaMIdbaVXNpg3rP7Apfs7X1X+/ZhG6f+zHH3zYwr8Y81WPqX8WaZJ4
 qoVCyxiVWzMYjY2/1lzjaAdqWojPWHQ3or3eBaK52DouyG3jY6hCDTLwU7iuqcCX
 jzAtrnqrNIKufvaObEmqcmYlIIOFT7QaJCtGUSRFQLfSon8fsVSR7LLeXoAMUJKT
 JWQenuNaJngK
 =TBDQ
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These include a rework of the main suspend-to-idle code flow (related
  to the handling of spurious wakeups), a switch over of several users
  of cpufreq notifiers to QoS-based limits, a new devfreq driver for
  Tegra20, a new cpuidle driver and governor for virtualized guests, an
  extension of the wakeup sources framework to expose wakeup sources as
  device objects in sysfs, and more.

  Specifics:

   - Rework the main suspend-to-idle control flow to avoid repeating
     "noirq" device resume and suspend operations in case of spurious
     wakeups from the ACPI EC and decouple the ACPI EC wakeups support
     from the LPS0 _DSM support (Rafael Wysocki).

   - Extend the wakeup sources framework to expose wakeup sources as
     device objects in sysfs (Tri Vo, Stephen Boyd).

   - Expose system suspend statistics in sysfs (Kalesh Singh).

   - Introduce a new haltpoll cpuidle driver and a new matching governor
     for virtualized guests wanting to do guest-side polling in the idle
     loop (Marcelo Tosatti, Joao Martins, Wanpeng Li, Stephen Rothwell).

   - Fix the menu and teo cpuidle governors to allow the scheduler tick
     to be stopped if PM QoS is used to limit the CPU idle state exit
     latency in some cases (Rafael Wysocki).

   - Increase the resolution of the play_idle() argument to microseconds
     for more fine-grained injection of CPU idle cycles (Daniel
     Lezcano).

   - Switch over some users of cpuidle notifiers to the new QoS-based
     frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY
     policy notifier events (Viresh Kumar).

   - Add new cpufreq driver based on nvmem for sun50i (Yangtao Li).

   - Add support for MT8183 and MT8516 to the mediatek cpufreq driver
     (Andrew-sh.Cheng, Fabien Parent).

   - Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson
     Huang).

   - Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz).

   - Update the qcom cpufreq driver (among other things, to make it
     easier to extend and to use kryo cpufreq for other nvmem-based
     SoCs) and add qcs404 support to it (Niklas Cassel, Douglas
     RAILLARD, Sibi Sankar, Sricharan R).

   - Fix assorted issues and make assorted minor improvements in the
     cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli,
     Gustavo Silva, Hariprasad Kelam).

   - Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd
     Bergmann).

   - Add new Exynos PPMU events to devfreq events and extend that
     mechanism (Lukasz Luba).

   - Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny).

   - Improve devfreq documentation and governor code, fix spelling typos
     in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard Crestez,
     MyungJoo Ham, Gaël PORTAY).

   - Add regulators enable and disable to the OPP (operating performance
     points) framework (Kamil Konieczny).

   - Update the OPP framework to support multiple opp-suspend properties
     (Anson Huang).

   - Fix assorted issues and make assorted minor improvements in the OPP
     code (Niklas Cassel, Viresh Kumar, Yue Hu).

   - Clean up the generic power domains (genpd) framework (Ulf Hansson).

   - Clean up assorted pieces of power management code and documentation
     (Akinobu Mita, Amit Kucheria, Chuhong Yuan).

   - Update the pm-graph tool to version 5.5 including multiple fixes
     and improvements (Todd Brandt).

   - Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven,
     Sébastien Szymanski)"

* tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (126 commits)
  cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available
  cpuidle-haltpoll: do not set an owner to allow modunload
  cpuidle-haltpoll: return -ENODEV on modinit failure
  cpuidle-haltpoll: set haltpoll as preferred governor
  cpuidle: allow governor switch on cpuidle_register_driver()
  PM: runtime: Documentation: add runtime_status ABI document
  pm-graph: make setVal unbuffered again for python2 and python3
  powercap: idle_inject: Use higher resolution for idle injection
  cpuidle: play_idle: Increase the resolution to usec
  cpuidle-haltpoll: vcpu hotplug support
  cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist
  cpufreq: qcom: Add support for qcs404 on nvmem driver
  cpufreq: qcom: Refactor the driver to make it easier to extend
  cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs
  dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR
  dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain
  Documentation: cpufreq: Update policy notifier documentation
  cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events
  PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state()
  PM / Domains: Simplify genpd_lookup_dev()
  ...
2019-09-17 19:15:14 -07:00
Linus Torvalds
3ee8d6c592 Merge branch 'for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Three minor cleanup patches"

* 'for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Use kvmalloc in cgroups-v1
  cgroup: minor tweak for logic to get cgroup css
  cgroup: Replace a seq_printf() call by seq_puts() in cgroup_print_ss_mask()
2019-09-17 15:57:22 -07:00
Linus Torvalds
7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00
Linus Torvalds
a572ba6329 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core irq updates from Thomas Gleixner:
 "Updates from the irq departement:

   - Update the interrupt spreading code so it handles numa node with
     different CPU counts properly.

   - A large overhaul of the ARM GiCv3 driver to support new PPI and SPI
     ranges.

   - Conversion of all alloc_fwnode() users to use physical addresses
     instead of virtual addresses so the virtual addresses are not
     leaked. The physical address is sufficient to identify the
     associated interrupt chip.

   - Add support for Marvel MMP3, Amlogic Meson SM1 interrupt chips.

   - Enforce interrupt threading at compile time if RT is enabled.

   - Small updates and improvements all over the place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  irqchip/gic-v3-its: Fix LPI release for Multi-MSI devices
  irqchip/uniphier-aidet: Use devm_platform_ioremap_resource()
  irqdomain: Add the missing assignment of domain->fwnode for named fwnode
  irqchip/mmp: Coexist with GIC root IRQ controller
  irqchip/mmp: Mask off interrupts from other cores
  irqchip/mmp: Add missing chained_irq_{enter,exit}()
  irqchip/mmp: Do not use of_address_to_resource() to get mux regs
  irqchip/meson-gpio: Add support for meson sm1 SoCs
  dt-bindings: interrupt-controller: New binding for the meson sm1 SoCs
  genirq/affinity: Remove const qualifier from node_to_cpumask argument
  genirq/affinity: Spread vectors on node according to nr_cpu ratio
  genirq/affinity: Improve __irq_build_affinity_masks()
  irqchip: Remove dev_err() usage after platform_get_irq()
  irqchip: Add include guard to irq-partition-percpu.h
  irqchip/mmp: Do not call irq_set_default_host() on DT platforms
  irqchip/gic-v3-its: Remove the redundant set_bit for lpi_map
  irqchip/gic-v3: Add quirks for HIP06/07 invalid GICD_TYPER erratum 161010803
  irqchip/gic: Skip DT quirks when evaluating IIDR-based quirks
  irqchip/gic-v3: Warn about inconsistent implementations of extended ranges
  irqchip/gic-v3: Add EPPI range support
  ...
2019-09-17 11:42:15 -07:00
Linus Torvalds
3cd0462230 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
 "A small update for the SMP hotplug code code:

   - Track "booted once" CPUs in a cpumask so the x86 APIC code has an
     easy way to decide whether broadcast IPIs are safe to use or not.

   - Implement a cpumask_or_equal() helper for the IPI broadcast
     evaluation.

     The above two changes have been also pulled into the x86/apic
     branch for implementing the conditional IPI broadcast feature.

   - Cache the number of online CPUs instead of reevaluating it over and
     over. num_online_cpus() is an unreliable snapshot anyway except
     when it is used outside a cpu hotplug locked region. The cached
     access is not changing this, but it's definitely faster than
     calculating the bitmap wheight especially in hot paths"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Cache number of online CPUs
  cpumask: Implement cpumask_or_equal()
  smp/hotplug: Track booted once CPUs in a cpumask
2019-09-17 10:32:05 -07:00
Linus Torvalds
16208cd6c3 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A single fix to prevent the alarm timer code from returning ENOTSUPP
  to user space.

  ENOTSUPP is a purely kernel internal error code related to NFSv3 and
  should never be handed back to user space. The risk for ABI breakage
  is low as the number of systems which do not have a working RTC is
  very limited"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP
2019-09-17 10:12:41 -07:00
Masami Hiramatsu
d59fae6fea tracing/kprobe: Fix NULL pointer access in trace_porbe_unlink()
Fix NULL pointer access in trace_probe_unlink() by initializing
trace_probe.list correctly in trace_probe_init().

In the error case of trace_probe_init(), it can call trace_probe_unlink()
before initializing trace_probe.list member. This causes NULL pointer
dereference at list_del_init() in trace_probe_unlink().

Syzbot reported :

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8633 Comm: syz-executor797 Not tainted 5.3.0-rc8-next-20190915
#0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__list_del_entry_valid+0x85/0xf5 lib/list_debug.c:51
Code: 0f 84 e1 00 00 00 48 b8 22 01 00 00 00 00 ad de 49 39 c4 0f 84 e2 00
00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 75
53 49 8b 14 24 4c 39 f2 0f 85 99 00 00 00 49 8d 7d
RSP: 0018:ffff888090a7f9d8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff88809b6f90c0 RCX: ffffffff817c0ca9
RDX: 0000000000000000 RSI: ffffffff817c0a73 RDI: ffff88809b6f90c8
RBP: ffff888090a7f9f0 R08: ffff88809a04e600 R09: ffffed1015d26aed
R10: ffffed1015d26aec R11: ffff8880ae935763 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88809b6f90c0 R15: ffff88809b6f90d0
FS:  0000555556f99880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006cc090 CR3: 00000000962b2000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __list_del_entry include/linux/list.h:131 [inline]
  list_del_init include/linux/list.h:190 [inline]
  trace_probe_unlink+0x1f/0x200 kernel/trace/trace_probe.c:959
  trace_probe_cleanup+0xd3/0x110 kernel/trace/trace_probe.c:973
  trace_probe_init+0x3f2/0x510 kernel/trace/trace_probe.c:1011
  alloc_trace_uprobe+0x5e/0x250 kernel/trace/trace_uprobe.c:353
  create_local_trace_uprobe+0x109/0x4a0 kernel/trace/trace_uprobe.c:1508
  perf_uprobe_init+0x131/0x210 kernel/trace/trace_event_perf.c:314
  perf_uprobe_event_init+0x106/0x1a0 kernel/events/core.c:8898
  perf_try_init_event+0x135/0x590 kernel/events/core.c:10184
  perf_init_event kernel/events/core.c:10228 [inline]
  perf_event_alloc.part.0+0x1b89/0x33d0 kernel/events/core.c:10505
  perf_event_alloc kernel/events/core.c:10887 [inline]
  __do_sys_perf_event_open+0xa2d/0x2d00 kernel/events/core.c:10989
  __se_sys_perf_event_open kernel/events/core.c:10871 [inline]
  __x64_sys_perf_event_open+0xbe/0x150 kernel/events/core.c:10871
  do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/156869709721.22406.5153754822203046939.stgit@devnote2

Reported-by: syzbot+2f807f4d3a2a4e87f18f@syzkaller.appspotmail.com
Fixes: ca89bc071d ("tracing/kprobe: Add multi-probe per event support")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:29 -04:00
Tom Zanussi
17f8607a16 tracing: Make sure variable reference alias has correct var_ref_idx
Original changelog from Steve Rostedt (except last sentence which
explains the problem, and the Fixes: tag):

I performed a three way histogram with the following commands:

echo 'irq_lat u64 lat pid_t pid' > synthetic_events
echo 'wake_lat u64 lat u64 irqlat pid_t pid' >> synthetic_events
echo 'hist:keys=common_pid:irqts=common_timestamp.usecs if function == 0xffffffff81200580' > events/timer/hrtimer_start/trigger
echo 'hist:keys=common_pid:lat=common_timestamp.usecs-$irqts:onmatch(timer.hrtimer_start).irq_lat($lat,pid) if common_flags & 1' > events/sched/sched_waking/trigger
echo 'hist:keys=pid:wakets=common_timestamp.usecs,irqlat=lat' > events/synthetic/irq_lat/trigger
echo 'hist:keys=next_pid:lat=common_timestamp.usecs-$wakets,irqlat=$irqlat:onmatch(synthetic.irq_lat).wake_lat($lat,$irqlat,next_pid)' > events/sched/sched_switch/trigger
echo 1 > events/synthetic/wake_lat/enable

Basically I wanted to see:

 hrtimer_start (calling function tick_sched_timer)

Note:

  # grep tick_sched_timer /proc/kallsyms
ffffffff81200580 t tick_sched_timer

And save the time of that, and then record sched_waking if it is called
in interrupt context and with the same pid as the hrtimer_start, it
will record the latency between that and the waking event.

I then look at when the task that is woken is scheduled in, and record
the latency between the wakeup and the task running.

At the end, the wake_lat synthetic event will show the wakeup to
scheduled latency, as well as the irq latency in from hritmer_start to
the wakeup. The problem is that I found this:

          <idle>-0     [007] d...   190.485261: wake_lat: lat=27 irqlat=190485230 pid=698
          <idle>-0     [005] d...   190.485283: wake_lat: lat=40 irqlat=190485239 pid=10
          <idle>-0     [002] d...   190.488327: wake_lat: lat=56 irqlat=190488266 pid=335
          <idle>-0     [005] d...   190.489330: wake_lat: lat=64 irqlat=190489262 pid=10
          <idle>-0     [003] d...   190.490312: wake_lat: lat=43 irqlat=190490265 pid=77
          <idle>-0     [005] d...   190.493322: wake_lat: lat=54 irqlat=190493262 pid=10
          <idle>-0     [005] d...   190.497305: wake_lat: lat=35 irqlat=190497267 pid=10
          <idle>-0     [005] d...   190.501319: wake_lat: lat=50 irqlat=190501264 pid=10

The irqlat seemed quite large! Investigating this further, if I had
enabled the irq_lat synthetic event, I noticed this:

          <idle>-0     [002] d.s.   249.429308: irq_lat: lat=164968 pid=335
          <idle>-0     [002] d...   249.429369: wake_lat: lat=55 irqlat=249429308 pid=335

Notice that the timestamp of the irq_lat "249.429308" is awfully
similar to the reported irqlat variable. In fact, all instances were
like this. It appeared that:

  irqlat=$irqlat

Wasn't assigning the old $irqlat to the new irqlat variable, but
instead was assigning the $irqts to it.

The issue is that assigning the old $irqlat to the new irqlat variable
creates a variable reference alias, but the alias creation code
forgets to make sure the alias uses the same var_ref_idx to access the
reference.

Link: http://lkml.kernel.org/r/1567375321.5282.12.camel@kernel.org

Cc: Linux Trace Devel <linux-trace-devel@vger.kernel.org>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7e8b88a30b ("tracing: Add hist trigger support for variable reference aliases")
Reported-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:29 -04:00
Andy Shevchenko
119cdbdb95 tracing: Be more clever when dumping hex in __print_hex()
Hex dump as many as 16 bytes at once in trace_print_hex_seq()
instead of byte-by-byte approach.

Link: http://lkml.kernel.org/r/20190806151543.86061-1-andriy.shevchenko@linux.intel.com

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:28 -04:00
Changbin Du
08468754c1 ftrace: Simplify ftrace hash lookup code in clear_func_from_hash()
Function ftrace_lookup_ip() will check empty hash table. So we don't
need extra check outside.

Link: http://lkml.kernel.org/r/20190910143336.13472-1-changbin.du@gmail.com

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-17 11:21:20 -04:00
Qian Cai
dac9f027b1 sched/fair: Remove unused cfs_rq_clock_task() function
cfs_rq_clock_task() was first introduced and used in:

  f1b17280ef ("sched: Maintain runnable averages across throttled periods")

Over time its use has been graduately removed by the following commits:

  d31b1a66cb ("sched/fair: Factorize PELT update")
  2312729688 ("sched/fair: Update scale invariance of PELT")

Today, there is no single user left, so it can be safely removed.

Found via the -Wunused-function build warning.

Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1568668775-2127-1-git-send-email-cai@lca.pw
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-17 09:55:02 +02:00
Rafael J. Wysocki
fc6763a2d7 Merge branches 'pm-opp', 'pm-qos', 'acpi-pm', 'pm-domains' and 'pm-tools'
* pm-opp:
  PM / OPP: Correct Documentation about library location
  opp: of: Support multiple suspend OPPs defined in DT
  dt-bindings: opp: Support multiple opp-suspend properties
  opp: core: add regulators enable and disable
  opp: Don't decrement uninitialized list_kref

* pm-qos:
  PM: QoS: Get rid of unused flags

* acpi-pm:
  ACPI: PM: Print debug messages on device power state changes

* pm-domains:
  PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state()
  PM / Domains: Simplify genpd_lookup_dev()
  PM / Domains: Align in-parameter names for some genpd functions

* pm-tools:
  pm-graph: make setVal unbuffered again for python2 and python3
  cpupower: update German translation
  tools/power/cpupower: fix 64bit detection when cross-compiling
  cpupower: Add missing newline at end of file
  pm-graph v5.5
2019-09-17 09:49:19 +02:00
Rafael J. Wysocki
ca61a72ac3 Merge branch 'pm-cpufreq'
* pm-cpufreq: (36 commits)
  cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist
  cpufreq: qcom: Add support for qcs404 on nvmem driver
  cpufreq: qcom: Refactor the driver to make it easier to extend
  cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs
  dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR
  dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain
  Documentation: cpufreq: Update policy notifier documentation
  cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events
  sched/cpufreq: Align trace event behavior of fast switching
  ACPI: cpufreq: Switch to QoS requests instead of cpufreq notifier
  video: pxafb: Remove cpufreq policy notifier
  video: sa1100fb: Remove cpufreq policy notifier
  arch_topology: Use CPUFREQ_CREATE_POLICY instead of CPUFREQ_NOTIFY
  cpufreq: powerpc_cbe: Switch to QoS requests for freq limits
  cpufreq: powerpc: macintosh: Switch to QoS requests for freq limits
  cpufreq: Print driver name if cpufreq_suspend() fails
  cpufreq: mediatek: Add support for mt8183
  cpufreq: mediatek: change to regulator_get_optional
  cpufreq: imx-cpufreq-dt: Add i.MX8MN support
  cpufreq: Use imx-cpufreq-dt for i.MX8MN's speed grading
  ...
2019-09-17 09:44:29 +02:00
Rafael J. Wysocki
2cdd5cc703 Merge branch 'pm-cpuidle'
* pm-cpuidle:
  cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available
  cpuidle-haltpoll: do not set an owner to allow modunload
  cpuidle-haltpoll: return -ENODEV on modinit failure
  cpuidle-haltpoll: set haltpoll as preferred governor
  cpuidle: allow governor switch on cpuidle_register_driver()
  powercap: idle_inject: Use higher resolution for idle injection
  cpuidle: play_idle: Increase the resolution to usec
  cpuidle-haltpoll: vcpu hotplug support
  cpuidle: teo: Get rid of redundant check in teo_update()
  cpuidle: teo: Allow tick to be stopped if PM QoS is used
  cpuidle: menu: Allow tick to be stopped if PM QoS is used
  cpuidle: header file stubs must be "static inline"
  cpuidle-haltpoll: disable host side polling when kvm virtualized
  cpuidle: add haltpoll governor
  governors: unify last_state_idx
  cpuidle: add poll_limit_ns to cpuidle_device structure
  add cpuidle-haltpoll driver
2019-09-17 09:41:26 +02:00
Rafael J. Wysocki
d281706369 Merge branch 'pm-sleep'
* pm-sleep: (29 commits)
  ACPI: PM: s2idle: Always set up EC GPE for system wakeup
  ACPI: PM: s2idle: Avoid rearming SCI for wakeup unnecessarily
  PM / wakeup: Unexport wakeup_source_sysfs_{add,remove}()
  PM / wakeup: Register wakeup class kobj after device is added
  PM / wakeup: Fix sysfs registration error path
  PM / wakeup: Show wakeup sources stats in sysfs
  PM / wakeup: Use wakeup_source_register() in wakelock.c
  PM / wakeup: Drop wakeup_source_init(), wakeup_source_prepare()
  PM: sleep: Replace strncmp() with str_has_prefix()
  PM: suspend: Fix platform_suspend_prepare_noirq()
  intel-hid: Disable button array during suspend-to-idle
  intel-hid: intel-vbtn: Avoid leaking wakeup_mode set
  ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices
  ACPI: EC: PM: Make acpi_ec_dispatch_gpe() print debug message
  ACPI: EC: PM: Consolidate some code depending on PM_SLEEP
  ACPI: PM: s2idle: Eliminate acpi_sleep_no_ec_events()
  ACPI: PM: s2idle: Switch EC over to polling during "noirq" suspend
  ACPI: PM: s2idle: Add acpi.sleep_no_lps0 module parameter
  ACPI: PM: s2idle: Rearrange lps0_device_attach()
  PM/sleep: Expose suspend stats in sysfs
  ...
2019-09-17 09:36:34 +02:00
Rafael J. Wysocki
1b531e55c5 Merge suspend-to-idle rework material for v5.4.
* pm-s2idle-rework: (21 commits)
  ACPI: PM: s2idle: Always set up EC GPE for system wakeup
  ACPI: PM: s2idle: Avoid rearming SCI for wakeup unnecessarily
  PM: suspend: Fix platform_suspend_prepare_noirq()
  intel-hid: Disable button array during suspend-to-idle
  intel-hid: intel-vbtn: Avoid leaking wakeup_mode set
  ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices
  ACPI: EC: PM: Make acpi_ec_dispatch_gpe() print debug message
  ACPI: EC: PM: Consolidate some code depending on PM_SLEEP
  ACPI: PM: s2idle: Eliminate acpi_sleep_no_ec_events()
  ACPI: PM: s2idle: Switch EC over to polling during "noirq" suspend
  ACPI: PM: s2idle: Add acpi.sleep_no_lps0 module parameter
  ACPI: PM: s2idle: Rearrange lps0_device_attach()
  ACPI: PM: Set up EC GPE for system wakeup from drivers that need it
  PM: sleep: Drop dpm_noirq_begin() and dpm_noirq_end()
  PM: sleep: Integrate suspend-to-idle with generig suspend flow
  PM: sleep: Simplify suspend-to-idle control flow
  ACPI: PM: Set s2idle_wakeup earlier and clear it later
  PM: sleep: Fix possible overflow in pm_system_cancel_wakeup()
  ACPI: EC: Return bool from acpi_ec_dispatch_gpe()
  ACPICA: Return u32 from acpi_dispatch_gpe()
  ...
2019-09-17 09:35:35 +02:00
Linus Torvalds
22331f8952 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu-feature updates from Ingo Molnar:

 - Rework the Intel model names symbols/macros, which were decades of
   ad-hoc extensions and added random noise. It's now a coherent, easy
   to follow nomenclature.

 - Add new Intel CPU model IDs:
    - "Tiger Lake" desktop and mobile models
    - "Elkhart Lake" model ID
    - and the "Lightning Mountain" variant of Airmont, plus support code

 - Add the new AVX512_VP2INTERSECT instruction to cpufeatures

 - Remove Intel MPX user-visible APIs and the self-tests, because the
   toolchain (gcc) is not supporting it going forward. This is the
   first, lowest-risk phase of MPX removal.

 - Remove X86_FEATURE_MFENCE_RDTSC

 - Various smaller cleanups and fixes

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  x86/cpu: Update init data for new Airmont CPU model
  x86/cpu: Add new Airmont variant to Intel family
  x86/cpu: Add Elkhart Lake to Intel family
  x86/cpu: Add Tiger Lake to Intel family
  x86: Correct misc typos
  x86/intel: Add common OPTDIFFs
  x86/intel: Aggregate microserver naming
  x86/intel: Aggregate big core graphics naming
  x86/intel: Aggregate big core mobile naming
  x86/intel: Aggregate big core client naming
  x86/cpufeature: Explain the macro duplication
  x86/ftrace: Remove mcount() declaration
  x86/PCI: Remove superfluous returns from void functions
  x86/msr-index: Move AMD MSRs where they belong
  x86/cpu: Use constant definitions for CPU models
  lib: Remove redundant ftrace flag removal
  x86/crash: Remove unnecessary comparison
  x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
  x86: Remove X86_FEATURE_MFENCE_RDTSC
  x86/mpx: Remove MPX APIs
  ...
2019-09-16 18:47:53 -07:00
Linus Torvalds
7e67a85999 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
   Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
   Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

   As perf and the scheduler is getting bigger and more complex,
   document the status quo of current responsibilities and interests,
   and spread the review pain^H^H^H^H fun via an increase in the Cc:
   linecount generated by scripts/get_maintainer.pl. :-)

 - Add another series of patches that brings the -rt (PREEMPT_RT) tree
   closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
   into a new CONFIG_PREEMPTION category that will allow the eventual
   introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
   to go though.

 - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
   allow the finer shaping of CPU bandwidth usage.

 - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

 - Improve the behavior of high CPU count, high thread count
   applications running under cpu.cfs_quota_us constraints.

 - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

 - Improve CPU isolation housekeeping CPU allocation NUMA locality.

 - Fix deadline scheduler bandwidth calculations and logic when cpusets
   rebuilds the topology, or when it gets deadline-throttled while it's
   being offlined.

 - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
   setscheduler() system calls without creating global serialization.
   Add new synchronization between cpuset topology-changing events and
   the deadline acceptance tests in setscheduler(), which were broken
   before.

 - Rework the active_mm state machine to be less confusing and more
   optimal.

 - Rework (simplify) the pick_next_task() slowpath.

 - Improve load-balancing on AMD EPYC systems.

 - ... and misc cleanups, smaller fixes and improvements - please see
   the Git log for more details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  sched/psi: Correct overly pessimistic size calculation
  sched/fair: Speed-up energy-aware wake-ups
  sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
  sched/uclamp: Update CPU's refcount on TG's clamp changes
  sched/uclamp: Use TG's clamps to restrict TASK's clamps
  sched/uclamp: Propagate system defaults to the root group
  sched/uclamp: Propagate parent clamps
  sched/uclamp: Extend CPU's cgroup controller
  sched/topology: Improve load balancing on AMD EPYC systems
  arch, ia64: Make NUMA select SMP
  sched, perf: MAINTAINERS update, add submaintainers and reviewers
  sched/fair: Use rq_lock/unlock in online_fair_sched_group
  cpufreq: schedutil: fix equation in comment
  sched: Rework pick_next_task() slow-path
  sched: Allow put_prev_task() to drop rq->lock
  sched/fair: Expose newidle_balance()
  sched: Add task_struct pointer to sched_class::set_curr_task
  sched: Rework CPU hotplug task selection
  sched/{rt,deadline}: Fix set_next_task vs pick_next_task
  sched: Fix kerneldoc comment for ia64_set_curr_task
  ...
2019-09-16 17:25:49 -07:00
Linus Torvalds
772c1d06bd Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Improved kbprobes robustness

   - Intel PEBS support for PT hardware tracing

   - Other Intel PT improvements: high order pages memory footprint
     reduction and various related cleanups

   - Misc cleanups

  The perf tooling side has been very busy in this cycle, with over 300
  commits. This is an incomplete high-level summary of the many
  improvements done by over 30 developers:

   - Lots of updates to the following tools:

      'perf c2c'
      'perf config'
      'perf record'
      'perf report'
      'perf script'
      'perf test'
      'perf top'
      'perf trace'

   - Updates to libperf and libtraceevent, and a consolidation of the
     proliferation of x86 instruction decoder libraries.

   - Vendor event updates for Intel and PowerPC CPUs,

   - Updates to hardware tracing tooling for ARM and Intel CPUs,

   - ... and lots of other changes and cleanups - see the shortlog and
     Git log for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (322 commits)
  kprobes: Prohibit probing on BUG() and WARN() address
  perf/x86: Make more stuff static
  x86, perf: Fix the dependency of the x86 insn decoder selftest
  objtool: Ignore intentional differences for the x86 insn decoder
  objtool: Update sync-check.sh from perf's check-headers.sh
  perf build: Ignore intentional differences for the x86 insn decoder
  perf intel-pt: Use shared x86 insn decoder
  perf intel-pt: Remove inat.c from build dependency list
  perf: Update .gitignore file
  objtool: Move x86 insn decoder to a common location
  perf metricgroup: Support multiple events for metricgroup
  perf metricgroup: Scale the metric result
  perf pmu: Change convert_scale from static to global
  perf symbols: Move mem_info and branch_info out of symbol.h
  perf auxtrace: Uninline functions that touch perf_session
  perf tools: Remove needless evlist.h include directives
  perf tools: Remove needless evlist.h include directives
  perf tools: Remove needless thread_map.h include directives
  perf tools: Remove needless thread.h include directives
  perf tools: Remove needless map.h include directives
  ...
2019-09-16 17:06:21 -07:00
Linus Torvalds
c7eba51cfd Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - improve rwsem scalability

 - add uninitialized rwsem debugging check

 - reduce lockdep's stacktrace memory usage and add diagnostics

 - misc cleanups, code consolidation and constification

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mutex: Fix up mutex_waiter usage
  locking/mutex: Use mutex flags macro instead of hard code
  locking/mutex: Make __mutex_owner static to mutex.c
  locking/qspinlock,x86: Clarify virt_spin_lock_key
  locking/rwsem: Check for operations on an uninitialized rwsem
  locking/rwsem: Make handoff writer optimistically spin on owner
  locking/lockdep: Report more stack trace statistics
  locking/lockdep: Reduce space occupied by stack traces
  stacktrace: Constify 'entries' arguments
  locking/lockdep: Make it clear that what lock_class::key points at is not modified
2019-09-16 16:49:55 -07:00
Linus Torvalds
94d18ee934 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "This cycle's RCU changes were:

   - A few more RCU flavor consolidation cleanups.

   - Updates to RCU's list-traversal macros improving lockdep usability.

   - Forward-progress improvements for no-CBs CPUs: Avoid ignoring
     incoming callbacks during grace-period waits.

   - Forward-progress improvements for no-CBs CPUs: Use ->cblist
     structure to take advantage of others' grace periods.

   - Also added a small commit that avoids needlessly inflicting
     scheduler-clock ticks on callback-offloaded CPUs.

   - Forward-progress improvements for no-CBs CPUs: Reduce contention on
     ->nocb_lock guarding ->cblist.

   - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
     list to further reduce contention on ->nocb_lock guarding ->cblist.

   - Miscellaneous fixes.

   - Torture-test updates.

   - minor LKMM updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
  MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org
  rcu: Don't include <linux/ktime.h> in rcutiny.h
  rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
  rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
  rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
  rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
  rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
  rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
  rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
  rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
  rcu/nocb: Add bypass callback queueing
  rcu/nocb: Atomic ->len field in rcu_segcblist structure
  rcu/nocb: Unconditionally advance and wake for excessive CBs
  rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
  rcu/nocb: Reduce contention at no-CBs invocation-done time
  rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
  rcu/nocb: Round down for number of no-CBs grace-period kthreads
  rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
  rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
  rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
  ...
2019-09-16 16:28:19 -07:00
Linus Torvalds
d0a16fe934 Merge branch 'parisc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:

 - Make the powerpc implementation to read elf files available as a
   public kexec interface so it can be re-used on other architectures
   (Sven)

 - Implement kexec on parisc (Sven)

 - Add kprobes on ftrace on parisc (Sven)

 - Fix kernel crash with HSC-PCI cards based on card-mode Dino

 - Add assembly implementations for memset, strlen, strcpy, strncpy and
   strcat

 - Some cleanups, documentation updates, warning fixes, ...

* 'parisc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: (25 commits)
  parisc: Have git ignore generated real2.S and firmware.c
  parisc: Disable HP HSC-PCI Cards to prevent kernel crash
  parisc: add support for kexec_file_load() syscall
  parisc: wire up kexec_file_load syscall
  parisc: add kexec syscall support
  parisc: add __pdc_cpu_rendezvous()
  kprobes/parisc: remove arch_kprobe_on_func_entry()
  kexec_elf: support 32 bit ELF files
  kexec_elf: remove unused variable in kexec_elf_load()
  kexec_elf: remove Elf_Rel macro
  kexec_elf: remove PURGATORY_STACK_SIZE
  kexec_elf: remove parsing of section headers
  kexec_elf: change order of elf_*_to_cpu() functions
  kexec: add KEXEC_ELF
  parisc: Save some bytes in dino driver
  parisc: Drop comments which are already in pci.h
  parisc: Convert eisa_enumerator to use pr_cont()
  parisc: Avoid warning when loading hppb driver
  parisc: speed up flush_tlb_all_local with qemu
  parisc: Add ALTERNATIVE_CODE() and ALT_COND_RUN_ON_QEMU
  ...
2019-09-16 15:38:31 -07:00
Linus Torvalds
76f0f227cf ia64 for v5.4 - big change here is removal of support for SGI Altix
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJdf64MAAoJEKurIx+X31iBB20P/07o93sBT92SiA2/ety9sLqV
 BGJmEdw7gyb9WVbUip6s71FIEKZw4foCGkqDiX+lr5Fw2A9tiK7LmFgTLi4LLwg+
 syhYZ1y5/mwBI4FLlJudKjQdFZjr/n7DNlz4H67woE2kK+FyRsOKEaFUhuR8+0rC
 mKJBKtIGnoIOPG06PT1k5qfdpzlreCFoWdIhjO55LfDgZnnDiMaX5h0vcBQ9xgCp
 xGV0n/f7+qn4pzB4hGvNV209Sdgv2V4t77bHNvyXlJrM5Hqzafo5MzFgEJv+fRqJ
 2RnkWVhwctfbid/2ggf2aAsYnMK3GigEaOCsYW2oWJESVUQhxIi3ndF/Jt9fraZv
 ZouD7G/s64P5lUQuCT9JnKGzJrSgxvkd37049AZ4pFVc2MzLC6o6dyyP8pu5ARe8
 T0shFik3+gsml2US/vSUzxvrg1saRQjl9E/AJ0RTZ8oyP4FNnFmkJf38qj3a0L0k
 ILFYscM5q7WPggoDA/m6F96tLGhdK/sKjDzrADjEh2dIvn4woqoEJSDn+rXuP+Gm
 UOj1v8mILZCqvOAmc9IkGCkPUlbrmNV/1FYh5+GWudtillEaD82vjSqm+jnVbfXD
 REvHlR/kxCSj1gg/+nk+NFdZCkW3xETOcTZohhDkR7du2mHjTwBMZ2YRPrqoX4c8
 VZA57Mrqm5Uk5601qYRl
 =L5e+
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull ia64 updates from Tony Luck:
 "The big change here is removal of support for SGI Altix"

* tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: (33 commits)
  genirq: remove the is_affinity_mask_valid hook
  ia64: remove CONFIG_SWIOTLB ifdefs
  ia64: remove support for machvecs
  ia64: move the screen_info setup to common code
  ia64: move the ROOT_DEV setup to common code
  ia64: rework iommu probing
  ia64: remove the unused sn_coherency_id symbol
  ia64: remove the SGI UV simulator support
  ia64: remove the zx1 swiotlb machvec
  ia64: remove CONFIG_ACPI ifdefs
  ia64: remove CONFIG_PCI ifdefs
  ia64: remove the hpsim platform
  ia64: remove now unused machvec indirections
  ia64: remove support for the SGI SN2 platform
  drivers: remove the SGI SN2 IOC4 base support
  drivers: remove the SGI SN2 IOC3 base support
  qla2xxx: remove SGI SN2 support
  qla1280: remove SGI SN2 support
  misc/sgi-xp: remove SGI SN2 support
  char/mspec: remove SGI SN2 support
  ...
2019-09-16 15:32:01 -07:00
Linus Torvalds
e77fafe9af arm64 updates for 5.4:
- 52-bit virtual addressing in the kernel
 
 - New ABI to allow tagged user pointers to be dereferenced by syscalls
 
 - Early RNG seeding by the bootloader
 
 - Improve robustness of SMP boot
 
 - Fix TLB invalidation in light of recent architectural clarifications
 
 - Support for i.MX8 DDR PMU
 
 - Remove direct LSE instruction patching in favour of static keys
 
 - Function error injection using kprobes
 
 - Support for the PPTT "thread" flag introduced by ACPI 6.3
 
 - Move PSCI idle code into proper cpuidle driver
 
 - Relaxation of implicit I/O memory barriers
 
 - Build with RELR relocations when toolchain supports them
 
 - Numerous cleanups and non-critical fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl1yYREQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNAM3CAChqDFQkryXoHwdeEcaukMRVNxtxOi4pM4g
 5xqkb7PoqRJssIblsuhaXjrSD97yWCgaqCmFe6rKoes++lP4bFcTe22KXPPyPBED
 A+tK4nTuKKcZfVbEanUjI+ihXaHJmKZ/kwAxWsEBYZ4WCOe3voCiJVNO2fHxqg1M
 8TskZ2BoayTbWMXih0eJg2MCy/xApBq4b3nZG4bKI7Z9UpXiKN1NYtDh98ZEBK4V
 d/oNoHsJ2ZvIQsztoBJMsvr09DTCazCijWZiECadm6l41WEPFizngrACiSJLLtYo
 0qu4qxgg9zgFlvBCRQmIYSggTuv35RgXSfcOwChmW5DUjHG+f9GK
 =Ru4B
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "Although there isn't tonnes of code in terms of line count, there are
  a fair few headline features which I've noted both in the tag and also
  in the merge commits when I pulled everything together.

  The part I'm most pleased with is that we had 35 contributors this
  time around, which feels like a big jump from the usual small group of
  core arm64 arch developers. Hopefully they all enjoyed it so much that
  they'll continue to contribute, but we'll see.

  It's probably worth highlighting that we've pulled in a branch from
  the risc-v folks which moves our CPU topology code out to where it can
  be shared with others.

  Summary:

   - 52-bit virtual addressing in the kernel

   - New ABI to allow tagged user pointers to be dereferenced by
     syscalls

   - Early RNG seeding by the bootloader

   - Improve robustness of SMP boot

   - Fix TLB invalidation in light of recent architectural
     clarifications

   - Support for i.MX8 DDR PMU

   - Remove direct LSE instruction patching in favour of static keys

   - Function error injection using kprobes

   - Support for the PPTT "thread" flag introduced by ACPI 6.3

   - Move PSCI idle code into proper cpuidle driver

   - Relaxation of implicit I/O memory barriers

   - Build with RELR relocations when toolchain supports them

   - Numerous cleanups and non-critical fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (114 commits)
  arm64: remove __iounmap
  arm64: atomics: Use K constraint when toolchain appears to support it
  arm64: atomics: Undefine internal macros after use
  arm64: lse: Make ARM64_LSE_ATOMICS depend on JUMP_LABEL
  arm64: asm: Kill 'asm/atomic_arch.h'
  arm64: lse: Remove unused 'alt_lse' assembly macro
  arm64: atomics: Remove atomic_ll_sc compilation unit
  arm64: avoid using hard-coded registers for LSE atomics
  arm64: atomics: avoid out-of-line ll/sc atomics
  arm64: Use correct ll/sc atomic constraints
  jump_label: Don't warn on __exit jump entries
  docs/perf: Add documentation for the i.MX8 DDR PMU
  perf/imx_ddr: Add support for AXI ID filtering
  arm64: kpti: ensure patched kernel text is fetched from PoU
  arm64: fix fixmap copy for 16K pages and 48-bit VA
  perf/smmuv3: Validate groups for global filtering
  perf/smmuv3: Validate group size
  arm64: Relax Documentation/arm64/tagged-pointers.rst
  arm64: kvm: Replace hardcoded '1' with SYS_PAR_EL1_F
  arm64: mm: Ignore spurious translation faults taken from the kernel
  ...
2019-09-16 14:31:40 -07:00
Linus Torvalds
52a5525214 IOMMU Updates for Linux v5.4:
Including:
 
 	- Batched unmap support for the IOMMU-API
 
 	- Support for unlocked command queueing in the ARM-SMMU driver
 
 	- Rework the ATS support in the ARM-SMMU driver
 
 	- More refactoring in the ARM-SMMU driver to support hardware
 	  implemention specific quirks and errata
 
 	- Bounce buffering DMA-API implementatation in the Intel VT-d driver
 	  for untrusted devices (like Thunderbolt devices)
 
 	- Fixes for runtime PM support in the OMAP iommu driver
 
 	- MT8183 IOMMU support in the Mediatek IOMMU driver
 
 	- Rework of the way the IOMMU core sets the default domain type for
 	  groups. Changing the default domain type on x86 does not require two
 	  kernel parameters anymore.
 
 	- More smaller fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAl1/pdoACgkQK/BELZcB
 GuMvCw/+K1GPyyZbPWAuXcnclraSZTHXS1lV0yilBXXyT2omFRQpRJYZGN/8NTbE
 SqD2FtzTKGuGSy2jA0drd3RcMKK/zZsFYnJShiM3FHLXatZdaFrnkK7vRHuzKlHf
 dvOlH7gHKtjIPPXodUEb0xd/oRAEIVsKjJyq1fBMARPPAluhU7mIFUI/xbGvX17g
 LM00hIxEhVNsSPemU2kTVISNBPVneecNVLlKXySjp0YPW/+sf8R7tTvwlSXX6h3I
 JM6wOU479O8mBvIcpAjfZlanHCHtqLk0ybaPx666DjdgYx6cUBHbDCF0P57XnGJA
 HNeVGtBwGQb8VWgbPLJKrStSOzYudDG8ndctqfKYN7uiPDjYM2/sqXcwQSVXR9vX
 AjT2s0GFEWT/AJhgBSeg9PJilEX1hPtomGKcQhKfR0wRGycixeZJFbwToQqzJrZN
 7XoORbZPH1S5W6sjXsXH3eVPW3EGnKipulJSPGDqFLa2aIUG+PXuu/A+iJS6sADh
 mqzVfcEs3/NYsro40eA/iQc0t99ftJXgpX18KxYprjyL6VWcwC/xeHcT/Zw9abxz
 r7dYDGTR0z6RIew0GOaeZVdZJh/J6yraKCNDS0ARZgol6JPaU7HGHhh6Ohdmmu8L
 Gtsgdxp4NeLEgp4mQiRvvpQ5pPJ/YR+oCOx3v+PPnKRLhMTxymQ=
 =UF08
 -----END PGP SIGNATURE-----

Merge tag 'iommu-updates-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu updates from Joerg Roedel:

 - batched unmap support for the IOMMU-API

 - support for unlocked command queueing in the ARM-SMMU driver

 - rework the ATS support in the ARM-SMMU driver

 - more refactoring in the ARM-SMMU driver to support hardware
   implemention specific quirks and errata

 - bounce buffering DMA-API implementatation in the Intel VT-d driver
   for untrusted devices (like Thunderbolt devices)

 - fixes for runtime PM support in the OMAP iommu driver

 - MT8183 IOMMU support in the Mediatek IOMMU driver

 - rework of the way the IOMMU core sets the default domain type for
   groups. Changing the default domain type on x86 does not require two
   kernel parameters anymore.

 - more smaller fixes and cleanups

* tag 'iommu-updates-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (113 commits)
  iommu/vt-d: Declare Broadwell igfx dmar support snafu
  iommu/vt-d: Add Scalable Mode fault information
  iommu/vt-d: Use bounce buffer for untrusted devices
  iommu/vt-d: Add trace events for device dma map/unmap
  iommu/vt-d: Don't switch off swiotlb if bounce page is used
  iommu/vt-d: Check whether device requires bounce buffer
  swiotlb: Split size parameter to map/unmap APIs
  iommu/omap: Mark pm functions __maybe_unused
  iommu/ipmmu-vmsa: Disable cache snoop transactions on R-Car Gen3
  iommu/ipmmu-vmsa: Move IMTTBCR_SL0_TWOBIT_* to restore sort order
  iommu: Don't use sme_active() in generic code
  iommu/arm-smmu-v3: Fix build error without CONFIG_PCI_ATS
  iommu/qcom: Use struct_size() helper
  iommu: Remove wrong default domain comments
  iommu/dma: Fix for dereferencing before null checking
  iommu/mediatek: Clean up struct mtk_smi_iommu
  memory: mtk-smi: Get rid of need_larbid
  iommu/mediatek: Fix VLD_PA_RNG register backup when suspend
  memory: mtk-smi: Add bus_sel for mt8183
  memory: mtk-smi: Invoke pm runtime_callback to enable clocks
  ...
2019-09-16 14:14:40 -07:00
Linus Torvalds
c17112a5c4 core-process-v5.4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXXe8mQAKCRCRxhvAZXjc
 ou7oAQCszihkNfpjORSSSOqenMDrxxDW++A7TIOLuq7UyZQl8QD+LM1wvT/xypfJ
 ORD9XX8+Wrv07AQn85fZBEFXGrnengk=
 =o+VL
 -----END PGP SIGNATURE-----

Merge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd/waitid updates from Christian Brauner:
 "This contains two features and various tests.

  First, it adds support for waiting on process through pidfds by adding
  the P_PIDFD type to the waitid() syscall. This completes the basic
  functionality of the pidfd api (cf. [1]). In the meantime we also have
  a new adition to the userspace projects that make use of the pidfd
  api. The qt project was nice enough to send a mail pointing out that
  they have a pr up to switch to the pidfd api (cf. [2]).

  Second, this tag contains an extension to the waitid() syscall to make
  it possible to wait on the current process group in a race free manner
  (even though the actual problem is very unlikely) by specifing 0
  together with the P_PGID type. This extension traces back to a
  discussion on the glibc development mailing list.

  There are also a range of tests for the features above. Additionally,
  the test-suite which detected the pidfd-polling race we fixed in [3]
  is included in this tag"

[1] https://lwn.net/Articles/794707/
[2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
[3] commit b191d6491b ("pidfd: fix a poll race when setting exit_state")

* tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  waitid: Add support for waiting for the current process group
  tests: add pidfd poll tests
  tests: move common definitions and functions into pidfd.h
  pidfd: add pidfd_wait tests
  pidfd: add P_PIDFD to waitid()
2019-09-16 09:28:19 -07:00
David S. Miller
28f2c362db Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-09-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Now that initial BPF backend for gcc has been merged upstream, enable
   BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
   bpf_sysctl.file_pos, from Ilya.

2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
   related to recent work on exposing BTF info through sysfs, from Andrii.

3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
   headroom to be added twice, from Ciara.

4) Refactoring work to convert sock opt tests into test_progs framework
   in BPF kselftests, from Stanislav.

5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.

6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 16:02:03 +02:00
Ingo Molnar
563c4f85f9 Merge branch 'sched/rt' into sched/core, to pick up -rt changes
Pick up the first couple of patches working towards PREEMPT_RT.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-16 14:05:04 +02:00
Petr Mladek
ae88de56a1 Merge branch 'for-5.4' into for-linus 2019-09-16 12:54:25 +02:00
Ilya Leoshkevich
d895a0f16f bpf: fix accessing bpf_sysctl.file_pos on s390
"ctx:file_pos sysctl:read write ok" fails on s390 with "Read value  !=
nux". This is because verifier rewrites a complete 32-bit
bpf_sysctl.file_pos update to a partial update of the first 32 bits of
64-bit *bpf_sysctl_kern.ppos, which is not correct on big-endian
systems.

Fix by using an offset on big-endian systems.

Ditto for bpf_sysctl.file_pos reads. Currently the test does not detect
a problem there, since it expects to see 0, which it gets with high
probability in error cases, so change it to seek to offset 3 and expect
3 in bpf_sysctl.file_pos.

Fixes: e1550bfe0d ("bpf: Add file_pos field to bpf_sysctl ctx")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20190816105300.49035-1-iii@linux.ibm.com/
2019-09-16 11:44:05 +02:00
Toke Høiland-Jørgensen
af58e7ee6a xdp: Fix race in dev_map_hash_update_elem() when replacing element
syzbot found a crash in dev_map_hash_update_elem(), when replacing an
element with a new one. Jesper correctly identified the cause of the crash
as a race condition between the initial lookup in the map (which is done
before taking the lock), and the removal of the old element.

Rather than just add a second lookup into the hashmap after taking the
lock, fix this by reworking the function logic to take the lock before the
initial lookup.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-16 10:19:51 +02:00
Johannes Berg
786b2384bf um: Enable CONFIG_CONSTRUCTORS
We do need to call the constructors for *modules*, and
at least for KASAN in the future, we must call even the
kernel constructors only later when the kernel has been
initialized.

Instead of relying on libc to call them, emit an empty
section for libc and let the kernel's CONSTRUCTORS code
do the rest of the job.

Tested that it indeed doesn't work in modules, and does
work after the fixes in both, with a few functions with
__attribute__((constructor)) in both dynamic and static
builds.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2019-09-15 21:37:13 +02:00
David S. Miller
aa2eaa8c27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes in the btusb and ixgbe drivers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-15 14:17:27 +02:00
Linus Torvalds
36024fcf8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Don't corrupt xfrm_interface parms before validation, from Nicolas
    Dichtel.

 2) Revert use of usb-wakeup in btusb, from Mario Limonciello.

 3) Block ipv6 packets in bridge netfilter if ipv6 is disabled, from
    Leonardo Bras.

 4) IPS_OFFLOAD not honored in ctnetlink, from Pablo Neira Ayuso.

 5) Missing ULP check in sock_map, from John Fastabend.

 6) Fix receive statistic handling in forcedeth, from Zhu Yanjun.

 7) Fix length of SKB allocated in 6pack driver, from Christophe
    JAILLET.

 8) ip6_route_info_create() returns an error pointer, not NULL. From
    Maciej Żenczykowski.

 9) Only add RDS sock to the hashes after rs_transport is set, from
    Ka-Cheong Poon.

10) Don't double clean TX descriptors in ixgbe, from Ilya Maximets.

11) Presence of transmit IPSEC offload in an SKB is not tested for
    correctly in ixgbe and ixgbevf. From Steffen Klassert and Jeff
    Kirsher.

12) Need rcu_barrier() when register_netdevice() takes one of the
    notifier based failure paths, from Subash Abhinov Kasiviswanathan.

13) Fix leak in sctp_do_bind(), from Mao Wenan.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
  cdc_ether: fix rndis support for Mediatek based smartphones
  sctp: destroy bucket if failed to bind addr
  sctp: remove redundant assignment when call sctp_get_port_local
  sctp: change return type of sctp_get_port_local
  ixgbevf: Fix secpath usage for IPsec Tx offload
  sctp: Fix the link time qualifier of 'sctp_ctrlsock_exit()'
  ixgbe: Fix secpath usage for IPsec TX offload.
  net: qrtr: fix memort leak in qrtr_tun_write_iter
  net: Fix null de-reference of device refcount
  ipv6: Fix the link time qualifier of 'ping_v6_proc_exit_net()'
  tun: fix use-after-free when register netdev failed
  tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
  ixgbe: fix double clean of Tx descriptors with xdp
  ixgbe: Prevent u8 wrapping of ITR value to something less than 10us
  mlx4: fix spelling mistake "veify" -> "verify"
  net: hns3: fix spelling mistake "undeflow" -> "underflow"
  net: lmc: fix spelling mistake "runnin" -> "running"
  NFC: st95hf: fix spelling mistake "receieve" -> "receive"
  net/rds: An rds_sock is added too early to the hash table
  mac80211: Do not send Layer 2 Update frame before authorization
  ...
2019-09-14 12:20:38 -07:00
Daniel Jordan
c51636a306 padata: remove cpu_index from the parallel_queue
With the removal of the ENODATA case from padata_get_next, the cpu_index
field is no longer useful, so it can go away.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:41 +10:00
Daniel Jordan
bfde23ce20 padata: unbind parallel jobs from specific CPUs
Padata binds the parallel part of a job to a single CPU and round-robins
over all CPUs in the system for each successive job.  Though the serial
parts rely on per-CPU queues for correct ordering, they're not necessary
for parallel work, and it improves performance to run the job locally on
NUMA machines and let the scheduler pick the CPU within a node on a busy
system.

So, make the parallel workqueue unbound.

Update the parallel workqueue's cpumask when the instance's parallel
cpumask changes.

Now that parallel jobs no longer run on max_active=1 workqueues, two or
more parallel works that hash to the same CPU may run simultaneously,
finish out of order, and so be serialized out of order.  Prevent this by
keeping the works sorted on the reorder list by sequence number and
checking that in the reordering logic.

padata_get_next becomes padata_find_next so it can be reused for the end
of padata_reorder, where it's used to avoid uselessly queueing work when
the next job by sequence number isn't finished yet but a later job that
hashed to the same CPU has.

The ENODATA case in padata_find_next no longer makes sense because
parallel jobs aren't bound to specific CPUs.  The EINPROGRESS case takes
care of the scenario where a parallel job is potentially running on the
same CPU as padata_find_next, and with only one error code left, just
use NULL instead.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
45d153c08b padata: use separate workqueues for parallel and serial work
padata currently uses one per-CPU workqueue per instance for all work.

Prepare for running parallel jobs on an unbound workqueue by introducing
dedicated workqueues for parallel and serial work.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
cc491d8e64 padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
With pcrypt's cpumask no longer used, take the CPU hotplug lock inside
padata_alloc_possible.

Useful later in the series for avoiding nested acquisition of the CPU
hotplug lock in padata when padata_alloc_possible is allocating an
unbound workqueue.

Without this patch, this nested acquisition would happen later in the
series:

      pcrypt_init_padata
        get_online_cpus
        alloc_padata_possible
          alloc_padata
            alloc_workqueue(WQ_UNBOUND)   // later in the series
              alloc_and_link_pwqs
                apply_wqattrs_lock
                  get_online_cpus         // recursive rwsem acquisition

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
e6ce0e0807 padata: make padata_do_parallel find alternate callback CPU
padata_do_parallel currently returns -EINVAL if the callback CPU isn't
in the callback cpumask.

pcrypt tries to prevent this situation by keeping its own callback
cpumask in sync with padata's and checks that the callback CPU it passes
to padata is valid.  Make padata handle this instead.

padata_do_parallel now takes a pointer to the callback CPU and updates
it for the caller if an alternate CPU is used.  Overall behavior in
terms of which callback CPUs are chosen stays the same.

Prepares for removal of the padata cpumask notifier in pcrypt, which
will fix a lockdep complaint about nested acquisition of the CPU hotplug
lock later in the series.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
509b320489 workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
Change the calling convention for apply_workqueue_attrs to require CPU
hotplug read exclusion.

Avoids lockdep complaints about nested calls to get_online_cpus in a
future patch where padata calls apply_workqueue_attrs when changing
other CPU-hotplug-sensitive data structures with the CPU read lock
already held.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:40 +10:00
Daniel Jordan
513c98d086 workqueue: unconfine alloc/apply/free_workqueue_attrs()
padata will use these these interfaces in a later patch, so unconfine them.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:39 +10:00
Daniel Jordan
b128a30409 padata: allocate workqueue internally
Move workqueue allocation inside of padata to prepare for further
changes to how padata uses workqueues.

Guarantees the workqueue is created with max_active=1, which padata
relies on to work correctly.  No functional change.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-crypto@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-13 21:15:39 +10:00
Miles Chen
4adcdcea71 sched/psi: Correct overly pessimistic size calculation
When passing a equal or more then 32 bytes long string to psi_write(),
psi_write() copies 31 bytes to its buf and overwrites buf[30]
with '\0'. Which makes the input string 1 byte shorter than
it should be.

Fix it by copying sizeof(buf) bytes when nbytes >= sizeof(buf).

This does not cause problems in normal use case like:
"some 500000 10000000" or "full 500000 10000000" because they
are less than 32 bytes in length.

	/* assuming nbytes == 35 */
	char buf[32];

	buf_size = min(nbytes, (sizeof(buf) - 1)); /* buf_size = 31 */
	if (copy_from_user(buf, user_buf, buf_size))
		return -EFAULT;

	buf[buf_size - 1] = '\0'; /* buf[30] = '\0' */

Before:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [   22.473497] nbytes=35,buf_size=31
 [   22.473775] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [   64.916162] nbytes=32,buf_size=31
 [   64.916331] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

After:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [  254.837863] nbytes=35,buf_size=32
 [  254.838541] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [ 9965.714935] nbytes=32,buf_size=32
 [ 9965.715096] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

Also remove the superfluous parentheses.

Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190912103452.13281-1-miles.chen@mediatek.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-13 07:49:28 +02:00
Quentin Perret
eb92692b25 sched/fair: Speed-up energy-aware wake-ups
EAS computes the energy impact of migrating a waking task when deciding
on which CPU it should run. However, the current approach is known to
have a high algorithmic complexity, which can result in prohibitively
high wake-up latencies on systems with complex energy models, such as
systems with per-CPU DVFS. On such systems, the algorithm complexity is
in O(n^2) (ignoring the cost of searching for performance states in the
EM) with 'n' the number of CPUs.

To address this, re-factor the EAS wake-up path to compute the energy
'delta' (with and without the task) on a per-performance domain basis,
rather than system-wide, which brings the complexity down to O(n).

No functional changes intended.

Test results
~~~~~~~~~~~~

* Setup: Tested on a Google Pixel 3, with a Snapdragon 845 (4+4 CPUs,
  A55/A75). Base kernel is 5.3-rc5 + Pixel3 specific patches. Android
  userspace, no graphics.

* Test case:  Run a periodic rt-app task, with 16ms period, ramping down
  from 70% to 10%, in 5% steps of 500 ms each (json avail. at [1]).
  Frequencies of all CPUs are pinned to max (using scaling_min_freq
  CPUFreq sysfs entries) to reduce variability. The time to run
  select_task_rq_fair() is measured using the function profiler
  (/sys/kernel/debug/tracing/trace_stat/function*). See the test script
  for more details [2].

Test 1:

I hacked the DT to 'fake' per-CPU DVFS. That is, we end up with one
CPUFreq policy per CPU (8 policies in total). Since all frequencies are
pinned to max for the test, this should have no impact on the actual
frequency selection, but it does in the EAS calculation.

      +---------------------------+----------------------------------+
      | Without patch             | With patch                       |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us)        | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
|  0  | 274 | 38.303   | 1750.239 | 401 | 14.126 (-63.1%) | 146.625  |
|  1  | 197 | 49.529   | 1695.852 | 314 | 16.135 (-67.4%) | 167.525  |
|  2  | 142 | 34.296   | 1758.665 | 302 | 14.133 (-58.8%) | 130.071  |
|  3  | 172 | 31.734   | 1490.975 | 641 | 14.637 (-53.9%) | 139.189  |
|  4  | 316 | 7.834    | 178.217  | 425 | 5.413  (-30.9%) | 20.803   |
|  5  | 447 | 8.424    | 144.638  | 556 | 5.929  (-29.6%) | 27.301   |
|  6  | 581 | 14.886   | 346.793  | 456 | 5.711  (-61.6%) | 23.124   |
|  7  | 456 | 10.005   | 211.187  | 997 | 4.708  (-52.9%) | 21.144   |
+-----+-----+----------+----------+-----+-----------------+----------+
             * Hit, Avg and s^2 are as reported by the function profiler

Test 2:
I also ran the same test with a normal DT, with 2 CPUFreq policies, to
see if this causes regressions in the most common case.

      +---------------------------+----------------------------------+
      | Without patch             | With patch                       |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us)        | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
|  0  | 345 | 22.184   | 215.321  | 580 | 18.635 (-16.0%) | 146.892  |
|  1  | 358 | 18.597   | 200.596  | 438 | 12.934 (-30.5%) | 104.604  |
|  2  | 359 | 25.566   | 200.217  | 397 | 10.826 (-57.7%) | 74.021   |
|  3  | 362 | 16.881   | 200.291  | 718 | 11.455 (-32.1%) | 102.280  |
|  4  | 457 | 3.822    | 9.895    | 757 | 4.616  (+20.8%) | 13.369   |
|  5  | 344 | 4.301    | 7.121    | 594 | 5.320  (+23.7%) | 18.798   |
|  6  | 472 | 4.326    | 7.849    | 464 | 5.648  (+30.6%) | 22.022   |
|  7  | 331 | 4.630    | 13.937   | 408 | 5.299  (+14.4%) | 18.273   |
+-----+-----+----------+----------+-----+-----------------+----------+
             * Hit, Avg and s^2 are as reported by the function profiler

In addition to these two tests, I also ran 50 iterations of the Lisa
EAS functional test suite [3] with this patch applied on Arm Juno r0,
Arm Juno r2, Arm TC2 and Hikey960, and could not see any regressions
(all EAS functional tests are passing).

 [1] https://paste.debian.net/1100055/
 [2] https://paste.debian.net/1100057/
 [3] https://github.com/ARM-software/lisa/blob/master/lisa/tests/scheduler/eas_behaviour.py

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: qais.yousef@arm.com
Cc: qperret@qperret.net
Cc: rjw@rjwysocki.net
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190912094404.13802-1-qperret@qperret.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-13 07:45:17 +02:00
Roman Gushchin
97a6136983 cgroup: freezer: fix frozen state inheritance
If a new child cgroup is created in the frozen cgroup hierarchy
(one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup
flag should be set. Otherwise if a process will be attached to the
child cgroup, it won't become frozen.

The problem can be reproduced with the test_cgfreezer_mkdir test.

This is the output before this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
  not ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

And with this patch:
  ~/test_freezer
  ok 1 test_cgfreezer_simple
  ok 2 test_cgfreezer_tree
  ok 3 test_cgfreezer_forkbomb
  ok 4 test_cgfreezer_mkdir
  ok 5 test_cgfreezer_rmdir
  ok 6 test_cgfreezer_migrate
  ok 7 test_cgfreezer_ptrace
  ok 8 test_cgfreezer_stopped
  ok 9 test_cgfreezer_ptraced
  ok 10 test_cgfreezer_vfork

Reported-by: Mark Crossen <mcrossen@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Fixes: 76f969e894 ("cgroup: cgroup v2 freezer")
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-09-12 14:04:45 -07:00
Linus Torvalds
98dcb386e5 for-linus-20190912
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXXpIKgAKCRCRxhvAZXjc
 om8yAQCIPJp2HWNsJRPRl9KVKRmR6MxItG1Hpj0MvgwzLEjufwD/SF9VAPgl2AmD
 oaAXrOYH4yhr9aaFlVuMpe2aWAZdxgU=
 =Tjvf
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190912' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux

Pull clone3 fix from Christian Brauner:
 "This is a last-minute bugfix for clone3() that should go in before we
  release 5.3 with clone3().

  clone3() did not verify that the exit_signal argument was set to a
  valid signal. This can be used to cause a crash by specifying a signal
  greater than NSIG. e.g. -1.

  The commit from Eugene adds a check to copy_clone_args_from_user() to
  verify that the exit signal is limited by CSIGNAL as with legacy
  clone() and that the signal is valid. With this we don't get the
  legacy clone behavior were an invalid signal could be handed down and
  would only be detected and then ignored in do_notify_parent(). Users
  of clone3() will now get a proper error right when they pass an
  invalid exit signal. Note, that this is not a change in user-visible
  behavior since no kernel with clone3() has been released yet"

* tag 'for-linus-20190912' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
  fork: block invalid exit signals with clone3()
2019-09-12 14:50:14 +01:00
Eugene Syromiatnikov
a0eb9abd8a
fork: block invalid exit signals with clone3()
Previously, higher 32 bits of exit_signal fields were lost when copied
to the kernel args structure (that uses int as a type for the respective
field). Moreover, as Oleg has noted, exit_signal is used unchecked, so
it has to be checked for sanity before use; for the legacy syscalls,
applying CSIGNAL mask guarantees that it is at least non-negative;
however, there's no such thing is done in clone3() code path, and that
can break at least thread_group_leader.

This commit adds a check to copy_clone_args_from_user() to verify that
the exit signal is limited by CSIGNAL as with legacy clone() and that
the signal is valid. With this we don't get the legacy clone behavior
were an invalid signal could be handed down and would only be detected
and ignored in do_notify_parent(). Users of clone3() will now get a
proper error when they pass an invalid exit signal. Note, that this is
not user-visible behavior since no kernel with clone3() has been
released yet.

The following program will cause a splat on a non-fixed clone3() version
and will fail correctly on a fixed version:

 #define _GNU_SOURCE
 #include <linux/sched.h>
 #include <linux/types.h>
 #include <sched.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/syscall.h>
 #include <sys/wait.h>
 #include <unistd.h>

 int main(int argc, char *argv[])
 {
        pid_t pid = -1;
        struct clone_args args = {0};
        args.exit_signal = -1;

        pid = syscall(__NR_clone3, &args, sizeof(struct clone_args));
        if (pid < 0)
                exit(EXIT_FAILURE);

        if (pid == 0)
                exit(EXIT_SUCCESS);

        wait(NULL);

        exit(EXIT_SUCCESS);
 }

Fixes: 7f192e3cd3 ("fork: add clone3")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Link: https://lore.kernel.org/r/4b38fa4ce420b119a4c6345f42fe3cec2de9b0b5.1568223594.git.esyr@redhat.com
[christian.brauner@ubuntu.com: simplify check and rework commit message]
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-09-12 14:56:33 +02:00
Linus Torvalds
6dcf6a4eb9 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
 "Fix an initialization bug in the hw-breakpoints, which triggered on
  the ARM platform"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/hw_breakpoint: Fix arch_hw_breakpoint use-before-initialization
2019-09-12 11:04:50 +01:00
Linus Torvalds
95779fe850 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
 "Fix a race in the IRQ resend mechanism, which can result in a NULL
  dereference crash"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Prevent NULL pointer dereference in resend_irqs()
2019-09-12 11:02:00 +01:00
Masahiro Yamada
b605be6581 module: remove unneeded casts in cmp_name()
You can pass opaque pointers directly.

I also renamed 'va' and 'vb' into more meaningful arguments.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-11 21:41:56 +02:00
Will Deacon
069e1c07c1 module: Fix link failure due to invalid relocation on namespace offset
Commit 8651ec01da ("module: add support for symbol namespaces.")
broke linking for arm64 defconfig:

  | lib/crypto/arc4.o: In function `__ksymtab_arc4_setkey':
  | arc4.c:(___ksymtab+arc4_setkey+0x8): undefined reference to `no symbol'
  | lib/crypto/arc4.o: In function `__ksymtab_arc4_crypt':
  | arc4.c:(___ksymtab+arc4_crypt+0x8): undefined reference to `no symbol'

This is because the dummy initialisation of the 'namespace_offset' field
in 'struct kernel_symbol' when using EXPORT_SYMBOL on architectures with
support for PREL32 locations uses an offset from an absolute address (0)
in an effort to trick 'offset_to_pointer' into behaving as a NOP,
allowing non-namespaced symbols to be treated in the same way as those
belonging to a namespace.

Unfortunately, place-relative relocations require a symbol reference
rather than an absolute value and, although x86 appears to get away with
this due to placing the kernel text at the top of the address space, it
almost certainly results in a runtime failure if the kernel is relocated
dynamically as a result of KASLR.

Rework 'namespace_offset' so that a value of 0, which cannot occur for a
valid namespaced symbol, indicates that the corresponding symbol does
not belong to a namespace.

Cc: Matthias Maennich <maennich@google.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Fixes: 8651ec01da ("module: add support for symbol namespaces.")
Reported-by: kbuild test robot <lkp@intel.com>
Tested-by: Matthias Maennich <maennich@google.com>
Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Matthias Maennich <maennich@google.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-11 18:53:30 +02:00
Joerg Roedel
e95adb9add Merge branches 'arm/omap', 'arm/exynos', 'arm/smmu', 'arm/mediatek', 'arm/qcom', 'arm/renesas', 'x86/amd', 'x86/vt-d' and 'core' into next 2019-09-11 12:39:19 +02:00
Lu Baolu
3fc1ca0065 swiotlb: Split size parameter to map/unmap APIs
This splits the size parameter to swiotlb_tbl_map_single() and
swiotlb_tbl_unmap_single() into an alloc_size and a mapping_size
parameter, where the latter one is rounded up to the iommu page
size.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2019-09-11 12:34:29 +02:00
Eric W. Biederman
821cc7b0b2
waitid: Add support for waiting for the current process group
It was recently discovered that the linux version of waitid is not a
superset of the other wait functions because it does not include support
for waiting for the current process group. This has two downsides:
1. An extra system call is needed to get the current process group.
2. After the current process group is received and before it is passed
   to waitid a signal could arrive causing the current process group to change.
   Inherent race-conditions as these make it impossible for userspace to
   emulate this functionaly and thus violate async-signal safety
   requirements for waitpid.

Arguments can be made for using a different choice of idtype and id
for this case but the BSDs already use this P_PGID and 0 to indicate
waiting for the current process's process group.  So be nice to user
space programmers and don't introduce an unnecessary incompatibility.

Some people have noted that the posix description is that
waitpid will wait for the current process group, and that in
the presence of pthreads that process group can change.  To get
clarity on this issue I looked at XNU, FreeBSD, and Luminos.  All of
those flavors of unix waited for the current process group at the
time of call and as written could not adapt to the process group
changing after the call.

At one point Linux did adapt to the current process group changing but
that stopped in 161550d74c ("pid: sys_wait... fixes").  It has been
over 11 years since Linux has that behavior, no programs that fail
with the change in behavior have been reported, and I could not
find any other unix that does this.  So I think it is safe to clarify
the definition of current process group, to current process group
at the time of the wait function.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Alistair Francis <alistair23@gmail.com>
Cc: Zong Li <zongbox@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Cc: GNU C Library <libc-alpha@sourceware.org>
Link: https://lore.kernel.org/r/20190814154400.6371-2-christian.brauner@ubuntu.com
2019-09-10 17:05:46 +02:00
Thomas Gleixner
77b4b54204 posix-cpu-timers: Fix permission check regression
The recent consolidation of the three permission checks introduced a subtle
regression. For timer_create() with a process wide timer it returns the
current task if the lookup through the PID which is encoded into the
clockid results in returning current.

That's broken because it does not validate whether the current task is the
group leader.

That was caused by the two different variants of permission checks:

  - posix_cpu_timer_get() allowed access to the process wide clock when the
    looked up task is current. That's not an issue because the process wide
    clock is in the shared sighand.

  - posix_cpu_timer_create() made sure that the looked up task is the group
    leader.

Restore the previous state.

Note, that these permission checks are more than questionable, but that's
subject to follow up changes.

Fixes: 6ae40e3fdc ("posix-cpu-timers: Provide task validation functions")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1909052314110.1902@nanos.tec.linutronix.de
2019-09-10 12:13:07 +01:00
Matthias Maennich
3d52ec5e5d module: add config option MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS
If MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is enabled (default=n), the
requirement for modules to import all namespaces that are used by
the module is relaxed.

Enabling this option effectively allows (invalid) modules to be loaded
while only a warning is emitted.

Disabling this option keeps the enforcement at module loading time and
loading is denied if the module's imports are not satisfactory.

Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-10 10:30:27 +02:00
Matthias Maennich
8651ec01da module: add support for symbol namespaces.
The EXPORT_SYMBOL_NS() and EXPORT_SYMBOL_NS_GPL() macros can be used to
export a symbol to a specific namespace.  There are no _GPL_FUTURE and
_UNUSED variants because these are currently unused, and I'm not sure
they are necessary.

I didn't add EXPORT_SYMBOL_NS() for ASM exports; this patch sets the
namespace of ASM exports to NULL by default. In case of relative
references, it will be relocatable to NULL. If there's a need, this
should be pretty easy to add.

A module that wants to use a symbol exported to a namespace must add a
MODULE_IMPORT_NS() statement to their module code; otherwise, modpost
will complain when building the module, and the kernel module loader
will emit an error and fail when loading the module.

MODULE_IMPORT_NS() adds a modinfo tag 'import_ns' to the module. That
tag can be observed by the modinfo command, modpost and kernel/module.c
at the time of loading the module.

The ELF symbols are renamed to include the namespace with an asm label;
for example, symbol 'usb_stor_suspend' in namespace USB_STORAGE becomes
'usb_stor_suspend.USB_STORAGE'.  This allows modpost to do namespace
checking, without having to go through all the effort of parsing ELF and
relocation records just to get to the struct kernel_symbols.

On x86_64 I saw no difference in binary size (compression), but at
runtime this will require a word of memory per export to hold the
namespace. An alternative could be to store namespaced symbols in their
own section and use a separate 'struct namespaced_kernel_symbol' for
that section, at the cost of making the module loader more complex.

Co-developed-by: Martijn Coenen <maco@android.com>
Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-10 10:30:17 +02:00
Matthias Maennich
c5e4a062fe module: support reading multiple values per modinfo tag
Similar to modpost's get_next_modinfo(), introduce get_next_modinfo() in
kernel/module.c to acquire any further values associated with the same
modinfo tag name. That is useful for any tags that have multiple
occurrences (such as 'alias'), but is in particular introduced here as
part of the symbol namespaces patch series to read the (potentially)
multiple namespaces a module is importing.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Martijn Coenen <maco@android.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Matthias Maennich <maennich@google.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-09-10 10:30:02 +02:00
Daniel Vetter
312364f353 kernel.h: Add non_block_start/end()
In some special cases we must not block, but there's not a spinlock,
preempt-off, irqs-off or similar critical section already that arms the
might_sleep() debug checks. Add a non_block_start/end() pair to annotate
these.

This will be used in the oom paths of mmu-notifiers, where blocking is not
allowed to make sure there's forward progress. Quoting Michal:

"The notifier is called from quite a restricted context - oom_reaper -
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well but we mostly do care about it to make a forward
progress. Checking for sleepable context is the best thing we could come
up with that would describe these demands at least partially."

Peter also asked whether we want to catch spinlocks on top, but Michal
said those are less of a problem because spinlocks can't have an indirect
dependency upon the page allocator and hence close the loop with the oom
reaper.

Suggested by Michal Hocko.

Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07 04:28:05 -03:00
Sven Schnelle
ea46a13ebf kexec_elf: support 32 bit ELF files
The powerpc version only supported 64 bit. Add some
code to switch decoding of fields during runtime so
we can kexec a 32 bit kernel from a 64 bit kernel and
vice versa.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:44 +02:00
Sven Schnelle
571ceb7d96 kexec_elf: remove unused variable in kexec_elf_load()
base was never assigned, so we can remove it.

Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:44 +02:00
Sven Schnelle
3bd9c3366e kexec_elf: remove Elf_Rel macro
It wasn't used anywhere, so lets drop it.

Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@stackframe.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
10ba459f87 kexec_elf: remove PURGATORY_STACK_SIZE
It's not used anywhere so just drop it.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
5f71d97720 kexec_elf: remove parsing of section headers
We're not using them, so we can drop the parsing.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
d34e0ad3ea kexec_elf: change order of elf_*_to_cpu() functions
Change the order to have a 64/32/16 order, no functional change.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
Sven Schnelle
175fca3bf9 kexec: add KEXEC_ELF
Right now powerpc provides an implementation to read elf files
with the kexec_file_load() syscall. Make that available as a public
kexec interface so it can be re-used on other architectures.

Signed-off-by: Sven Schnelle <svens@stackframe.org>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Helge Deller <deller@gmx.de>
2019-09-06 23:58:43 +02:00
David S. Miller
1e46c09ec1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add the ability to use unaligned chunks in the AF_XDP umem. By
   relaxing where the chunks can be placed, it allows to use an
   arbitrary buffer size and place whenever there is a free
   address in the umem. Helps more seamless DPDK AF_XDP driver
   integration. Support for i40e, ixgbe and mlx5e, from Kevin and
   Maxim.

2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
   application can wake up the kernel for rx/tx processing which
   avoids busy-spinning of the latter, useful when app and driver
   is located on the same core. Support for i40e, ixgbe and mlx5e,
   from Magnus and Maxim.

3) bpftool fixes for printf()-like functions so compiler can actually
   enforce checks, bpftool build system improvements for custom output
   directories, and addition of 'bpftool map freeze' command, from Quentin.

4) Support attaching/detaching XDP programs from 'bpftool net' command,
   from Daniel.

5) Automatic xskmap cleanup when AF_XDP socket is released, and several
   barrier/{read,write}_once fixes in AF_XDP code, from Björn.

6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
   inclusion as well as libbpf versioning improvements, from Andrii.

7) Several new BPF kselftests for verifier precision tracking, from Alexei.

8) Several BPF kselftest fixes wrt endianess to run on s390x, from Ilya.

9) And more BPF kselftest improvements all over the place, from Stanislav.

10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.

11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.

12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.

13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.

14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.

15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
    Peter, Wei, Yue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-06 16:49:17 +02:00
Thomas Gleixner
9cc5b7fba5 irqchip updates for Linux 5.4
- Large GICv3 updates to support new PPI and SPI ranges
 - Conver all alloc_fwnode() users to use PAs instead of VAs
 - Add support for Marvell's MMP3 irqchip
 - Add support for Amlogic Meson SM1
 - Various cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl1yLdwPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDVHgP+wTAtNzxF4nRsJO3ZlbKe7n2ut7toZvqip7U
 97Ll652m/24stioU7b4/Z92E1fWIQe6Bv7u5detNQ4bIg2uwD6ZDYMjLLOXsJXrQ
 XH7pEAVe2eO0rjZudodUteyPGZQAS1/okY/4qsYNDiWKHi5xQPwMBNJ6fVwOVXUk
 HG7ClMny1i7SSRm3CGmW3Hxb279DrHOxCGPO6aEX5UQ/uZErCTJN5oFdLfAKvfaU
 qRpzetq4QOO+L7UBer6+jmtAvA/kuYwpGSnhJXhFOLMSgrGLQbINvYeAyv2vKGvF
 l6HQNuEf+OQjR5RcUZi6qbAO/WEh2xzmxpqta8COwjBFDcwsITyc5wNRNEfmNqO9
 Yrh3/a/J0eWX6X4rmmJkMDEBQvNbyA1//9d5NIjPhIVERZYPa4Z2lb220P4+58r7
 xnCqZnxZO78BhUd7HXeCluuTJs83iHxs03nAHysU9YZWkiF7jDXE0d3kla8cGcdY
 aOaSID0Q52flRxLymWJs29kyXZU8rpiPTbxvi86Ggkvgrj2pSwb8PF8LL4WPkzgh
 BcuF2ca9oa22OlA3oLV/sKwFRS77en+bpzLeU0NjeAcu5b8AY2nSSYXiuvtxIzi3
 FbQCsBBl7i5F5AQT9XGugjeNsCRUsUvIOP4f4w2Ej6HmqJb/SJVMBedMBsjGoEFG
 sj93Nc7f
 =PsGr
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates for Linux 5.4 from Marc Zyngier:

 - Large GICv3 updates to support new PPI and SPI ranges
 - Conver all alloc_fwnode() users to use PAs instead of VAs
 - Add support for Marvell's MMP3 irqchip
 - Add support for Amlogic Meson SM1
 - Various cleanups and fixes
2019-09-06 13:03:39 +02:00
Mark-PK Tsai
310aa0a25b perf/hw_breakpoint: Fix arch_hw_breakpoint use-before-initialization
If we disable the compiler's auto-initialization feature, if
-fplugin-arg-structleak_plugin-byref or -ftrivial-auto-var-init=pattern
are disabled, arch_hw_breakpoint may be used before initialization after:

  9a4903dde2 ("perf/hw_breakpoint: Split attribute parse and commit")

On our ARM platform, the struct step_ctrl in arch_hw_breakpoint, which
used to be zero-initialized by kzalloc(), may be used in
arch_install_hw_breakpoint() without initialization.

Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alix Wu <alix.wu@mediatek.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Link: https://lkml.kernel.org/r/20190906060115.9460-1-mark-pk.tsai@mediatek.com
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06 08:24:01 +02:00
Ingo Molnar
9326011edf Merge branch 'x86/cleanups' into x86/cpu, to pick up dependent changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06 07:30:23 +02:00
Yunfeng Ye
eddf3e9c7c genirq: Prevent NULL pointer dereference in resend_irqs()
The following crash was observed:

  Unable to handle kernel NULL pointer dereference at 0000000000000158
  Internal error: Oops: 96000004 [#1] SMP
  pc : resend_irqs+0x68/0xb0
  lr : resend_irqs+0x64/0xb0
  ...
  Call trace:
   resend_irqs+0x68/0xb0
   tasklet_action_common.isra.6+0x84/0x138
   tasklet_action+0x2c/0x38
   __do_softirq+0x120/0x324
   run_ksoftirqd+0x44/0x60
   smpboot_thread_fn+0x1ac/0x1e8
   kthread+0x134/0x138
   ret_from_fork+0x10/0x18

The reason for this is that the interrupt resend mechanism happens in soft
interrupt context, which is a asynchronous mechanism versus other
operations on interrupts. free_irq() does not take resend handling into
account. Thus, the irq descriptor might be already freed before the resend
tasklet is executed. resend_irqs() does not check the return value of the
interrupt descriptor lookup and derefences the return value
unconditionally.

  1):
  __setup_irq
    irq_startup
      check_irq_resend  // activate softirq to handle resend irq
  2):
  irq_domain_free_irqs
    irq_free_descs
      free_desc
        call_rcu(&desc->rcu, delayed_free_desc)
  3):
  __do_softirq
    tasklet_action
      resend_irqs
        desc = irq_to_desc(irq)
        desc->handle_irq(desc)  // desc is NULL --> Ooops

Fix this by adding a NULL pointer check in resend_irqs() before derefencing
the irq descriptor.

Fixes: a4633adcdb ("[PATCH] genirq: add genirq sw IRQ-retrigger")
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1630ae13-5c8e-901e-de09-e740b6a426a7@huawei.com
2019-09-05 21:31:14 +02:00
Thadeu Lima de Souza Cascardo
f18ddc13af alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP
ENOTSUPP is not supposed to be returned to userspace. This was found on an
OpenPower machine, where the RTC does not support set_alarm.

On that system, a clock_nanosleep(CLOCK_REALTIME_ALARM, ...) results in
"524 Unknown error 524"

Replace it with EOPNOTSUPP which results in the expected "95 Operation not
supported" error.

Fixes: 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no RTC device is present)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190903171802.28314-1-cascardo@canonical.com
2019-09-05 21:19:26 +02:00
Zhengjun Xing
ac68154626 tracing: Add "gfp_t" support in synthetic_events
Add "gfp_t" support in synthetic_events, then the "gfp_t" type
parameter in some functions can be traced.

Prints the gfp flags as hex in addition to the human-readable flag
string.  Example output:

  whoopsie-630 [000] ...1 78.969452: testevent: bar=b20 (GFP_ATOMIC|__GFP_ZERO)
    rcuc/0-11  [000] ...1 81.097555: testevent: bar=a20 (GFP_ATOMIC)
    rcuc/0-11  [000] ...1 81.583123: testevent: bar=a20 (GFP_ATOMIC)

Link: http://lkml.kernel.org/r/20190712015308.9908-1-zhengjun.xing@linux.intel.com

Signed-off-by: Zhengjun Xing <zhengjun.xing@linux.intel.com>
[ Added printing of flag names ]
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-09-05 11:35:14 -04:00
Alexei Starovoitov
2339cd6cd0 bpf: fix precision tracking of stack slots
The problem can be seen in the following two tests:
0: (bf) r3 = r10
1: (55) if r3 != 0x7b goto pc+0
2: (7a) *(u64 *)(r3 -8) = 0
3: (79) r4 = *(u64 *)(r10 -8)
..
0: (85) call bpf_get_prandom_u32#7
1: (bf) r3 = r10
2: (55) if r3 != 0x7b goto pc+0
3: (7b) *(u64 *)(r3 -8) = r0
4: (79) r4 = *(u64 *)(r10 -8)

When backtracking need to mark R4 it will mark slot fp-8.
But ST or STX into fp-8 could belong to the same block of instructions.
When backtracing is done the parent state may have fp-8 slot
as "unallocated stack". Which will cause verifier to warn
and incorrectly reject such programs.

Writes into stack via non-R10 register are rare. llvm always
generates canonical stack spill/fill.
For such pathological case fall back to conservative precision
tracking instead of rejecting.

Reported-by: syzbot+c8d66267fd2b5955287e@syzkaller.appspotmail.com
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-05 14:06:58 +02:00
Sebastian Andrzej Siewior
5d2295f3a9 hrtimer: Add a missing bracket and hide `migration_base' on !SMP
The recent change to avoid taking the expiry lock when a timer is currently
migrated missed to add a bracket at the end of the if statement leading to
compile errors.  Since that commit the variable `migration_base' is always
used but it is only available on SMP configuration thus leading to another
compile error.  The changelog says "The timer base and base->cpu_base
cannot be NULL in the code path", so it is safe to limit this check to SMP
configurations only.

Add the missing bracket to the if statement and hide `migration_base'
behind CONFIG_SMP bars.

[ tglx: Mark the functions inline ... ]

Fixes: 68b2c8c1e4 ("hrtimer: Don't take expiry_lock when timer is currently migrated")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190904145527.eah7z56ntwobqm6j@linutronix.de
2019-09-05 10:39:06 +02:00
Masami Hiramatsu
e336b40277 kprobes: Prohibit probing on BUG() and WARN() address
Since BUG() and WARN() may use a trap (e.g. UD2 on x86) to
get the address where the BUG() has occurred, kprobes can not
do single-step out-of-line that instruction. So prohibit
probing on such address.

Without this fix, if someone put a kprobe on WARN(), the
kernel will crash with invalid opcode error instead of
outputing warning message, because kernel can not find
correct bug address.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N . Rao <naveen.n.rao@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/156750890133.19112.3393666300746167111.stgit@devnote2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-05 10:15:16 +02:00
Ingo Molnar
1251201c0d sched/core: Fix uclamp ABI bug, clean up and robustify sched_read_attr() ABI logic and code
Thadeu Lima de Souza Cascardo reported that 'chrt' broke on recent kernels:

  $ chrt -p $$
  chrt: failed to get pid 26306's policy: Argument list too long

and he has root-caused the bug to the following commit increasing sched_attr
size and breaking sched_read_attr() into returning -EFBIG:

  a509a7cd79 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")

The other, bigger bug is that the whole sched_getattr() and sched_read_attr()
logic of checking non-zero bits in new ABI components is arguably broken,
and pretty much any extension of the ABI will spuriously break the ABI.
That's way too fragile.

Instead implement the perf syscall's extensible ABI instead, which we
already implement on the sched_setattr() side:

 - if user-attributes have the same size as kernel attributes then the
   logic is unchanged.

 - if user-attributes are larger than the kernel knows about then simply
   skip the extra bits, but set attr->size to the (smaller) kernel size
   so that tooling can (in principle) handle older kernel as well.

 - if user-attributes are smaller than the kernel knows about then just
   copy whatever user-space can accept.

Also clean up the whole logic:

 - Simplify the code flow - there's no need for 'ret' for example.

 - Standardize on 'kattr/uattr' and 'ksize/usize' naming to make sure we
   always know which side we are dealing with.

 - Why is it called 'read' when what it does is to copy to user? This
   code is so far away from VFS read() semantics that the naming is
   actively confusing. Name it sched_attr_copy_to_user() instead, which
   mirrors other copy_to_user() functionality.

 - Move the attr->size assignment from the head of sched_getattr() to the
   sched_attr_copy_to_user() function. Nothing else within the kernel
   should care about the size of the structure.

With these fixes the sched_getattr() syscall now nicely supports an
extensible ABI in both a forward and backward compatible fashion, and
will also fix the chrt bug.

As an added bonus the bogus -EFBIG return is removed as well, which as
Thadeu noted should have been -E2BIG to begin with.

Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a509a7cd79 ("sched/uclamp: Extend sched_setattr() to support utilization clamping")
Link: https://lkml.kernel.org/r/20190904075532.GA26751@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-04 19:51:30 +02:00
Masahiro Yamada
858805b336 kbuild: add $(BASH) to run scripts with bash-extension
CONFIG_SHELL falls back to sh when bash is not installed on the system,
but nobody is testing such a case since bash is usually installed.
So, shell scripts invoked by CONFIG_SHELL are only tested with bash.

It makes it difficult to test whether the hashbang #!/bin/sh is real.
For example, #!/bin/sh in arch/powerpc/kernel/prom_init_check.sh is
false. (I fixed it up)

Besides, some shell scripts invoked by CONFIG_SHELL use bash-extension
and #!/bin/bash is specified as the hashbang, while CONFIG_SHELL may
not always be set to bash.

Probably, the right thing to do is to introduce BASH, which is bash by
default, and always set CONFIG_SHELL to sh. Replace $(CONFIG_SHELL)
with $(BASH) for bash scripts.

If somebody tries to add bash-extension to a #!/bin/sh script, it will
be caught in testing because /bin/sh is a symlink to dash on some major
distributions.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-09-04 22:54:13 +09:00
Christoph Hellwig
5cf4537975 dma-mapping: introduce a dma_common_find_pages helper
A helper to find the backing page array based on a virtual address.
This also ensures we do the same vm_flags check everywhere instead
of slightly different or missing ones in a few places.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:20 +02:00
Christoph Hellwig
512317401f dma-mapping: always use VM_DMA_COHERENT for generic DMA remap
Currently the generic dma remap allocator gets a vm_flags passed by
the caller that is a little confusing.  We just introduced a generic
vmalloc-level flag to identify the dma coherent allocations, so use
that everywhere and remove the now pointless argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:20 +02:00
Christoph Hellwig
249baa5479 dma-mapping: provide a better default ->get_required_mask
Most dma_map_ops instances are IOMMUs that work perfectly fine in 32-bits
of IOVA space, and the generic direct mapping code already provides its
own routines that is intelligent based on the amount of memory actually
present.  Wire up the dma-direct routine for the ARM direct mapping code
as well, and otherwise default to the constant 32-bit mask.  This way
we only need to override it for the occasional odd IOMMU that requires
64-bit IOVA support, or IOMMU drivers that are more efficient if they
can fall back to the direct mapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
d9295532d5 dma-mapping: remove the dma_declare_coherent_memory export
dma_declare_coherent_memory is something that the platform setup code
(which pretty much means the device tree these days) need to do so that
drivers can use the memory as declared by the platform.  Drivers
themselves have no business calling this function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
7a01ee4220 dma-mapping: remove the dma_mmap_from_dev_coherent export
dma_mmap_from_dev_coherent is only used by dma_map_ops instances,
none of which is modular.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
1fa0682448 dma-mapping: remove dma_release_declared_memory
This function is entirely unused given that declared memory is
generally provided by platform setup code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:19 +02:00
Christoph Hellwig
62fcee9a3b dma-mapping: remove CONFIG_ARCH_NO_COHERENT_DMA_MMAP
CONFIG_ARCH_NO_COHERENT_DMA_MMAP is now functionally identical to
!CONFIG_MMU, so remove the separate symbol.  The only difference is that
arm did not set it for !CONFIG_MMU, but arm uses a separate dma mapping
implementation including its own mmap method, which is handled by moving
the CONFIG_MMU check in dma_can_mmap so that is only applies to the
dma-direct case, just as the other ifdefs for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	# m68k
2019-09-04 11:13:18 +02:00
Christoph Hellwig
e29ccc188f dma-mapping: add a dma_can_mmap helper
Add a helper to check if DMA allocations for a specific device can be
mapped to userspace using dma_mmap_*.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:18 +02:00
Christoph Hellwig
f9f3232a7d dma-mapping: explicitly wire up ->mmap and ->get_sgtable
While the default ->mmap and ->get_sgtable implementations work for the
majority of our dma_map_ops impementations they are inherently safe
for others that don't use the page allocator or CMA and/or use their
own way of remapping not covered by the common code.  So remove the
defaults if these methods are not wired up, but instead wire up the
default implementations for all safe instances.

Fixes: e1c7e32453 ("dma-mapping: always provide the dma_map_ops based implementation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:18 +02:00
Christoph Hellwig
1445146701 dma-mapping: move the dma_get_sgtable API comments from arm to common code
The comments are spot on and should be near the central API, not just
near a single implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-04 11:13:17 +02:00
Nadav Amit
d8a050f5a3 kgdb: fix comment regarding static function
The comment that says that module_event() is not static is clearly
wrong.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-09-03 11:19:41 +01:00
Chuhong Yuan
635714312e kdb: Replace strncmp with str_has_prefix
strncmp(str, const, len) is error-prone.
We had better use newly introduced
str_has_prefix() instead of it.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-09-03 11:19:31 +01:00
Daniel Lezcano
82e430a6df cpuidle: play_idle: Increase the resolution to usec
The play_idle resolution is 1ms. The intel_powerclamp bases the idle
duration on jiffies. The idle injection API is also using msec based
duration but has no user yet.

Unfortunately, msec based time does not fit well when we want to
inject idle cycle precisely with shallow idle state.

In order to set the scene for the incoming idle injection user, move
the precision up to usec when calling play_idle.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-09-03 11:33:29 +02:00
Dexuan Cui
711419e504 irqdomain: Add the missing assignment of domain->fwnode for named fwnode
Recently device pass-through stops working for Linux VM running on Hyper-V.

git-bisect shows the regression is caused by the recent commit
467a3bb974 ("PCI: hv: Allocate a named fwnode ..."), but the root cause
is that the commit d59f6617ee forgets to set the domain->fwnode for
IRQCHIP_FWNODE_NAMED*, and as a result:

1. The domain->fwnode remains to be NULL.

2. irq_find_matching_fwspec() returns NULL since "h->fwnode == fwnode" is
false, and pci_set_bus_msi_domain() sets the Hyper-V PCI root bus's
msi_domain to NULL.

3. When the device is added onto the root bus, the device's dev->msi_domain
is set to NULL in pci_set_msi_domain().

4. When a device driver tries to enable MSI-X, pci_msi_setup_msi_irqs()
calls arch_setup_msi_irqs(), which uses the native MSI chip (i.e.
arch/x86/kernel/apic/msi.c: pci_msi_controller) to set up the irqs, but
actually pci_msi_setup_msi_irqs() is supposed to call
msi_domain_alloc_irqs() with the hbus->irq_domain, which is created in
hv_pcie_init_irq_domain() and is associated with the Hyper-V chip
hv_msi_irq_chip. Consequently, the irq line is not properly set up, and
the device driver can not receive any interrupt.

Fixes: d59f6617ee ("genirq: Allow fwnode to carry name information only")
Fixes: 467a3bb974 ("PCI: hv: Allocate a named fwnode instead of an address-based one")
Reported-by: Lili Deng <v-lide@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/PU1P153MB01694D9AF625AC335C600C5FBFBE0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM
2019-09-03 09:16:50 +01:00
Patrick Bellasi
0413d7f33e sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
The supported clamp indexes are defined in 'enum clamp_id', however, because
of the code logic in some of the first utilization clamping series version,
sometimes we needed to use 'unsigned int' to represent indices.

This is not more required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:40 +02:00
Patrick Bellasi
babbe170e0 sched/uclamp: Update CPU's refcount on TG's clamp changes
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-6-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:40 +02:00
Patrick Bellasi
3eac870a32 sched/uclamp: Use TG's clamps to restrict TASK's clamps
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. boosted not more than its TG effective protection and capped at
   least as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:39 +02:00
Patrick Bellasi
7274a5c1bb sched/uclamp: Propagate system defaults to the root group
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping supports already the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allows a parent cgroup to constraint (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update next time each
task will be enqueued.
Do that since we assume a more strict resource control is required when
cgroups are in use. This allows also to keep "effective" clamp values
updated in case we need to expose them to user-space.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:38 +02:00
Patrick Bellasi
0b60ba2dd3 sched/uclamp: Propagate parent clamps
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation to ensure all user-space write
never fails while still always tracking the most restrictive values.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:38 +02:00
Patrick Bellasi
2480c09313 sched/uclamp: Extend CPU's cgroup controller
The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better defined the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run at least at a
	      minimum frequency which corresponds to the uclamp.min
	      utilization

- uclamp.max: defines the maximum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run up to a
	      maximum frequency which corresponds to the uclamp.max
	      utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depends on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:37 +02:00
Matt Fleming
a55c7454a8 sched/topology: Improve load balancing on AMD EPYC systems
SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
for any sched domains with a NUMA distance greater than 2 hops
(RECLAIM_DISTANCE). The idea being that it's expensive to balance
across domains that far apart.

However, as is rather unfortunately explained in:

  commit 32e45ff43e ("mm: increase RECLAIM_DISTANCE to 30")

the value for RECLAIM_DISTANCE is based on node distance tables from
2011-era hardware.

Current AMD EPYC machines have the following NUMA node distances:

 node distances:
 node   0   1   2   3   4   5   6   7
   0:  10  16  16  16  32  32  32  32
   1:  16  10  16  16  32  32  32  32
   2:  16  16  10  16  32  32  32  32
   3:  16  16  16  10  32  32  32  32
   4:  32  32  32  32  10  16  16  16
   5:  32  32  32  32  16  10  16  16
   6:  32  32  32  32  16  16  10  16
   7:  32  32  32  32  16  16  16  10

where 2 hops is 32.

The result is that the scheduler fails to load balance properly across
NUMA nodes on different sockets -- 2 hops apart.

For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
(CPUs 32-39) like so,

  $ numactl -C 0-7,32-39 ./spinner 16

causes all threads to fork and remain on node 0 until the active
balancer kicks in after a few seconds and forcibly moves some threads
to node 4.

Override node_reclaim_distance for AMD Zen.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee.Suthikulpanit@amd.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas.Lendacky@amd.com
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 09:17:37 +02:00
Liangyan
5e2d2cc258 sched/fair: Don't assign runtime for throttled cfs_rq
do_sched_cfs_period_timer() will refill cfs_b runtime and call
distribute_cfs_runtime to unthrottle cfs_rq, sometimes cfs_b->runtime
will allocate all quota to one cfs_rq incorrectly, then other cfs_rqs
attached to this cfs_b can't get runtime and will be throttled.

We find that one throttled cfs_rq has non-negative
cfs_rq->runtime_remaining and cause an unexpetced cast from s64 to u64
in snippet:

  distribute_cfs_runtime() {
    runtime = -cfs_rq->runtime_remaining + 1;
  }

The runtime here will change to a large number and consume all
cfs_b->runtime in this cfs_b period.

According to Ben Segall, the throttled cfs_rq can have
account_cfs_rq_runtime called on it because it is throttled before
idle_balance, and the idle_balance calls update_rq_clock to add time
that is accounted to the task.

This commit prevents cfs_rq to be assgined new runtime if it has been
throttled until that distribute_cfs_runtime is called.

Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: shanpeic@linux.alibaba.com
Cc: stable@vger.kernel.org
Cc: xlpang@linux.alibaba.com
Fixes: d3d9dc3302 ("sched: Throttle entities exceeding their allowed bandwidth")
Link: https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.peng@linux.alibaba.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-03 08:55:07 +02:00
Yoshihiro Shimoda
6ba99411b8 dma-mapping: introduce dma_get_merge_boundary()
This patch adds a new DMA API "dma_get_merge_boundary". This function
returns the DMA merge boundary if the DMA layer can merge the segments.
This patch also adds the implementation for a new dma_map_ops pointer.

Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-09-03 08:33:01 +02:00
David S. Miller
765b7590c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
r8152 conflicts are the NAPI fixes in 'net' overlapping with
some tasklet stuff in net-next

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-02 11:20:17 -07:00
Ingo Molnar
e98db89489 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-02 09:12:21 +02:00
Linus Torvalds
345464fb76 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix some length checks during OGM processing in batman-adv, from
    Sven Eckelmann.

 2) Fix regression that caused netfilter conntrack sysctls to not be
    per-netns any more. From Florian Westphal.

 3) Use after free in netpoll, from Feng Sun.

 4) Guard destruction of pfifo_fast per-cpu qdisc stats with
    qdisc_is_percpu_stats(), from Davide Caratti. Similar bug is fixed
    in pfifo_fast_enqueue().

 5) Fix memory leak in mld_del_delrec(), from Eric Dumazet.

 6) Handle neigh events on internal ports correctly in nfp, from John
    Hurley.

 7) Clear SKB timestamp in NF flow table code so that it does not
    confuse fq scheduler. From Florian Westphal.

 8) taprio destroy can crash if it is invoked in a failure path of
    taprio_init(), because the list head isn't setup properly yet and
    the list del is unconditional. Perform the list add earlier to
    address this. From Vladimir Oltean.

 9) Make sure to reapply vlan filters on device up, in aquantia driver.
    From Dmitry Bogdanov.

10) sgiseeq driver releases DMA memory using free_page() instead of
    dma_free_attrs(). From Christophe JAILLET.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  net: seeq: Fix the function used to release some memory in an error handling path
  enetc: Add missing call to 'pci_free_irq_vectors()' in probe and remove functions
  net: bcmgenet: use ethtool_op_get_ts_info()
  tc-testing: don't hardcode 'ip' in nsPlugin.py
  net: dsa: microchip: add KSZ8563 compatibility string
  dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
  net: aquantia: fix out of memory condition on rx side
  net: aquantia: linkstate irq should be oneshot
  net: aquantia: reapply vlan filters on up
  net: aquantia: fix limit of vlan filters
  net: aquantia: fix removal of vlan 0
  net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
  taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
  taprio: Fix kernel panic in taprio_destroy
  net: dsa: microchip: fill regmap_config name
  rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
  net: stmmac: dwmac-rk: Don't fail if phy regulator is absent
  amd-xgbe: Fix error path in xgbe_mod_init()
  netfilter: nft_meta_bridge: Fix get NFT_META_BRI_IIFVPROTO in network byteorder
  mac80211: Correctly set noencrypt for PAE frames
  ...
2019-09-01 18:45:28 -07:00
Steven Rostedt (VMware)
a47b53e95a tracing: Rename tracing_reset() to tracing_reset_cpu()
The name tracing_reset() was a misnomer, as it really only reset a single
CPU buffer. Rename it to tracing_reset_cpu() and also make it static and
remove the prototype from trace.h, as it is only used in a single function.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:40 -04:00
Steven Rostedt (VMware)
58fe7a87db tracing: Document the stack trace algorithm in the comments
As the max stack tracer algorithm is not that easy to understand from the
code, add comments that explain the algorithm and mentions how
ARCH_FTRACE_SHIFT_STACK_TRACER affects it.

Link: http://lkml.kernel.org/r/20190806123455.487ac02b@gandalf.local.home

Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:40 -04:00
Steven Rostedt (VMware)
f7edb451fa tracing/arm64: Have max stack tracer handle the case of return address after data
Most archs (well at least x86) store the function call return address on the
stack before storing the local variables for the function. The max stack
tracer depends on this in its algorithm to display the stack size of each
function it finds in the back trace.

Some archs (arm64), may store the return address (from its link register)
just before calling a nested function. There's no reason to save the link
register on leaf functions, as it wont be updated. This breaks the algorithm
of the max stack tracer.

Add a new define ARCH_FTRACE_SHIFT_STACK_TRACER that an architecture may set
if it stores the return address (link register) after it stores the
function's local variables, and have the stack trace shift the values of the
mapped stack size to the appropriate functions.

Link: 20190802094103.163576-1-jiping.ma2@windriver.com

Reported-by: Jiping Ma <jiping.ma2@windriver.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:40 -04:00
Masami Hiramatsu
a42e3c4de9 tracing/probe: Add immediate string parameter support
Add immediate string parameter (\"string") support to
probe events. This allows you to specify an immediate
(or dummy) parameter instead of fetching a string from
memory.

This feature looks odd, but imagine that you put a probe
on a code to trace some string data. If the code is
compiled into 2 instructions and 1 instruction has a
string on memory but other has no string since it is
optimized out. In that case, you can not fold those into
one event, even if ftrace supported multiple probes on
one event. With this feature, you can set a dummy string
like foo=\"(optimized)":string instead of something
like foo=+0(+0(%bp)):string.

Link: http://lkml.kernel.org/r/156095691687.28024.13372712423865047991.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:39 -04:00
Masami Hiramatsu
6218bf9f4d tracing/probe: Add immediate parameter support
Add immediate value parameter (\1234) support to
probe events. This allows you to specify an immediate
(or dummy) parameter instead of fetching from memory
or register.

This feature looks odd, but imagine when you put a probe
on a code to trace some data. If the code is compiled into
2 instructions and 1 instruction has a value but other has
nothing since it is optimized out.
In that case, you can not fold those into one event, even
if ftrace supported multiple probes on one event.
With this feature, you can set a dummy value like
foo=\deadbeef instead of something like foo=%di.

Link: http://lkml.kernel.org/r/156095690733.28024.13258186548822649469.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:39 -04:00
Masami Hiramatsu
ab10d69eb7 tracing/uprobe: Add per-probe delete from event
Add per-probe delete method from one event passing the head of
definition. In other words, the events which match the head
N parameters are deleted.

Link: http://lkml.kernel.org/r/156095689811.28024.221706761151739433.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:39 -04:00
Masami Hiramatsu
eb5bf81330 tracing/kprobe: Add per-probe delete from event
Allow user to delete a probe from event. This is done by head
match. For example, if we have 2 probes on an event

$ cat kprobe_events
p:kprobes/testprobe _do_fork r1=%ax r2=%dx
p:kprobes/testprobe idle_fork r1=%ax r2=%cx

Then you can remove one of them by passing the head of definition
which identify the probe.

$ echo "-:kprobes/testprobe idle_fork" >> kprobe_events

Link: http://lkml.kernel.org/r/156095688848.28024.15798690082378432435.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
41af3cf587 tracing/uprobe: Add multi-probe per uprobe event support
Allow user to define several probes on one uprobe event.
Note that this only support appending method. So deleting
event will delete all probes on the event.

Link: http://lkml.kernel.org/r/156095687876.28024.13840331032234992863.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
ca89bc071d tracing/kprobe: Add multi-probe per event support
Add multi-probe per one event support to kprobe events.
User can define several different probes on one trace event
if those events have same "event signature",
e.g.

  # echo p:testevent _do_fork > kprobe_events
  # echo p:testevent fork_idle >> kprobe_events
  # kprobe_events
  p:kprobes/testevent _do_fork
  p:kprobes/testevent fork_idle

The event signature is defined by kprobe type (retprobe or not),
the number of args, argument names, and argument types.

Note that this only support appending method. Delete event
operation will delete all probes on the event.

Link: http://lkml.kernel.org/r/156095686913.28024.9357292202316540742.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
30199137c8 tracing/dynevent: Pass extra arguments to match operation
Pass extra arguments to match operation for checking
exact match. If the event doesn't support exact match,
it will be ignored.

Link: http://lkml.kernel.org/r/156095685930.28024.10405547027475590975.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
cb8e7a8d55 tracing/dynevent: Delete all matched events
When user gives an event name to delete, delete all
matched events instead of the first one.

This means if there are several events which have same
name but different group (subsystem) name, those are
removed if user passed only the event name, e.g.

  # cat kprobe_events
  p:group1/testevent _do_fork
  p:group2/testevent fork_idle
  # echo -:testevent >> kprobe_events
  # cat kprobe_events
  #

Link: http://lkml.kernel.org/r/156095684958.28024.16597826267117453638.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
60d53e2c3b tracing/probe: Split trace_event related data from trace_probe
Split the trace_event related data from trace_probe data structure
and introduce trace_probe_event data structure for its folder.
This trace_probe_event data structure can have multiple trace_probe.

Link: http://lkml.kernel.org/r/156095683995.28024.7552150340561557873.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:38 -04:00
Masami Hiramatsu
0bc11ed5ab kprobes: Allow kprobes coexist with livepatch
Allow kprobes which do not modify regs->ip, coexist with livepatch
by dropping FTRACE_OPS_FL_IPMODIFY from ftrace_ops.

User who wants to modify regs->ip (e.g. function fault injection)
must set a dummy post_handler to its kprobes when registering.
However, if such regs->ip modifying kprobes is set on a function,
that function can not be livepatched.

Link: http://lkml.kernel.org/r/156403587671.30117.5233558741694155985.stgit@devnote2

Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 12:19:10 -04:00
Linus Torvalds
95381debd9 Small fixes and minor cleanups for Tracing
- Make exported ftrace function not static
  - Fix NULL pointer dereference in reading probes as they are created
  - Fix NULL pointer dereference in k/uprobe clean up path
  - Various documentation fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXWpTNRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qpXtAPsGoHDHkgPIyl9bnV0oZfwLrAl4qEyg
 RpVp9ZMcG4UtMwEAp/SXRFzvL+EUiKyd1U3FZy2jhVec3+hX7SzIGqgONA4=
 =ee8V
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Small fixes and minor cleanups for tracing:

   - Make exported ftrace function not static

   - Fix NULL pointer dereference in reading probes as they are created

   - Fix NULL pointer dereference in k/uprobe clean up path

   - Various documentation fixes"

* tag 'trace-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Correct kdoc formats
  ftrace/x86: Remove mcount() declaration
  tracing/probe: Fix null pointer dereference
  tracing: Make exported ftrace_set_clr_event non-static
  ftrace: Check for successful allocation of hash
  ftrace: Check for empty hash and comment the race with registering probes
  ftrace: Fix NULL pointer dereference in t_probe_next()
2019-08-31 09:15:25 -07:00
Jakub Kicinski
c68c9ec1c5 tracing: Correct kdoc formats
Fix the following kdoc warnings:

kernel/trace/trace.c:1579: warning: Function parameter or member 'tr' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'tsk' not described in 'update_max_tr_single'
kernel/trace/trace.c:1579: warning: Function parameter or member 'cpu' not described in 'update_max_tr_single'
kernel/trace/trace.c:1776: warning: Function parameter or member 'type' not described in 'register_tracer'
kernel/trace/trace.c:2239: warning: Function parameter or member 'task' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2239: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo'
kernel/trace/trace.c:2269: warning: Function parameter or member 'prev' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'next' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:2269: warning: Function parameter or member 'flags' not described in 'tracing_record_taskinfo_sched_switch'
kernel/trace/trace.c:3078: warning: Function parameter or member 'ip' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'fmt' not described in 'trace_vbprintk'
kernel/trace/trace.c:3078: warning: Function parameter or member 'args' not described in 'trace_vbprintk'

Link: http://lkml.kernel.org/r/20190828052549.2472-2-jakub.kicinski@netronome.com

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 06:51:56 -04:00
Xinpeng Liu
19a58ce1dc tracing/probe: Fix null pointer dereference
BUG: KASAN: null-ptr-deref in trace_probe_cleanup+0x8d/0xd0
Read of size 8 at addr 0000000000000000 by task syz-executor.0/9746
trace_probe_cleanup+0x8d/0xd0
free_trace_kprobe.part.14+0x15/0x50
alloc_trace_kprobe+0x23e/0x250

Link: http://lkml.kernel.org/r/1565220563-980-1-git-send-email-danielliu861@gmail.com

Fixes: e3dc9f898e ("tracing/probe: Add trace_event_call accesses APIs")
Signed-off-by: Xinpeng Liu <danielliu861@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 06:51:55 -04:00
Denis Efremov
595a438c78 tracing: Make exported ftrace_set_clr_event non-static
The function ftrace_set_clr_event is declared static and marked
EXPORT_SYMBOL_GPL(), which is at best an odd combination. Because the
function was decided to be a part of API, this commit removes the static
attribute and adds the declaration to the header.

Link: http://lkml.kernel.org/r/20190704172110.27041-1-efremov@linux.com

Fixes: f45d1225ad ("tracing: Kernel access to Ftrace instances")
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-31 06:51:49 -04:00
David S. Miller
94880a5b2e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-08-31

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix 32-bit zero-extension during constant blinding which
   has been causing a regression on ppc64, from Naveen.

2) Fix a latency bug in nfp driver when updating stack index
   register, from Jiong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 17:39:37 -07:00
Naveen N. Rao
5b0022dd32 ftrace: Check for successful allocation of hash
In register_ftrace_function_probe(), we are not checking the return
value of alloc_and_copy_ftrace_hash(). The subsequent call to
ftrace_match_records() may end up dereferencing the same. Add a check to
ensure this doesn't happen.

Link: http://lkml.kernel.org/r/26e92574f25ad23e7cafa3cf5f7a819de1832cbe.1562249521.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: 1ec3a81a0c ("ftrace: Have each function probe use its own ftrace_ops")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30 16:49:07 -04:00
Steven Rostedt (VMware)
372e0d01da ftrace: Check for empty hash and comment the race with registering probes
The race between adding a function probe and reading the probes that exist
is very subtle. It needs a comment. Also, the issue can also happen if the
probe has has the EMPTY_HASH as its func_hash.

Cc: stable@vger.kernel.org
Fixes: 7b60f3d876 ("ftrace: Dynamically create the probe ftrace_ops for the trace_array")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30 16:30:01 -04:00
Naveen N. Rao
7bd46644ea ftrace: Fix NULL pointer dereference in t_probe_next()
LTP testsuite on powerpc results in the below crash:

  Unable to handle kernel paging request for data at address 0x00000000
  Faulting instruction address: 0xc00000000029d800
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  ...
  CPU: 68 PID: 96584 Comm: cat Kdump: loaded Tainted: G        W
  NIP:  c00000000029d800 LR: c00000000029dac4 CTR: c0000000001e6ad0
  REGS: c0002017fae8ba10 TRAP: 0300   Tainted: G        W
  MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28022422  XER: 20040000
  CFAR: c00000000029d90c DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
  ...
  NIP [c00000000029d800] t_probe_next+0x60/0x180
  LR [c00000000029dac4] t_mod_start+0x1a4/0x1f0
  Call Trace:
  [c0002017fae8bc90] [c000000000cdbc40] _cond_resched+0x10/0xb0 (unreliable)
  [c0002017fae8bce0] [c0000000002a15b0] t_start+0xf0/0x1c0
  [c0002017fae8bd30] [c0000000004ec2b4] seq_read+0x184/0x640
  [c0002017fae8bdd0] [c0000000004a57bc] sys_read+0x10c/0x300
  [c0002017fae8be30] [c00000000000b388] system_call+0x5c/0x70

The test (ftrace_set_ftrace_filter.sh) is part of ftrace stress tests
and the crash happens when the test does 'cat
$TRACING_PATH/set_ftrace_filter'.

The address points to the second line below, in t_probe_next(), where
filter_hash is dereferenced:
  hash = iter->probe->ops.func_hash->filter_hash;
  size = 1 << hash->size_bits;

This happens due to a race with register_ftrace_function_probe(). A new
ftrace_func_probe is created and added into the func_probes list in
trace_array under ftrace_lock. However, before initializing the filter,
we drop ftrace_lock, and re-acquire it after acquiring regex_lock. If
another process is trying to read set_ftrace_filter, it will be able to
acquire ftrace_lock during this window and it will end up seeing a NULL
filter_hash.

Fix this by just checking for a NULL filter_hash in t_probe_next(). If
the filter_hash is NULL, then this probe is just being added and we can
simply return from here.

Link: http://lkml.kernel.org/r/05e021f757625cbbb006fad41380323dbe4e3b43.1562249521.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: 7b60f3d876 ("ftrace: Dynamically create the probe ftrace_ops for the trace_array")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-08-30 16:23:47 -04:00
Will Deacon
61b7cddfe8 Merge branch 'for-next/atomics' into for-next/core
* for-next/atomics: (10 commits)
  Rework LSE instruction selection to use static keys instead of alternatives
2019-08-30 12:55:39 +01:00
Michael Ellerman
07aa1e786d Merge branch 'topic/mem-encrypt' into next
This branch has some cross-arch patches that are a prequisite for the
SVM work. They're in a topic branch in case any of the other arch
maintainers want to merge them to resolve conflicts.
2019-08-30 09:49:28 +10:00
Christoph Hellwig
8e3a68fb55 dma-mapping: make dma_atomic_pool_init self-contained
The memory allocated for the atomic pool needs to have the same
mapping attributes that we use for remapping, so use
pgprot_dmacoherent instead of open coding it.  Also deduct a
suitable zone to allocate the memory from based on the presence
of the DMA zones.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-29 16:43:33 +02:00
Christoph Hellwig
419e2f1838 dma-mapping: remove arch_dma_mmap_pgprot
arch_dma_mmap_pgprot is used for two things:

 1) to override the "normal" uncached page attributes for mapping
    memory coherent to devices that can't snoop the CPU caches
 2) to provide the special DMA_ATTR_WRITE_COMBINE semantics on older
    arm systems and some mips platforms

Replace one with the pgprot_dmacoherent macro that is already provided
by arm and much simpler to use, and lift the DMA_ATTR_WRITE_COMBINE
handling to common code with an explicit arch opt-in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	# m68k
Acked-by: Paul Burton <paul.burton@mips.com>		# mips
2019-08-29 16:43:22 +02:00
Andrew Murray
8f35eaa5f2 jump_label: Don't warn on __exit jump entries
On architectures that discard .exit.* sections at runtime, a
warning is printed for each jump label that is used within an
in-kernel __exit annotated function:

can't patch jump_label at ehci_hcd_cleanup+0x8/0x3c
WARNING: CPU: 0 PID: 1 at kernel/jump_label.c:410 __jump_label_update+0x12c/0x138

As these functions will never get executed (they are free'd along
with the rest of initmem) - we do not need to patch them and should
not display any warnings.

The warning is displayed because the test required to satisfy
jump_entry_is_init is based on init_section_contains (__init_begin to
__init_end) whereas the test in __jump_label_update is based on
init_kernel_text (_sinittext to _einittext) via kernel_text_address).

Fixes: 1948367768 ("jump_label: Annotate entries that operate on __init code earlier")
Signed-off-by: Andrew Murray <andrew.murray@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-29 15:10:10 +01:00
Thomas Gleixner
a2ed4fd685 posix-cpu-timers: Make expiry_active check actually work correctly
The state tracking changes broke the expiry active check by not writing to
it and instead sitting timers_active, which is already set.

That's not a big issue as the actual expiry is protected by sighand lock,
so concurrent handling is not possible. That means that the second task
which invokes that function executes the expiry code for nothing.

Write to the proper flag.

Also add a check whether the flag is set into check_process_timers(). That
check had been missing in the code before the rework already. The check for
another task handling the expiry of process wide timers was only done in
the fastpath check. If the fastpath check returns true because a per task
timer expired, then the checking of process wide timers was done in
parallel which is as explained above just a waste of cycles.

Fixes: 244d49e306 ("posix-cpu-timers: Move state tracking to struct posix_cputimers")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
2019-08-29 12:52:28 +02:00
Linus Torvalds
9cf6b756cd arm64 fixes for -rc7
- Fix GICv2 emulation bug (KVM)
 
 - Fix deadlock in virtual GIC interrupt injection code (KVM)
 
 - Fix kprobes blacklist init failure due to broken kallsyms lookup
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl1mW/0ACgkQt6xw3ITB
 YzQWEQgAnG+20c1uzPqIEME3o0bCXsci0I1w4yPYPH/NVrfmJdAjfway/cLZlMCh
 JYKBTc392kfYXgzAvMqvjNe+9O/qg9MnMxV7Afjfk9e0k+41Kcw/7U39/B6q4yOg
 zIz4YW6SdbbbUfY9VgR8SjsZ3HwJoP8YVNEv3W259funtFHkKDO35h1iG1f57GWL
 UELjEnag9dmCH35ykt7+SC4woBGjOQnic2XLSS7VZuU/26TfnUyo2HCdDcLcViCZ
 DHDQgaHAscb+tgJjeNtNEeRu+GPBG9LOhPw4EpBjGHeoJla7z74GOJPMEt4Ufoqd
 qG9TPfOFAFM30/PAdlWF3SjfzqK3PA==
 =WYzY
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "Hot on the heels of our last set of fixes are a few more for -rc7.

  Two of them are fixing issues with our virtual interrupt controller
  implementation in KVM/arm, while the other is a longstanding but
  straightforward kallsyms fix which was been acked by Masami and
  resolves an initialisation failure in kprobes observed on arm64.

   - Fix GICv2 emulation bug (KVM)

   - Fix deadlock in virtual GIC interrupt injection code (KVM)

   - Fix kprobes blacklist init failure due to broken kallsyms lookup"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  KVM: arm/arm64: vgic-v2: Handle SGI bits in GICD_I{S,C}PENDR0 as WI
  KVM: arm/arm64: vgic: Fix potential deadlock when ap_list is long
  kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol
2019-08-28 10:37:21 -07:00
Sebastian Andrzej Siewior
71fed982d6 tick: Mark sched_timer to expire in hard interrupt context
sched_timer must be initialized with the _HARD mode suffix to ensure expiry
in hard interrupt context on RT.

The previous conversion to HARD expiry mode missed on one instance in
tick_nohz_switch_to_nohz(). Fix it up.

Fixes: 902a9f9c50 ("tick: Mark tick related hrtimers to expiry in hard interrupt context")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190823113845.12125-3-bigeasy@linutronix.de
2019-08-28 13:01:26 +02:00
Ming Lei
101f85b56d genirq/affinity: Remove const qualifier from node_to_cpumask argument
When CONFIG_CPUMASK_OFFSTACK isn't enabled, 'cpumask_var_t' is as

'typedef struct cpumask cpumask_var_t[1]',

so the argument 'node_to_cpumask' alloc_nodes_vectors() can't be declared
as 'const cpumask_var_t *'

Fixes the following warning:

   kernel/irq/affinity.c: In function '__irq_build_affinity_masks':
     alloc_nodes_vectors(numvecs, node_to_cpumask, cpu_mask,
                                  ^
   kernel/irq/affinity.c:128:13: note: expected 'const struct cpumask (*)[1]' but argument is of type 'struct cpumask (*)[1]'
    static void alloc_nodes_vectors(unsigned int numvecs,
                ^
Fixes: b1a5a73e64 ("genirq/affinity: Spread vectors on node according to nr_cpu ratio")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190828085815.19931-1-ming.lei@redhat.com
2019-08-28 12:20:43 +02:00
Thomas Gleixner
60bda037f1 posix-cpu-timers: Utilize timerqueue for storage
Using a linear O(N) search for timer insertion affects execution time and
D-cache footprint badly with a larger number of timers.

Switch the storage to a timerqueue which is already used for hrtimers and
alarmtimers. It does not affect the size of struct k_itimer as it.alarm is
still larger.

The extra list head for the expiry list will go away later once the expiry
is moved into task work context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908272129220.1939@nanos.tec.linutronix.de
2019-08-28 11:50:43 +02:00
Thomas Gleixner
244d49e306 posix-cpu-timers: Move state tracking to struct posix_cputimers
Put it where it belongs and clean up the ifdeffery in fork completely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner
8991afe264 posix-cpu-timers: Deduplicate rlimit handling
Both thread and process expiry functions have the same functionality for
sending signals for soft and hard RLIMITs duplicated in 4 different
ways.

Split it out into a common function and cleanup the callsites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.653276779@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner
dd67022413 posix-cpu-timers: Remove pointless comparisons
The soft RLIMIT expiry code checks whether the soft limit is greater than
the hard limit. That's pointless because if the soft RLIMIT is greater than
the hard RLIMIT then that code cannot be reached as the hard RLIMIT check
is before that and already killed the process.

Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.548747613@linutronix.de
2019-08-28 11:50:42 +02:00
Thomas Gleixner
8ea1de90a5 posix-cpu-timers: Get rid of 64bit divisions
Instead of dividing A to match the units of B it's more efficient to
multiply B to match the units of A.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.458286860@linutronix.de
2019-08-28 11:50:41 +02:00
Thomas Gleixner
1cd07c0b94 posix-cpu-timers: Consolidate timer expiry further
With the array based samples and expiry cache, the expiry function can use
a loop to collect timers from the clock specific lists.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.365469982@linutronix.de
2019-08-28 11:50:41 +02:00
Thomas Gleixner
2bbdbdae05 posix-cpu-timers: Get rid of zero checks
Deactivation of the expiry cache is done by setting all clock caches to
0. That requires to have a check for zero in all places which update the
expiry cache:

	if (cache == 0 || new < cache)
		cache = new;

Use U64_MAX as the deactivated value, which allows to remove the zero
checks when updating the cache and reduces it to the obvious check:

	if (new < cache)
		cache = new;

This also removes the weird workaround in do_prlimit() which was required
to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing
in 0 to the posix CPU timer code would have effectively disarmed it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de
2019-08-28 11:50:40 +02:00
Thomas Gleixner
24db4dd90d rlimit: Rewrite non-sensical RLIMIT_CPU comment
The comment above the function which arms RLIMIT_CPU in the posix CPU timer
code makes no sense at all. It claims that the kernel does not return an
error code when it rejected the attempt to set RLIMIT_CPU. That's clearly
bogus as the code does an error check and the rlimit is only set and
activated when the permission checks are ok. In case of a rejection an
appropriate error code is returned.

This is a historical and outdated comment which got dragged along even when
the rlimit handling code was rewritten.

Replace it with an explanation why the setup function is not called when
the rlimit value is RLIM_INFINITY and how the 'disarming' is handled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.185511287@linutronix.de
2019-08-28 11:50:40 +02:00
Thomas Gleixner
fe0517f893 posix-cpu-timers: Respect INFINITY for hard RTTIME limit
The RTIME limit expiry code does not check the hard RTTIME limit for
INFINITY, i.e. being disabled.  Add it.

While this could be considered an ABI breakage if something would depend on
this behaviour. Though it's highly unlikely to have an effect because
RLIM_INFINITY is at minimum INT_MAX and the RTTIME limit is in seconds, so
the timer would fire after ~68 years.

Adding this obvious correct limit check also allows further consolidation
of that code and is a prerequisite for cleaning up the 0 based checks and
the rlimit setter code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.078293002@linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner
b7be4ef136 posix-cpu-timers: Switch thread group sampling to array
That allows more simplifications in various places.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.988426956@linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner
87dc64480f posix-cpu-timers: Restructure expiry array
Now that the abused struct task_cputime is gone, it's more natural to
bundle the expiry cache and the list head of each clock into a struct and
have an array of those structs.

Follow the hrtimer naming convention of 'bases' and rename the expiry cache
to 'nextevt' and adapt all usage sites.

Generates also better code .text size shrinks by 80 bytes.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908262021140.1939@nanos.tec.linutronix.de
2019-08-28 11:50:39 +02:00
Thomas Gleixner
46b883995c posix-cpu-timers: Remove cputime_expires
The last users of the magic struct cputime based expiry cache are
gone. Remove the leftovers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.790209622@linutronix.de
2019-08-28 11:50:38 +02:00
Thomas Gleixner
001f797143 posix-cpu-timers: Make expiry checks array based
The expiry cache is an array indexed by clock ids. The new sample functions
allow to retrieve a corresponding array of samples.

Convert the fastpath expiry checks to make use of the new sample functions
and do the comparisons on the sample and the expiry array.

Make the check for the expiry array being zero array based as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.695481430@linutronix.de
2019-08-28 11:50:38 +02:00
Thomas Gleixner
b0d524f779 posix-cpu-timers: Provide array based sample functions
Instead of using task_cputime and doing the addition of utime and stime at
all call sites, it's way simpler to have a sample array which allows
indexed based checks against the expiry cache array.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.590362974@linutronix.de
2019-08-28 11:50:38 +02:00
Thomas Gleixner
c02b078e63 posix-cpu-timers: Switch check_*_timers() to array cache
Use the array based expiry cache in check_thread_timers() and convert the
store in check_process_timers() for consistency.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.408222378@linutronix.de
2019-08-28 11:50:36 +02:00
Thomas Gleixner
1b0dd96d0f posix-cpu-timers: Simplify set_process_cpu_timer()
The expiry cache can now be accessed as an array. Replace the per clock
checks with a simple comparison of the clock indexed array member.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.303316423@linutronix.de
2019-08-28 11:50:36 +02:00
Thomas Gleixner
3b495b22d0 posix-cpu-timers: Simplify timer queueing
Now that the expiry cache can be accessed as an array, the per clock
checking can be reduced to just comparing the corresponding array elements.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.212129449@linutronix.de
2019-08-28 11:50:36 +02:00
Thomas Gleixner
11b8462f7e posix-cpu-timers: Provide array based access to expiry cache
Using struct task_cputime for the expiry cache is a pretty odd choice and
comes with magic defines to rename the fields for usage in the expiry
cache.

struct task_cputime is basically a u64 array with 3 members, but it has
distinct members.

The expiry cache content is different than the content of task_cputime
because

  expiry[PROF]  = task_cputime.stime + task_cputime.utime
  expiry[VIRT]  = task_cputime.utime
  expiry[SCHED] = task_cputime.sum_exec_runtime

So there is no direct mapping between task_cputime and the expiry cache and
the #define based remapping is just a horrible hack.

Having the expiry cache array based allows further simplification of the
expiry code.

To avoid an all in one cleanup which is hard to review add a temporary
anonymous union into struct task_cputime which allows array based access to
it. That requires to reorder the members. Add a build time sanity check to
validate that the members are at the same place.

The union and the build time checks will be removed after conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.105793824@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner
3a245c0f11 posix-cpu-timers: Move expiry cache into struct posix_cputimers
The expiry cache belongs into the posix_cputimers container where the other
cpu timers information is.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de
2019-08-28 11:50:35 +02:00
Thomas Gleixner
2b69942f90 posix-cpu-timers: Create a container struct
Per task/process data of posix CPU timers is all over the place which
makes the code hard to follow and requires ifdeffery.

Create a container to hold all this information in one place, so data is
consolidated and the ifdeffery can be confined to the posix timer header
file and removed from places like fork.

As a first step, move the cpu_timers list head array into the new struct
and clean up the initializers and simplify fork. The remaining #ifdef in
fork will be removed later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de
2019-08-28 11:50:33 +02:00
Thomas Gleixner
ab693c5a5e posix-cpu-timers: Move prof/virt_ticks into caller
The functions have only one caller left. No point in having them.

Move the almost duplicated code into the caller and simplify it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.729298382@linutronix.de
2019-08-28 11:50:33 +02:00
Thomas Gleixner
0476ff2c15 posix-cpu-timers: Sample task times once in expiry check
Sampling the task times twice does not make sense. Do it once.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.639878168@linutronix.de
2019-08-28 11:50:32 +02:00
Thomas Gleixner
8c2d74f037 posix-cpu-timers: Get rid of pointer indirection
Now that the sample functions have no return value anymore, the result can
simply be returned instead of using pointer indirection.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.535079278@linutronix.de
2019-08-28 11:50:32 +02:00
Thomas Gleixner
2092c1d4fe posix-cpu-timers: Simplify sample functions
All callers hand in a valdiated clock id. Remove the return value which was
unchecked in most places anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.430475832@linutronix.de
2019-08-28 11:50:31 +02:00
Thomas Gleixner
5405d0051f posix-cpu-timers: Remove pointless return value check
set_process_cpu_timer() checks already whether the clock id is valid. No
point in checking the return value of the sample function. That allows to
simplify the sample function later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.339725769@linutronix.de
2019-08-28 11:50:30 +02:00
Thomas Gleixner
da020ce406 posix-cpu-timers: Use clock ID in posix_cpu_timer_rearm()
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it
as argument to the sample functions. That allows to simplify them once all
callers are fixed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.245357769@linutronix.de
2019-08-28 11:50:29 +02:00
Thomas Gleixner
99093c5b81 posix-cpu-timers: Use clock ID in posix_cpu_timer_get()
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it
as argument to the sample functions. That allows to simplify them once all
callers are fixed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.155487201@linutronix.de
2019-08-28 11:50:29 +02:00
Thomas Gleixner
c7a37c6f4c posix-cpu-timers: Use clock ID in posix_cpu_timer_set()
Extract the clock ID (PROF/VIRT/SCHED) from the clock selector and use it
as argument to the sample functions. That allows to simplify them once all
callers are fixed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192920.050770464@linutronix.de
2019-08-28 11:50:28 +02:00
Thomas Gleixner
24ab7f5a7b posix-cpu-timers: Consolidate thread group sample code
cpu_clock_sample_group() and cpu_timer_sample_group() are almost the
same. Before the rename one called thread_group_cputimer() and the other
thread_group_cputime(). Really intuitive function names.

Consolidate the functions and also avoid the thread traversal when
the thread group's accounting is already active.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.960966884@linutronix.de
2019-08-28 11:50:28 +02:00
Thomas Gleixner
c506bef424 posix-cpu-timers: Rename thread_group_cputimer() and make it static
thread_group_cputimer() is a complete misnomer. The function does two things:

 - For arming process wide timers it makes sure that the atomic time
   storage is up to date. If no cpu timer is armed yet, then the atomic
   time storage is not updated by the scheduler for performance reasons.

   In that case a full summing up of all threads needs to be done and the
   update needs to be enabled.

- Samples the current time into the caller supplied storage.

Rename it to thread_group_start_cputime(), make it static and fixup the
callsite.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.869350319@linutronix.de
2019-08-28 11:50:27 +02:00
Thomas Gleixner
a324956fae posix-cpu-timers: Sample directly in timer check
The thread group accounting is active, otherwise the expiry function would
not be running. Sample the thread group time directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.780348088@linutronix.de
2019-08-28 11:50:27 +02:00
Thomas Gleixner
a34360d424 itimers: Use quick sample function
get_itimer() locks sighand lock and checks whether the timer is already
expired. If it is not expired then the thread group cputime accounting is
already enabled. Use the sampling function not the one which is meant for
starting a timer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.689713638@linutronix.de
2019-08-28 11:50:26 +02:00
Thomas Gleixner
19298fbf45 posix-cpu-timers: Provide quick sample function for itimer
get_itimer() needs a sample of the current thread group cputime. It invokes
thread_group_cputimer() - which is a misnomer. That function also starts
eventually the group cputime accouting which is bogus because the
accounting is already active when a timer is armed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.599658199@linutronix.de
2019-08-28 11:50:26 +02:00
Thomas Gleixner
e5a8b65b4c posix-cpu-timers: Use common permission check in posix_cpu_timer_create()
Yet another copy of the same thing gone...

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.505833418@linutronix.de
2019-08-28 11:50:25 +02:00
Thomas Gleixner
bfcf3e92c6 posix-cpu-timers: Use common permission check in posix_cpu_clock_get()
Replace the next slightly different copy of permission checks. That also
removes the necessarity to check the return value of the sample functions
because the clock id is already validated.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192919.414813172@linutronix.de
2019-08-28 11:50:25 +02:00
Thomas Gleixner
6ae40e3fdc posix-cpu-timers: Provide task validation functions
The code contains three slightly different copies of validating whether a
given clock resolves to a valid task and whether the current caller has
permissions to access it.

Create central functions. Replace check_clock() as a first step and rename
it to something sensible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821192919.326097175@linutronix.de
2019-08-28 11:50:24 +02:00
Alexander Shishkin
ab43762ef0 perf: Allow normal events to output AUX data
In some cases, ordinary (non-AUX) events can generate data for AUX events.
For example, PEBS events can come out as records in the Intel PT stream
instead of their usual DS records, if configured to do so.

One requirement for such events is to consistently schedule together, to
ensure that the data from the "AUX output" events isn't lost while their
corresponding AUX event is not scheduled. We use grouping to provide this
guarantee: an "AUX output" event can be added to a group where an AUX event
is a group leader, and provided that the former supports writing to the
latter.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: kan.liang@linux.intel.com
Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com
2019-08-28 11:29:38 +02:00
Douglas RAILLARD
77c84dd188 sched/cpufreq: Align trace event behavior of fast switching
Fast switching path only emits an event for the CPU of interest, whereas the
regular path emits an event for all the CPUs that had their frequency changed,
i.e. all the CPUs sharing the same policy.

With the current behavior, looking at cpu_frequency event for a given CPU that
is using the fast switching path will not give the correct frequency signal.

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-28 11:26:46 +02:00
Alexei Starovoitov
10d274e880 bpf: introduce verifier internal test flag
Introduce BPF_F_TEST_STATE_FREQ flag to stress test parentage chain
and state pruning.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-28 00:30:11 +02:00
David S. Miller
68aaf44595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in r8169, bug fix had two versions in net
and net-next, take the net-next hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27 14:23:31 -07:00
Linus Torvalds
452a04441b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use 32-bit index for tails calls in s390 bpf JIT, from Ilya
    Leoshkevich.

 2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for
    SMC from Jason Baron.

 3) ipv6_mc_may_pull() should return 0 for malformed packets, not
    -EINVAL. From Stefano Brivio.

 4) Don't forget to unpin umem xdp pages in error path of
    xdp_umem_reg(). From Ivan Khoronzhuk.

 5) Fix sta object leak in mac80211, from Johannes Berg.

 6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2
    switches. From Florian Fainelli.

 7) Revert DMA sync removal from r8169 which was causing regressions on
    some MIPS Loongson platforms. From Heiner Kallweit.

 8) Use after free in flow dissector, from Jakub Sitnicki.

 9) Fix NULL derefs of net devices during ICMP processing across
    collect_md tunnels, from Hangbin Liu.

10) proto_register() memory leaks, from Zhang Lin.

11) Set NLM_F_MULTI flag in multipart netlink messages consistently,
    from John Fastabend.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
  r8152: Set memory to all 0xFFs on failed reg reads
  openvswitch: Fix conntrack cache with timeout
  ipv4: mpls: fix mpls_xmit for iptunnel
  nexthop: Fix nexthop_num_path for blackhole nexthops
  net: rds: add service level support in rds-info
  net: route dump netlink NLM_F_MULTI flag missing
  s390/qeth: reject oversized SNMP requests
  sock: fix potential memory leak in proto_register()
  MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT
  xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
  ipv4/icmp: fix rt dst dev null pointer dereference
  openvswitch: Fix log message in ovs conntrack
  bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0
  bpf: fix use after free in prog symbol exposure
  bpf: fix precision tracking in presence of bpf2bpf calls
  flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH
  Revert "r8169: remove not needed call to dma_sync_single_for_device"
  ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
  net/ncsi: Fix the payload copying for the request coming from Netlink
  qed: Add cleanup in qed_slowpath_start()
  ...
2019-08-27 10:12:48 -07:00
Marc Zyngier
2a1a3fa0f2 kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol
An arm64 kernel configured with

  CONFIG_KPROBES=y
  CONFIG_KALLSYMS=y
  # CONFIG_KALLSYMS_ALL is not set
  CONFIG_KALLSYMS_BASE_RELATIVE=y

reports the following kprobe failure:

  [    0.032677] kprobes: failed to populate blacklist: -22
  [    0.033376] Please take care of using kprobes.

It appears that kprobe fails to retrieve the symbol at address
0xffff000010081000, despite this symbol being in System.map:

  ffff000010081000 T __exception_text_start

This symbol is part of the first group of aliases in the
kallsyms_offsets array (symbol names generated using ugly hacks in
scripts/kallsyms.c):

  kallsyms_offsets:
          .long   0x1000 // do_undefinstr
          .long   0x1000 // efi_header_end
          .long   0x1000 // _stext
          .long   0x1000 // __exception_text_start
          .long   0x12b0 // do_cp15instr

Looking at the implementation of get_symbol_pos(), it returns the
lowest index for aliasing symbols. In this case, it return 0.

But kallsyms_lookup_size_offset() considers 0 as a failure, which
is obviously wrong (there is definitely a valid symbol living there).
In turn, the kprobe blacklisting stops abruptly, hence the original
error.

A CONFIG_KALLSYMS_ALL kernel wouldn't fail as there is always
some random symbols at the beginning of this array, which are never
looked up via kallsyms_lookup_size_offset.

Fix it by considering that get_symbol_pos() is always successful
(which is consistent with the other uses of this function).

Fixes: ffc5089196 ("[PATCH] Create kallsyms_lookup_size_offset()")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-27 16:19:56 +01:00
Ming Lei
b1a5a73e64 genirq/affinity: Spread vectors on node according to nr_cpu ratio
Now __irq_build_affinity_masks() spreads vectors evenly per node, but there
is a case that not all vectors have been spread when each numa node has a
different number of CPUs which triggers the warning in the spreading code.

Improve the spreading algorithm by

 - assigning vectors according to the ratio of the number of CPUs on a node
   to the number of remaining CPUs.

 - running the assignment from smaller nodes to bigger nodes to guarantee
   that every active node gets allocated at least one vector.

This ensures that all vectors are spread out. Asided of that the spread
becomes more fair if the nodes have different number of CPUs.

For example, on the following machine:
	CPU(s):              16
	On-line CPU(s) list: 0-15
	Thread(s) per core:  1
	Core(s) per socket:  8
	Socket(s):           2
	NUMA node(s):        2
	...
	NUMA node0 CPU(s):   0,1,3,5-9,11,13-15
	NUMA node1 CPU(s):   2,4,10,12

When a driver requests to allocate 8 vectors, the following spread results:

	irq 31, cpu list 2,4
	irq 32, cpu list 10,12
	irq 33, cpu list 0-1
	irq 34, cpu list 3,5
	irq 35, cpu list 6-7
	irq 36, cpu list 8-9
	irq 37, cpu list 11,13
	irq 38, cpu list 14-15

So Node 0 has now 6 and Node 1 has 2 vectors assigned. The original
algorithm assigned 4 vectors on each node which was unfair versus Node 0.

[ tglx: Massaged changelog ]

Reported-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Link: https://lkml.kernel.org/r/20190816022849.14075-3-ming.lei@redhat.com
2019-08-27 16:31:17 +02:00
Ming Lei
53c1788b7d genirq/affinity: Improve __irq_build_affinity_masks()
One invariant of __irq_build_affinity_masks() is that all CPUs in the
specified masks (cpu_mask AND node_to_cpumask for each node) should be
covered during the spread. Even though all requested vectors have been
reached, it's still required to spread vectors among remained CPUs. A
similar policy has been taken in case of 'numvecs <= nodes' already.

So remove the following check inside the loop:

	if (done >= numvecs)
		break;

Meantime assign at least 1 vector for remaining nodes if 'numvecs' vectors
have been handled already.

Also, if the specified cpumask for one numa node is empty, simply do not
spread vectors on this node.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190816022849.14075-2-ming.lei@redhat.com
2019-08-27 16:31:17 +02:00
Naveen N. Rao
ede7c460b1 bpf: handle 32-bit zext during constant blinding
Since BPF constant blinding is performed after the verifier pass, the
ALU32 instructions inserted for doubleword immediate loads don't have a
corresponding zext instruction. This is causing a kernel oops on powerpc
and can be reproduced by running 'test_cgroup_storage' with
bpf_jit_harden=2.

Fix this by emitting BPF_ZEXT during constant blinding if
prog->aux->verifier_zext is set.

Fixes: a4b1d3c1dd ("bpf: verifier: insert zero extension according to analysis result")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-26 23:05:01 +02:00
Linus Torvalds
5a13fc3d8b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping fix from Thomas Gleixner:
 "A single fix for a regression caused by the generic VDSO
  implementation where a math overflow causes CLOCK_BOOTTIME to become a
  random number generator"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
2019-08-25 10:08:01 -07:00
Linus Torvalds
8a04c2ee62 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Handle the worker management in situations where a task is scheduled
  out on a PI lock contention correctly and schedule a new worker if
  possible"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Schedule new worker even if PI-blocked
2019-08-25 10:06:12 -07:00
Linus Torvalds
05bbb9360a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Two small fixes for kprobes and perf:

   - Prevent a deadlock in kprobe_optimizer() causes by reverse lock
     ordering

   - Fix a comment typo"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix potential deadlock in kprobe_optimizer()
  perf/x86: Fix typo in comment
2019-08-25 10:03:32 -07:00
Linus Torvalds
44c471e436 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Thomas Gleixner:
 "A single fix for a imbalanced kobject operation in the irq decriptor
  code which was unearthed by the new warnings in the kobject code"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Properly pair kobject_del() with kobject_add()
2019-08-25 10:00:21 -07:00
Linus Torvalds
f47edb59bb Merge branch 'akpm' (patches from Andrew)
Mergr misc fixes from Andrew Morton:
 "11 fixes"

Mostly VM fixes, one psi polling fix, and one parisc build fix.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/kasan: fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y
  mm/zsmalloc.c: fix race condition in zs_destroy_pool
  mm/zsmalloc.c: migration can leave pages in ZS_EMPTY indefinitely
  mm, page_owner: handle THP splits correctly
  userfaultfd_release: always remove uffd flags and clear vm_userfaultfd_ctx
  psi: get poll_work to run when calling poll syscall next time
  mm: memcontrol: flush percpu vmevents before releasing memcg
  mm: memcontrol: flush percpu vmstats before releasing memcg
  parisc: fix compilation errrors
  mm, page_alloc: move_freepages should not examine struct page of reserved memory
  mm/z3fold.c: fix race between migration and destruction
2019-08-25 09:56:27 -07:00
Linus Torvalds
e67095fd2f dma-mapping fixes for 5.3-rc
Two fixes for regressions in this merge window:
 
  - select the Kconfig symbols for the noncoherent dma arch helpers
    on arm if swiotlb is selected, not just for LPAE to not break then
    Xen build, that uses swiotlb indirectly through swiotlb-xen
  - fix the page allocator fallback in dma_alloc_contiguous if the CMA
    allocation fails
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1hvn4LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYON4w//Recfoy5T2Q4Gfjp1xVKGbr2sP7J93Vs7VCyQNZmX
 PrtzhmNKs4gxCEXVgHm+GVA+IJwQFqDtSFaPb8q3GQ+qM9NUDF4ScMFpfrLZsFr1
 dorm5kC1xcwrQtWjS1CQS/Gj0VBtWiMQOoUcAESMqgBIUo4ssj3Ny+vnh8hWgAOs
 oVDgOM4wt35bW0Pv/iY44uQzOq7xcYJUUYtPIiP9vMDrhPsxe6D1DgFQ4HZKJWix
 uS3BjZnsZDnLltXM/0CKdRV9wLF+jHYP/wJTztksRlr/A5V3FJ8lJIvgphxG1v3J
 tDfQs4BNuGWBjqdg+Qo6qOPEL9krvVYYVVql93DXwtPK/cJW1Z+0glgC2rbbHmIy
 ew35DFnYm9v0sFLZnbpuoHd6sQ9G59nTZstkqt/Z/hldBvKotwBpeuILAcMC9Nlw
 3iYW6Sz5L7cmkifC8OvopKKJWVoW5rVtMrVQw5niBiZVERtWbY825r/7ju2xYhZC
 iSAaUHT5wNtXsXQOTrFQ5LzTDBtgGyXRXgvNagEHhBf120jBQfOhvOCVT2HHOxdy
 5vx7xeeRS0M2HpxIsmd3XQjIUQEY9x1to4FKiYczGM1kcKeyWWBMFOXfLxe2Rmhg
 h14lbfsAxIEWdFkJAVFhjyjzC6IzxyVGtHCxw1iw0VgGzYATO/K6Oo8T2hG3HagR
 abQ=
 =DXk9
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Two fixes for regressions in this merge window:

   - select the Kconfig symbols for the noncoherent dma arch helpers on
     arm if swiotlb is selected, not just for LPAE to not break then Xen
     build, that uses swiotlb indirectly through swiotlb-xen

   - fix the page allocator fallback in dma_alloc_contiguous if the CMA
     allocation fails"

* tag 'dma-mapping-5.3-5' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: fix zone selection after an unaddressable CMA allocation
  arm: select the dma-noncoherent symbols for all swiotlb builds
2019-08-24 20:00:11 -07:00
Jason Xing
7b2b55da1d psi: get poll_work to run when calling poll syscall next time
Only when calling the poll syscall the first time can user receive
POLLPRI correctly.  After that, user always fails to acquire the event
signal.

Reproduce case:
 1. Get the monitor code in Documentation/accounting/psi.txt
 2. Run it, and wait for the event triggered.
 3. Kill and restart the process.

The question is why we can end up with poll_scheduled = 1 but the work
not running (which would reset it to 0).  And the answer is because the
scheduling side sees group->poll_kworker under RCU protection and then
schedules it, but here we cancel the work and destroy the worker.  The
cancel needs to pair with resetting the poll_scheduled flag.

Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Jason Xing <kerneljasonxing@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-24 19:48:42 -07:00
David S. Miller
211c462452 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-08-24

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei.

2) Fix a use-after-free in prog symbol exposure, from Daniel.

3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya.

4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan.

5) Fix a potential use-after-free on flow dissector detach, from Jakub.

6) Fix bpftool to close prog fd after showing metadata, from Quentin.

7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23 17:34:11 -07:00
Daniel Borkmann
c751798aa2 bpf: fix use after free in prog symbol exposure
syzkaller managed to trigger the warning in bpf_jit_free() which checks via
bpf_prog_kallsyms_verify_off() for potentially unlinked JITed BPF progs
in kallsyms, and subsequently trips over GPF when walking kallsyms entries:

  [...]
  8021q: adding VLAN 0 to HW filter on device batadv0
  8021q: adding VLAN 0 to HW filter on device batadv0
  WARNING: CPU: 0 PID: 9869 at kernel/bpf/core.c:810 bpf_jit_free+0x1e8/0x2a0
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Workqueue: events bpf_prog_free_deferred
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x113/0x167 lib/dump_stack.c:113
   panic+0x212/0x40b kernel/panic.c:214
   __warn.cold.8+0x1b/0x38 kernel/panic.c:571
   report_bug+0x1a4/0x200 lib/bug.c:186
   fixup_bug arch/x86/kernel/traps.c:178 [inline]
   do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
   do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
   invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
  RIP: 0010:bpf_jit_free+0x1e8/0x2a0
  Code: 02 4c 89 e2 83 e2 07 38 d0 7f 08 84 c0 0f 85 86 00 00 00 48 ba 00 02 00 00 00 00 ad de 0f b6 43 02 49 39 d6 0f 84 5f fe ff ff <0f> 0b e9 58 fe ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 e2 48 c1
  RSP: 0018:ffff888092f67cd8 EFLAGS: 00010202
  RAX: 0000000000000007 RBX: ffffc90001947000 RCX: ffffffff816e9d88
  RDX: dead000000000200 RSI: 0000000000000008 RDI: ffff88808769f7f0
  RBP: ffff888092f67d00 R08: fffffbfff1394059 R09: fffffbfff1394058
  R10: fffffbfff1394058 R11: ffffffff89ca02c7 R12: ffffc90001947002
  R13: ffffc90001947020 R14: ffffffff881eca80 R15: ffff88808769f7e8
  BUG: unable to handle kernel paging request at fffffbfff400d000
  #PF error: [normal kernel read fault]
  PGD 21ffee067 P4D 21ffee067 PUD 21ffed067 PMD 9f942067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 0 PID: 9869 Comm: kworker/0:7 Not tainted 5.0.0-rc8+ #1
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Workqueue: events bpf_prog_free_deferred
  RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:495 [inline]
  RIP: 0010:bpf_tree_comp kernel/bpf/core.c:558 [inline]
  RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
  RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
  RIP: 0010:bpf_prog_kallsyms_find+0x107/0x2e0 kernel/bpf/core.c:632
  Code: 00 f0 ff ff 44 38 c8 7f 08 84 c0 0f 85 fa 00 00 00 41 f6 45 02 01 75 02 0f 0b 48 39 da 0f 82 92 00 00 00 48 89 d8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 03 0f 8e 45 01 00 00 8b 03 48 c1 e0
  [...]

Upon further debugging, it turns out that whenever we trigger this
issue, the kallsyms removal in bpf_prog_ksym_node_del() was /skipped/
but yet bpf_jit_free() reported that the entry is /in use/.

Problem is that symbol exposure via bpf_prog_kallsyms_add() but also
perf_event_bpf_event() were done /after/ bpf_prog_new_fd(). Once the
fd is exposed to the public, a parallel close request came in right
before we attempted to do the bpf_prog_kallsyms_add().

Given at this time the prog reference count is one, we start to rip
everything underneath us via bpf_prog_release() -> bpf_prog_put().
The memory is eventually released via deferred free, so we're seeing
that bpf_jit_free() has a kallsym entry because we added it from
bpf_prog_load() but /after/ bpf_prog_put() from the remote CPU.

Therefore, move both notifications /before/ we install the fd. The
issue was never seen between bpf_prog_alloc_id() and bpf_prog_new_fd()
because upon bpf_prog_get_fd_by_id() we'll take another reference to
the BPF prog, so we're still holding the original reference from the
bpf_prog_load().

Fixes: 6ee52e2a3f ("perf, bpf: Introduce PERF_RECORD_BPF_EVENT")
Fixes: 74451e66d5 ("bpf: make jited programs visible in traces")
Reported-by: syzbot+bd3bba6ff3fcea7a6ec6@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Song Liu <songliubraving@fb.com>
2019-08-24 01:17:47 +02:00
Alexei Starovoitov
6754172c20 bpf: fix precision tracking in presence of bpf2bpf calls
While adding extra tests for precision tracking and extra infra
to adjust verifier heuristics the existing test
"calls: cross frame pruning - liveness propagation" started to fail.
The root cause is the same as described in verifer.c comment:

 * Also if parent's curframe > frame where backtracking started,
 * the verifier need to mark registers in both frames, otherwise callees
 * may incorrectly prune callers. This is similar to
 * commit 7640ead939 ("bpf: verifier: make sure callees don't prune with caller differences")
 * For now backtracking falls back into conservative marking.

Turned out though that returning -ENOTSUPP from backtrack_insn() and
doing mark_all_scalars_precise() in the current parentage chain is not enough.
Depending on how is_state_visited() heuristic is creating parentage chain
it's possible that callee will incorrectly prune caller.
Fix the issue by setting precise=true earlier and more aggressively.
Before this fix the precision tracking _within_ functions that don't do
bpf2bpf calls would still work. Whereas now precision tracking is completely
disabled when bpf2bpf calls are present anywhere in the program.

No difference in cilium tests (they don't have bpf2bpf calls).
No difference in test_progs though some of them have bpf2bpf calls,
but precision tracking wasn't effective there.

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-24 01:17:12 +02:00
Linus Torvalds
e3fb13b7e4 Modules fixes for v5.3-rc6
- Fix BUG_ON() being triggered in frob_text() due to non-page-aligned module
   sections
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJdX/7kAAoJEMBFfjjOO8Fy/l0P/j7S2HY31eD+ol5vsohp2rWv
 CDDNiDlDWKkmCNzQy1giynbH5EVOZCZyoLcQgVTMJKHoRVeS/Mo9BU9FU4ZW2WWK
 DxoCUN4WrCvhCrt5MLiZEkrAgIyYIoGeiHTRyBYSv7PrxFNxSWuQQQZ7IRssWHSQ
 1sbf1rt9xX6uddQyCaZlCTzPrc2M6v2YVMFR2Wq5Ts8qazbtf6ZoUXfg6vYMc/bm
 k55aJPKmq/HEfifMZxrwRPIPD+v2MddZ8R6ti3QSWMbzSSNiUR8rlJPIa7GQy5ia
 Xbut/a9Q8/2qR81D2Mkcw1xlff7QI0i55JAmJ8OgkXA55Xyv+TSzW/RJRmCanYAo
 gj4TCncbtAj8GYL+R309rhPjG222/GA42/Gosugdfa15ojn/LI9g1XjOdyPpT38I
 RpnhL4kxzpCbUnWX++fB8tmMk0pQdlsorhTcA+y8zyFAifdf6KC7Fw7SEpKWtPjH
 gbOkjgjpFo6JOS9ywAeoMet5BQ0lngGlS3iLdQA/zOr0q3JWTI3BmXa3DrREnJSs
 6R+alYUAKAbwolB4yLvlavfcjIJTUvpP++VZHR+xJo51t7onkv8a1T8844Y6xVaq
 WItpudredat7m+dZEd4cYmbMmEzn8Z6hvbLgzJOOv2ZSFeURuZlK2Xsbz8Yxdfeu
 sGjDYiKQOaV0VmI/l8fY
 =oXjz
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules fixes from Jessica Yu:
 "Fix BUG_ON() being triggered in frob_text() due to non-page-aligned
  module sections"

* tag 'modules-for-v5.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  modules: page-align module section allocations only for arches supporting strict module rwx
  modules: always page-align module section allocations
2019-08-23 09:22:00 -07:00
Thomas Gleixner
b99328a60a timekeeping/vsyscall: Prevent math overflow in BOOTTIME update
The VDSO update for CLOCK_BOOTTIME has a overflow issue as it shifts the
nanoseconds based boot time offset left by the clocksource shift. That
overflows once the boot time offset becomes large enough. As a consequence
CLOCK_BOOTTIME in the VDSO becomes a random number causing applications to
misbehave.

Fix it by storing a timespec64 representation of the offset when boot time
is adjusted and add that to the MONOTONIC base time value in the vdso data
page. Using the timespec64 representation avoids a 64bit division in the
update code.

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Reported-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908221257580.1983@nanos.tec.linutronix.de
2019-08-23 02:12:11 +02:00
Ingo Molnar
6c06b66e95 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM changes from Paul E. McKenney:

 - A few more RCU flavor consolidation cleanups.

 - Miscellaneous fixes.

 - Updates to RCU's list-traversal macros improving lockdep usability.

 - Torture-test updates.

 - Forward-progress improvements for no-CBs CPUs: Avoid ignoring
   incoming callbacks during grace-period waits.

 - Forward-progress improvements for no-CBs CPUs: Use ->cblist
   structure to take advantage of others' grace periods.

 - Also added a small commit that avoids needlessly inflicting
   scheduler-clock ticks on callback-offloaded CPUs.

 - Forward-progress improvements for no-CBs CPUs: Reduce contention
   on ->nocb_lock guarding ->cblist.

 - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
   list to further reduce contention on ->nocb_lock guarding ->cblist.

 - LKMM updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-22 20:52:04 +02:00
Jason Gunthorpe
daa138a58c Merge branch 'odp_fixes' into hmm.git
From rdma.git

Jason Gunthorpe says:

====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================

The branch is based on v5.3-rc5 due to dependencies, and is being taken
into hmm.git due to dependencies in the next patches.

* odp_fixes:
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  RDMA/core: Make invalidate_range a device operation
  RDMA/odp: Use kvcalloc for the dma_list and page_list
  RDMA/odp: Check for overflow when computing the umem_odp end
  RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
  RDMA/odp: Split creating a umem_odp from ib_umem_get
  RDMA/odp: Make the three ways to create a umem_odp clear
  RMDA/odp: Consolidate umem_odp initialization
  RDMA/odp: Make it clearer when a umem is an implicit ODP umem
  RDMA/odp: Iterate over the whole rbtree directly
  RDMA/odp: Use the common interval tree library instead of generic
  RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-21 20:58:18 -03:00
Thomas Gleixner
dce3e8fd03 posix-cpu-timers: Remove tsk argument from run_posix_cpu_timers()
It's always current. Don't give people wrong ideas.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.945469967@linutronix.de
2019-08-21 20:27:16 +02:00
Thomas Gleixner
692117c1f7 posix-cpu-timers: Sanitize bogus WARNONS
Warning when p == NULL and then proceeding and dereferencing p does not
make any sense as the kernel will crash with a NULL pointer dereference
right away.

Bailing out when p == NULL and returning an error code does not cure the
underlying problem which caused p to be NULL. Though it might allow to
do proper debugging.

Same applies to the clock id check in set_process_cpu_timer().

Clean them up and make them return without trying to do further damage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.846497772@linutronix.de
2019-08-21 20:27:15 +02:00
Peter Wu
5cbd22c179 bpf: clarify description for CONFIG_BPF_EVENTS
PERF_EVENT_IOC_SET_BPF supports uprobes since v4.3, and tracepoints
since v4.7 via commit 04a22fae4c ("tracing, perf: Implement BPF
programs attached to uprobes"), and commit 98b5c2c65c ("perf, bpf:
allow bpf programs attach to tracepoints") respectively.

Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-21 10:17:24 -07:00
Julien Grall
68b2c8c1e4 hrtimer: Don't take expiry_lock when timer is currently migrated
migration_base is used as a placeholder when an hrtimer is migrated to a
different CPU. In the case that hrtimer_cancel_wait_running() hits a timer
which is currently migrated it would pointlessly acquire the expiry lock of
the migration base, which is even not initialized.

Surely it could be initialized, but there is absolutely no point in
acquiring this lock because the timer is guaranteed not to run it's
callback for which the caller waits to finish on that base. So it would
just do the inc/lock/dec/unlock dance for nothing.

As the base switch is short and non-preemptible, there is no issue when the
wait function returns immediately.

The timer base and base->cpu_base cannot be NULL in the code path which is
invoking that, so just replace those checks with a check whether base is
migration base.

[ tglx: Updated from RT patch. Massaged changelog. Added comment. ]

Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821092409.13225-4-julien.grall@arm.com
2019-08-21 16:10:01 +02:00
Julien Grall
dd2261ed45 hrtimer: Protect lockless access to timer->base
The update to timer->base is protected by the base->cpu_base->lock().
However, hrtimer_cancel_wait_running() does access it lockless.  So the
compiler is allowed to refetch timer->base which can cause havoc when the
timer base is changed concurrently.

Use READ_ONCE() to prevent this.

[ tglx: Adapted from a RT patch ]

Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190821092409.13225-2-julien.grall@arm.com
2019-08-21 16:10:01 +02:00
Santosh Sivaraj
49ec9177b8 extable: Add function to search only kernel exception table
Certain architecture specific operating modes (e.g., in powerpc machine
check handler that is unable to access vmalloc memory), the
search_exception_tables cannot be called because it also searches the
module exception tables if entry is not found in the kernel exception
table.

Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820081352.8641-5-santosh@fossix.org
2019-08-21 22:23:48 +10:00
He Zhe
3b5be16c7e modules: page-align module section allocations only for arches supporting strict module rwx
We should keep the case of "#define debug_align(X) (X)" for all arches
without CONFIG_HAS_STRICT_MODULE_RWX ability, which would save people, who
are sensitive to system size, a lot of memory when using modules,
especially for embedded systems. This is also the intention of the
original #ifdef... statement and still valid for now.

Note that this still keeps the effect of the fix of the following commit,
38f054d549 ("modules: always page-align module section allocations"),
since when CONFIG_ARCH_HAS_STRICT_MODULE_RWX is enabled, module pages are
aligned.

Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-08-21 10:43:56 +02:00
Amit Kucheria
c3082a674f PM: QoS: Get rid of unused flags
The network_latency and network_throughput flags for PM-QoS have not
found much use in drivers or in userspace since they were introduced.

Commit 4a733ef1be ("mac80211: remove PM-QoS listener") removed the
only user PM_QOS_NETWORK_LATENCY in the kernel a while ago and there
don't seem to be any userspace tools using the character device files
either.

PM_QOS_MEMORY_BANDWIDTH was never even added to the trace events.

Remove all the flags except cpu_dma_latency.

Signed-off-by: Amit Kucheria <amit.kucheria@linaro.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21 00:38:54 +02:00
Tri Vo
c8377adfa7 PM / wakeup: Show wakeup sources stats in sysfs
Add an ID and a device pointer to 'struct wakeup_source'. Use them to to
expose wakeup sources statistics in sysfs under
/sys/class/wakeup/wakeup<ID>/*.

Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21 00:20:40 +02:00
Tri Vo
2434aea58e PM / wakeup: Use wakeup_source_register() in wakelock.c
kernel/power/wakelock.c duplicates wakeup source creation and
registration code from drivers/base/power/wakeup.c.

Change struct wakelock's wakeup source to a pointer and use
wakeup_source_register() function to create and register said wakeup
source. Use wakeup_source_unregister() on cleanup path.

Signed-off-by: Tri Vo <trong@android.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-21 00:20:40 +02:00
Christoph Hellwig
90ae409f9e dma-direct: fix zone selection after an unaddressable CMA allocation
The new dma_alloc_contiguous hides if we allocate CMA or regular
pages, and thus fails to retry a ZONE_NORMAL allocation if the CMA
allocation succeeds but isn't addressable.  That means we either fail
outright or dip into a small zone that might not succeed either.

Thanks to Hillf Danton for debugging this issue.

Fixes: b1d2dc009d ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
2019-08-21 07:14:10 +09:00
Thomas Gleixner
7cb9a94c15 posix-cpu-timers: Fixup stale comment
The comment above cleanup_timers() is outdated. The timers are only removed
from the task/process list heads but not modified in any other way.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.747233612@linutronix.de
2019-08-20 22:09:53 +02:00
Frederic Weisbecker
0bee3b601b hrtimer: Improve comments on handling priority inversion against softirq kthread
The handling of a priority inversion between timer cancelling and a a not
well defined possible preemption of softirq kthread is not very clear.

Especially in the posix timers side it's unclear why there is a specific RT
wait callback.

All the nice explanations can be found in the initial changelog of
f61eff83ce (hrtimer: Prepare support for PREEMPT_RT").

Extract the detailed informations from there and put it into comments.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190820132656.GC2093@lenoir
2019-08-20 22:05:46 +02:00
Thomas Gleixner
ec8f954a40 posix-timers: Use a callback for cancel synchronization on PREEMPT_RT
Posix timer delete retry loops are affected by the same priority inversion
and live lock issues as the other timers.
    
Provide a RT specific synchronization function which keeps a reference to
the timer by holding rcu read lock to prevent the timer from being freed,
dropping the timer lock and invoking the timer specific wait function via a
new callback.
    
This does not yet cover posix CPU timers because they need more special
treatment on PREEMPT_RT.

[ This is folded into the original attempt which did not use a callback. ]

Originally-by: Anna-Maria Gleixenr <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190819143801.656864506@linutronix.de
2019-08-20 22:05:46 +02:00
Catalin Marinas
3e91ec89f5 arm64: Tighten the PR_{SET, GET}_TAGGED_ADDR_CTRL prctl() unused arguments
Require that arg{3,4,5} of the PR_{SET,GET}_TAGGED_ADDR_CTRL prctl and
arg2 of the PR_GET_TAGGED_ADDR_CTRL prctl() are zero rather than ignored
for future extensions.

Acked-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-20 18:17:55 +01:00
Quentin Monnet
1b9ed84ecf bpf: add new BPF_BTF_GET_NEXT_ID syscall command
Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used
to cycle through all BTF objects loaded on the system.

The motivation is to be able to inspect (list) all BTF objects presents
on the system.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-20 09:51:06 -07:00
Quentin Monnet
3481e64bbe bpf: add BTF ids in procfs for file descriptors to BTF objects
Implement the show_fdinfo hook for BTF FDs file operations, and make it
print the id of the BTF object. This allows for a quick retrieval of the
BTF id from its FD; or it can help understanding what type of object
(BTF) the file descriptor points to.

v2:
- Do not expose data_size, only btf_id, in FD info.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-20 16:19:12 +02:00
YueHaibing
ede6bc88d6 bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()
Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-20 16:03:52 +02:00
Christoph Hellwig
6869b7b206 memremap: provide a not device managed memremap_pages
The kvmppc ultravisor code wants a device private memory pool that is
system wide and not attached to a device.  Instead of faking up one
provide a low-level memremap_pages for it.

Link: https://lore.kernel.org/r/20190818090557.17853-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
6f42193fd8 memremap: don't use a separate devm action for devmap_managed_enable_get
Just clean up for early failures and then piggy back on
devm_memremap_pages_release.  This helps with a pending not device managed
version of devm_memremap_pages.

Link: https://lore.kernel.org/r/20190818090557.17853-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
fdc029b19d memremap: remove the dev field in struct dev_pagemap
The dev field in struct dev_pagemap is only used to print dev_name in two
places, which are at best nice to have.  Just remove the field and thus
the name in those two messages.

Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:41:35 -03:00
Christoph Hellwig
0c38519039 resource: add a not device managed request_free_mem_region variant
Factor out the guts of devm_request_free_mem_region so that we can
implement both a device managed and a manually release version as tiny
wrappers around it.

Link: https://lore.kernel.org/r/20190818090557.17853-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:39:41 -03:00
Jason Gunthorpe
c7d8b7824f hmm: use mmu_notifier_get/put for 'struct hmm'
This is a significant simplification, it eliminates all the remaining
'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
paths, and takes away all the ugly locking and abuse of page_table_lock.

mmu_notifier_get() provides the single struct hmm per struct mm which
eliminates mm->hmm.

It also directly guarantees that no mmu_notifier op callback is callable
while concurrent free is possible, this eliminates all the krefs inside
the mmu_notifier callbacks.

The remaining krefs in the range code were overly cautious, drivers are
already not permitted to free the mirror while a range exists.

Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Tested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-08-20 09:35:02 -03:00
Matthew Garrett
29d3c1c8df kexec: Allow kexec_file() with appropriate IMA policy when locked down
Systems in lockdown mode should block the kexec of untrusted kernels.
For x86 and ARM we can ensure that a kernel is trustworthy by validating
a PE signature, but this isn't possible on other architectures. On those
platforms we can use IMA digital signatures instead. Add a function to
determine whether IMA has or will verify signatures for a given event type,
and if so permit kexec_file() even if the kernel is otherwise locked down.
This is restricted to cases where CONFIG_INTEGRITY_TRUSTED_KEYRING is set
in order to prevent an attacker from loading additional keys at runtime.

Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
Cc: linux-integrity@vger.kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
b0c8fdc7fd lockdown: Lock down perf when in confidentiality mode
Disallow the use of certain perf facilities that might allow userspace to
access kernel data.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
9d1f8be5cf bpf: Restrict bpf when kernel lockdown is in confidentiality mode
bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
private keys in kernel memory to be leaked. Disable them if the kernel
has been locked down in confidentiality mode.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: netdev@vger.kernel.org
cc: Chun-Yi Lee <jlee@suse.com>
cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
a94549dd87 lockdown: Lock down tracing and perf kprobes when in confidentiality mode
Disallow the creation of perf and ftrace kprobes when the kernel is
locked down in confidentiality mode by preventing their registration.
This prevents kprobes from being used to access kernel memory to steal
crypto data, but continues to allow the use of kprobes from signed
modules.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: davem@davemloft.net
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
David Howells
20657f66ef lockdown: Lock down module params that specify hardware parameters (eg. ioport)
Provided an annotation for module parameters that specify hardware
parameters (such as io ports, iomem addresses, irqs, dma channels, fixed
dma buffers and other types).

Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:16 -07:00
Josh Boyer
38bd94b8a1 hibernate: Disable when the kernel is locked down
There is currently no way to verify the resume image when returning
from hibernate.  This might compromise the signed modules trust model,
so until we can work with signed hibernate images we disable it when the
kernel is locked down.

Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: rjw@rjwysocki.net
Cc: pavel@ucw.cz
cc: linux-pm@vger.kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Jiri Bohac
155bdd30af kexec_file: Restrict at runtime if the kernel is locked down
When KEXEC_SIG is not enabled, kernel should not load images through
kexec_file systemcall if the kernel is locked down.

[Modified by David Howells to fit with modifications to the previous patch
 and to return -EPERM if the kernel is locked down for consistency with
 other lockdowns. Modified by Matthew Garrett to remove the IMA
 integration, which will be replaced by integrating with the IMA
 architecture policy patches.]

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Jiri Bohac
99d5cadfde kexec_file: split KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE
This is a preparatory patch for kexec_file_load() lockdown.  A locked down
kernel needs to prevent unsigned kernel images from being loaded with
kexec_file_load().  Currently, the only way to force the signature
verification is compiling with KEXEC_VERIFY_SIG.  This prevents loading
usigned images even when the kernel is not locked down at runtime.

This patch splits KEXEC_VERIFY_SIG into KEXEC_SIG and KEXEC_SIG_FORCE.
Analogous to the MODULE_SIG and MODULE_SIG_FORCE for modules, KEXEC_SIG
turns on the signature verification but allows unsigned images to be
loaded.  KEXEC_SIG_FORCE disallows images without a valid signature.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Matthew Garrett
7d31f4602f kexec_load: Disable at runtime if the kernel is locked down
The kexec_load() syscall permits the loading and execution of arbitrary
code in ring 0, which is something that lock-down is meant to prevent. It
makes sense to disable kexec_load() in this situation.

This does not affect kexec_file_load() syscall which can check for a
signature on the image to be booted.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <mjg59@google.com>
Acked-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
David Howells
49fcf732bd lockdown: Enforce module signatures if the kernel is locked down
If the kernel is locked down, require that all modules have valid
signatures that we can verify.

I have adjusted the errors generated:

 (1) If there's no signature (ENODATA) or we can't check it (ENOPKG,
     ENOKEY), then:

     (a) If signatures are enforced then EKEYREJECTED is returned.

     (b) If there's no signature or we can't check it, but the kernel is
	 locked down then EPERM is returned (this is then consistent with
	 other lockdown cases).

 (2) If the signature is unparseable (EBADMSG, EINVAL), the signature fails
     the check (EKEYREJECTED) or a system error occurs (eg. ENOMEM), we
     return the error we got.

Note that the X.509 code doesn't check for key expiry as the RTC might not
be valid or might not have been transferred to the kernel's clock yet.

 [Modified by Matthew Garrett to remove the IMA integration. This will
  be replaced with integration with the IMA architecture policy
  patchset.]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Matthew Garrett <matthewgarrett@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19 21:54:15 -07:00
Linus Torvalds
287c55ed7d Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kernel thread signal handling fix from Eric Biederman:
 "I overlooked the fact that kernel threads are created with all signals
  set to SIG_IGN, and accidentally caused a regression in cifs and drbd
  when replacing force_sig with send_sig.

  This is my fix for that regression. I add a new function
  allow_kernel_signal which allows kernel threads to receive signals
  sent from the kernel, but continues to ignore all signals sent from
  userspace. This ensures the user space interface for cifs and drbd
  remain the same.

  These kernel threads depend on blocking networking calls which block
  until something is received or a signal is pending. Making receiving
  of signals somewhat necessary for these kernel threads.

  Perhaps someday we can cleanup those interfaces and remove
  allow_kernel_signal. If not allow_kernel_signal is pretty trivial and
  clearly documents what is going on so I don't think we will mind
  carrying it"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Allow cifs and drbd to receive their terminating signals
2019-08-19 16:17:59 -07:00
Michael Kelley
d0ff14fdc9 genirq: Properly pair kobject_del() with kobject_add()
If alloc_descs() fails before irq_sysfs_init() has run, free_desc() in the
cleanup path will call kobject_del() even though the kobject has not been
added with kobject_add().

Fix this by making the call to kobject_del() conditional on whether
irq_sysfs_init() has run.

This problem surfaced because commit aa30f47cf6 ("kobject: Add support
for default attribute groups to kobj_type") makes kobject_del() stricter
about pairing with kobject_add(). If the pairing is incorrrect, a WARNING
and backtrace occur in sysfs_remove_group() because there is no parent.

[ tglx: Add a comment to the code and make it work with CONFIG_SYSFS=n ]

Fixes: ecb3f394c5 ("genirq: Expose interrupt information through sysfs")
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1564703564-4116-1-git-send-email-mikelley@microsoft.com
2019-08-19 21:41:19 +02:00
David S. Miller
446bf64b61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge conflict of mlx5 resolved using instructions in merge
commit 9566e650bf.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19 11:54:03 -07:00
Thomas Gleixner
b6a32bbd87 genirq: Force interrupt threading on RT
Switch force_irqthreads from a boot time modifiable variable to a compile
time constant when CONFIG_PREEMPT_RT is enabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190816160923.12855-1-bigeasy@linutronix.de
2019-08-19 15:45:48 +02:00
Eric W. Biederman
33da8e7c81 signal: Allow cifs and drbd to receive their terminating signals
My recent to change to only use force_sig for a synchronous events
wound up breaking signal reception cifs and drbd.  I had overlooked
the fact that by default kthreads start out with all signals set to
SIG_IGN.  So a change I thought was safe turned out to have made it
impossible for those kernel thread to catch their signals.

Reverting the work on force_sig is a bad idea because what the code
was doing was very much a misuse of force_sig.  As the way force_sig
ultimately allowed the signal to happen was to change the signal
handler to SIG_DFL.  Which after the first signal will allow userspace
to send signals to these kernel threads.  At least for
wake_ack_receiver in drbd that does not appear actively wrong.

So correct this problem by adding allow_kernel_signal that will allow
signals whose siginfo reports they were sent by the kernel through,
but will not allow userspace generated signals, and update cifs and
drbd to call allow_kernel_signal in an appropriate place so that their
thread can receive this signal.

Fixing things this way ensures that userspace won't be able to send
signals and cause problems, that it is clear which signals the
threads are expecting to receive, and it guarantees that nothing
else in the system will be affected.

This change was partly inspired by similar cifs and drbd patches that
added allow_signal.

Reported-by: ronnie sahlberg <ronniesahlberg@gmail.com>
Reported-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Tested-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Cc: Steve French <smfrench@gmail.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Fixes: 247bc9470b ("cifs: fix rmmod regression in cifs.ko caused by force_sig changes")
Fixes: 72abe3bcf0 ("signal/cifs: Fix cifs_put_tcp_session to call send_sig instead of force_sig")
Fixes: fee109901f ("signal/drbd: Use send_sig not force_sig")
Fixes: 3cf5d076fb ("signal: Remove task parameter from force_sig")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-08-19 06:34:13 -05:00
Miroslav Benes
4ff96fb52c livepatch: Nullify obj->mod in klp_module_coming()'s error path
klp_module_coming() is called for every module appearing in the system.
It sets obj->mod to a patched module for klp_object obj. Unfortunately
it leaves it set even if an error happens later in the function and the
patched module is not allowed to be loaded.

klp_is_object_loaded() uses obj->mod variable and could currently give a
wrong return value. The bug is probably harmless as of now.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-08-19 13:03:37 +02:00
Andrea Righi
f1c6ece237 kprobes: Fix potential deadlock in kprobe_optimizer()
lockdep reports the following deadlock scenario:

 WARNING: possible circular locking dependency detected

 kworker/1:1/48 is trying to acquire lock:
 000000008d7a62b2 (text_mutex){+.+.}, at: kprobe_optimizer+0x163/0x290

 but task is already holding lock:
 00000000850b5e2d (module_mutex){+.+.}, at: kprobe_optimizer+0x31/0x290

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (module_mutex){+.+.}:
        __mutex_lock+0xac/0x9f0
        mutex_lock_nested+0x1b/0x20
        set_all_modules_text_rw+0x22/0x90
        ftrace_arch_code_modify_prepare+0x1c/0x20
        ftrace_run_update_code+0xe/0x30
        ftrace_startup_enable+0x2e/0x50
        ftrace_startup+0xa7/0x100
        register_ftrace_function+0x27/0x70
        arm_kprobe+0xb3/0x130
        enable_kprobe+0x83/0xa0
        enable_trace_kprobe.part.0+0x2e/0x80
        kprobe_register+0x6f/0xc0
        perf_trace_event_init+0x16b/0x270
        perf_kprobe_init+0xa7/0xe0
        perf_kprobe_event_init+0x3e/0x70
        perf_try_init_event+0x4a/0x140
        perf_event_alloc+0x93a/0xde0
        __do_sys_perf_event_open+0x19f/0xf30
        __x64_sys_perf_event_open+0x20/0x30
        do_syscall_64+0x65/0x1d0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe

 -> #0 (text_mutex){+.+.}:
        __lock_acquire+0xfcb/0x1b60
        lock_acquire+0xca/0x1d0
        __mutex_lock+0xac/0x9f0
        mutex_lock_nested+0x1b/0x20
        kprobe_optimizer+0x163/0x290
        process_one_work+0x22b/0x560
        worker_thread+0x50/0x3c0
        kthread+0x112/0x150
        ret_from_fork+0x3a/0x50

 other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(module_mutex);
                                lock(text_mutex);
                                lock(module_mutex);
   lock(text_mutex);

  *** DEADLOCK ***

As a reproducer I've been using bcc's funccount.py
(https://github.com/iovisor/bcc/blob/master/tools/funccount.py),
for example:

 # ./funccount.py '*interrupt*'

That immediately triggers the lockdep splat.

Fix by acquiring text_mutex before module_mutex in kprobe_optimizer().

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: d5b844a2cf ("ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()")
Link: http://lkml.kernel.org/r/20190812184302.GA7010@xps-13
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-19 12:22:19 +02:00
Sebastian Andrzej Siewior
b0fdc01354 sched/core: Schedule new worker even if PI-blocked
If a task is PI-blocked (blocking on sleeping spinlock) then we don't want to
schedule a new kworker if we schedule out due to lock contention because !RT
does not do that as well. A spinning spinlock disables preemption and a worker
does not schedule out on lock contention (but spin).

On RT the RW-semaphore implementation uses an rtmutex so
tsk_is_pi_blocked() will return true if a task blocks on it. In this case we
will now start a new worker which may deadlock if one worker is waiting on
progress from another worker. Since a RW-semaphore starts a new worker on !RT,
we should do the same on RT.

XFS is able to trigger this deadlock.

Allow to schedule new worker if the current worker is PI-blocked.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190816160626.12742-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-08-19 10:57:26 +02:00
Linus Torvalds
5bba5c9c86 SPDX fixes for 5.3-rc5
Here are 4 small SPDX fixes for 5.3-rc5.  A few style fixes for some
 SPDX comments, added an SPDX tag for one file, and fix up some GPL
 boilerplate for another file.
 
 All of these have been in linux-next for a few weeks with no reported
 issues (they are comment changes only, so that's to be expected...)
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXVkVSg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yneygCfdBxdIl98qXA2SRDLeKl/PkSJH1gAoLwnkoKq
 WK/gN0IMFf25UrItBsGe
 =b31n
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull SPDX fixes from Greg KH:
 "Here are four small SPDX fixes for 5.3-rc5.

  A few style fixes for some SPDX comments, added an SPDX tag for one
  file, and fix up some GPL boilerplate for another file.

  All of these have been in linux-next for a few weeks with no reported
  issues (they are comment changes only, so that's to be expected...)"

* tag 'spdx-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
  i2c: stm32: Use the correct style for SPDX License Identifier
  intel_th: Use the correct style for SPDX License Identifier
  coccinelle: api/atomic_as_refcounter: add SPDX License Identifier
  kernel/configs: Replace GPL boilerplate code with SPDX identifier
2019-08-18 09:26:16 -07:00
Björn Töpel
36cc34358c xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:24:45 +02:00
Björn Töpel
0402acd683 xsk: remove AF_XDP socket from map when the socket is released
When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.

Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:24:45 +02:00
Stanislav Fomichev
b0e4701ce1 bpf: export bpf_map_inc_not_zero
Rename existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
indicate that it's caller's responsibility to do proper locking.
Create and export bpf_map_inc_not_zero wrapper that properly
locks map_idr_lock. Will be used in the next commit to
hold a map while cloning a socket.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:18:54 +02:00
Christoph Hellwig
0d3d343560 genirq: remove the is_affinity_mask_valid hook
This override was only used by the ia64 SGI SN2 platform, which is
gone now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20190813072514.23299-29-hch@lst.de
Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-08-16 14:32:26 -07:00
Linus Torvalds
2d63ba3e41 Power management fixes for 5.3-rc5
- Disable NVMe power optimization related to suspend-to-idle added
    recently on systems where PCIe ASPM is not able to put PCIe links
    into low-power states to prevent excess power from being drawn by
    the system while suspended (Rafael Wysocki).
 
  - Make the schedutil governor handle frequency limits changes
    properly in all cases (Viresh Kumar).
 
  - Prevent the cpufreq core from treating positive values returned
    by dev_pm_qos_update_request() as errors (Viresh Kumar).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl1WqLMSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxCkAP/146AuGXj8tOdHxkpl6DVgm0WVRNxCtL
 Z9Y+1xBRBSYVZkeDsjzox995z8Ha/0tnMp6EPcnxebkFpRx3fyldXKUKxqJARPPi
 n2jGhqCPNcAHK2UPdGH8EvHOI2uWBMBa2jW2Qw9m0V/+9Zy58ZvKqso/+myFkz2S
 YRekJPADsI3GZW1SZ3dY4/12jcKsQt32TWaGOLqKx3R1J1BnpyxduXfqJ6FUrH9b
 P/F9cVb2UEbawh5QpNmfMsfBb/DsE08NQhPWe91m0VgcLd6IZsoNux0Rd8HJOvRM
 +5vh6qPTABnNN1+7blFw64/hCu1N2hq8KLl6DzPeKohysKiDkmLh3QGB+ISRpj+H
 5GKF8gnQFvN0fPJF8NU+eIZ0IaOryrooSu4TeCcAWAozJ0ln2mjNoC2h6U1B8Y29
 UH+e2z+6kVTHwjiTjPacjQ0wnkUctoiT71kMxQ8Q+GFG3fQcz3GFFM17eITnAI/Q
 ws1bPHn1ovxl1GmdQwQK3KnT1cK5/fApaVKQLJiRkUvZ1gCZ3ZcruPlh+qA5zpGf
 +RGPXn/Rm1LA1uCkS4j6REBp6vhcVJoVEVnEGzhovdtJcuJ9erlh5I2zz4UxURnn
 cHH48exFmwC+uBhIyQVuYOYgLU3naztBLFg1/l68sMQFonWjIQ/Hp1B9cIgigwbf
 5+BlT1llvIH3
 =eCcy
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These add a check to avoid recent suspend-to-idle power regression on
  systems with NVMe drives where the PCIe ASPM policy is "performance"
  (or when the kernel is built without ASPM support), fix an issue
  related to frequency limits in the schedutil cpufreq governor and fix
  a mistake related to the PM QoS usage in the cpufreq core introduced
  recently.

  Specifics:

   - Disable NVMe power optimization related to suspend-to-idle added
     recently on systems where PCIe ASPM is not able to put PCIe links
     into low-power states to prevent excess power from being drawn by
     the system while suspended (Rafael Wysocki).

   - Make the schedutil governor handle frequency limits changes
     properly in all cases (Viresh Kumar).

   - Prevent the cpufreq core from treating positive values returned by
     dev_pm_qos_update_request() as errors (Viresh Kumar)"

* tag 'pm-5.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  nvme-pci: Allow PCI bus-level PM to be used if ASPM is disabled
  PCI/ASPM: Add pcie_aspm_enabled()
  cpufreq: schedutil: Don't skip freq update when limits change
  cpufreq: dev_pm_qos_update_request() can return 1 on success
2019-08-16 09:13:16 -07:00
Rafael J. Wysocki
a3ee2477c4 Merge branch 'pm-cpufreq'
* pm-cpufreq:
  cpufreq: schedutil: Don't skip freq update when limits change
  cpufreq: dev_pm_qos_update_request() can return 1 on success
2019-08-16 14:24:51 +02:00
Chuhong Yuan
d30bdfc0ec PM: sleep: Replace strncmp() with str_has_prefix()
Use str_has_prefix() instead of strncmp() which is less
straightforward in decode_state().

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-16 14:18:53 +02:00
Chuhong Yuan
35c35493b0 printk: Replace strncmp() with str_has_prefix()
strncmp(str, const, len) is error-prone because len is easy to have typo.
An example is the hard-coded len has counting error or sizeof(const)
forgets - 1.

So we prefer using newly introduced str_has_prefix() to substitute
such strncmp() to make code better.

Link: http://lkml.kernel.org/r/20190809071034.17279-1-hslester96@gmail.com
Cc: "Steven Rostedt" <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by:  Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: Slightly updated and reformatted the commit message.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-08-16 09:54:08 +02:00
Wei Yongjun
e03250061b btf: fix return value check in btf_vmlinux_init()
In case of error, the function kobject_create_and_add() returns NULL
pointer not ERR_PTR(). The IS_ERR() test in the return value check
should be replaced with NULL test.

Fixes: 341dfcf8d7 ("btf: expose BTF info through sysfs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-15 22:18:17 -07:00
Linus Torvalds
e83b009c5c dma-mapping fixes for 5.3-rc
- fix the handling of the bus_dma_mask in dma_get_required_mask, which
    caused a regression in this merge window (Lucas Stach)
  - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)
  - fix dma_mmap_coherent to not cause page attribute mismatches on
    coherent architectures like x86 (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1UFhILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOjexAAjPKLo4WGBGO1nd0btwXcI9A7jQTQlXrokmorDVzx
 5++GmTUBeEgvUJath5D3qpQTRZXo9Wb9oGMdS5U6bWJB+SbWtErM304t905TJoDM
 Cs7xcB1ZQeG/5OrQ+qGPgQCo6WO1dOl9FpaIptjNm4dn+OYhyO/YA+dgrJDwgkiA
 140RYUWa+Zhq3df4YqP4M4EnezLN1c4uE80wUxVQKDcq59sxCJek0QT0pUAMbdmQ
 /cUd2XSU113o1llmIRUh0Oj6VSEhWKHb+bdb8JfGndLzxvDcXZKl60tikWe6xpy2
 Ue0kkHRk6OPVRIxWkRjt8D+mlrCyNqN6HWx6eBmVnRKHxZ4ia2hYOFuYN9FFLLK+
 kCUlu5P/HUabBedKIxk4rbWITUqcRSviPD2WdnH2RWblvXNSDoSAufYuJ/9IGSoL
 P6a43DVKFesVF/MxeH9Ko8bnxMUO9Zn97GHcQIUplRwaqrnrCEPlvLVf/teswSQG
 C13rTnouZ0FA4z/uV96G6HfGIj87MLe/RovmLCMTeiSKrDpbcO7szP037Km73M+V
 UBmatoYCioVLxBjw3NkxCRc9UpDPdRUu31uVHrAarh4tutUASEWLrb6s9vFlGyED
 zis9IHWtIAYP3VfFtkXdZ7oDlqC/3KdEErHZuT+z4PK3Wj/QtQVfQ8SB79xFMneD
 V2E=
 =Jzmo
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix the handling of the bus_dma_mask in dma_get_required_mask, which
   caused a regression in this merge window (Lucas Stach)

 - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)

 - fix dma_mmap_coherent to not cause page attribute mismatches on
   coherent architectures like x86 (me)

* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix page attributes for dma_mmap_*
  dma-direct: don't truncate dma_required_mask to bus addressing capabilities
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
2019-08-14 10:31:11 -07:00
Jakub Kicinski
708852dcac Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
as well):

        for (i = 1; i <= btf__get_nr_types(btf); i++) {
                t = (struct btf_type *)btf__type_by_id(btf, i);

                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
  <<<<<<< HEAD
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
  =======
                        t->size = sizeof(int);
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                } else if (!has_datasec && btf_is_datasec(t)) {
  >>>>>>> 72ef80b5ee
                        /* replace DATASEC with STRUCT */

Conflict is between the two commits 1d4126c4e1 ("libbpf: sanitize VAR to
conservative 1-byte INT") and b03bc6853c ("libbpf: convert libbpf code to
use new btf helpers"), so we need to pick the sanitation fixup as well as
use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
the following:

  [...]
                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && btf_is_datasec(t)) {
                        /* replace DATASEC with STRUCT */
  [...]

The main changes are:

1) Addition of core parts of compile once - run everywhere (co-re) effort,
   that is, relocation of fields offsets in libbpf as well as exposure of
   kernel's own BTF via sysfs and loading through libbpf, from Andrii.

   More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
   and http://vger.kernel.org/lpc-bpf2018.html#session-2

2) Enable passing input flags to the BPF flow dissector to customize parsing
   and allowing it to stop early similar to the C based one, from Stanislav.

3) Add a BPF helper function that allows generating SYN cookies from XDP and
   tc BPF, from Petar.

4) Add devmap hash-based map type for more flexibility in device lookup for
   redirects, from Toke.

5) Improvements to XDP forwarding sample code now utilizing recently enabled
   devmap lookups, from Jesper.

6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
   and Takshak.

7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.

8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.

9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.

10) Add perf event output helper also for other skb-based program types, from Allan.

11) Fix a co-re related compilation error in selftests, from Yonghong.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 16:24:57 -07:00
Eric Dumazet
cfcdef5e30 rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
Bimodal behavior of rcu_do_batch() is not really suited to Google
applications like gfe servers.

When a process with millions of sockets exits, closing all files
queues two rcu callbacks per socket.

This eventually reaches the point where RCU enters an emergency
mode, where rcu_do_batch() do not return until whole queue is flushed.

Each rcu callback lasts at least 70 nsec, so with millions of
elements, we easily spend more than 100 msec without rescheduling.

Goal of this patch is to avoid the infamous message like following
"need_resched set for > 51999388 ns (52 ticks) without schedule"

We dynamically adjust the number of elements we process, instead
of 10 / INFINITE choices, we use a floor of ~1 % of current entries.

If the number is above 1000, we switch to a time based limit of 3 msec
per batch, adjustable with /sys/module/rcutree/parameters/rcu_resched_ns

Signed-off-by: Eric Dumazet <edumazet@google.com>
[ paulmck: Forward-port and remove debug statements. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
f48fe4c586 rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
When under overload conditions, __call_rcu_nocb_wake() will wake the
no-CBs GP kthread any time the no-CBs CB kthread is asleep or there
are no ready-to-invoke callbacks, but only after a timer delay.  If the
no-CBs GP kthread has a ->nocb_bypass_timer pending, the deferred wakeup
from __call_rcu_nocb_wake() is redundant.  This commit therefore makes
__call_rcu_nocb_wake() avoid posting the redundant deferred wakeup if
->nocb_bypass_timer is pending.  This requires adding a bit of ordering
of timer actions.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
296181d78d rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
Currently, __call_rcu_nocb_wake() advances callbacks each time that it
detects excessive numbers of callbacks, though only if it succeeds in
conditionally acquiring its leaf rcu_node structure's ->lock.  Despite
the conditional acquisition of ->lock, this does increase contention.
This commit therefore avoids advancing callbacks unless there are
callbacks in ->cblist whose grace period has completed and advancing
has not yet been done during this jiffy.

Note that this decision does not take the presence of new callbacks
into account.  That is because on this code path, there will always be
at least one new callback, namely the one we just enqueued.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
1d5a81c18d rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
Currently, nocb_cb_wait() advances callbacks on each pass through its
loop, though only if it succeeds in conditionally acquiring its leaf
rcu_node structure's ->lock.  Despite the conditional acquisition of
->lock, this does increase contention.  This commit therefore avoids
advancing callbacks unless there are callbacks in ->cblist whose grace
period has completed.

Note that nocb_cb_wait() doesn't worry about callbacks that have not
yet been assigned a grace period.  The idea is that the only reason for
nocb_cb_wait() to advance callbacks is to allow it to continue invoking
callbacks.  Time will tell whether this is the correct choice.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
23651d9b96 rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
The rcutree_migrate_callbacks() invokes rcu_advance_cbs() on both the
offlined CPU's ->cblist and that of the surviving CPU, then merges
them.  However, after the merge, and of the offlined CPU's callbacks
that were not ready to be invoked will no longer be associated with a
grace-period number.  This commit therefore invokes rcu_advance_cbs()
one more time on the merged ->cblist in order to assign a grace-period
number to these callbacks.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
273f034065 rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
When callbacks are in full flow, the common case is waiting for a
grace period, and this grace period will normally take a few jiffies to
complete.  It therefore isn't all that helpful for __call_rcu_nocb_wake()
to do a synchronous wakeup in this case.  This commit therefore turns this
into a timer-based deferred wakeup of the no-CBs grace-period kthread.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
f7a81b12d6 rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
This commit causes locking, sleeping, and callback state to be printed
for no-CBs CPUs when the rcutorture writer is delayed sufficiently for
rcutorture to complain.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
6aacd88d17 rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:38:24 -07:00
Paul E. McKenney
d1b222c6be rcu/nocb: Add bypass callback queueing
Use of the rcu_data structure's segmented ->cblist for no-CBs CPUs
takes advantage of unrelated grace periods, thus reducing the memory
footprint in the face of floods of call_rcu() invocations.  However,
the ->cblist field is a more-complex rcu_segcblist structure which must
be protected via locking.  Even though there are only three entities
which can acquire this lock (the CPU invoking call_rcu(), the no-CBs
grace-period kthread, and the no-CBs callbacks kthread), the contention
on this lock is excessive under heavy stress.

This commit therefore greatly reduces contention by provisioning
an rcu_cblist structure field named ->nocb_bypass within the
rcu_data structure.  Each no-CBs CPU is permitted only a limited
number of enqueues onto the ->cblist per jiffy, controlled by a new
nocb_nobypass_lim_per_jiffy kernel boot parameter that defaults to
about 16 enqueues per millisecond (16 * 1000 / HZ).  When that limit is
exceeded, the CPU instead enqueues onto the new ->nocb_bypass.

The ->nocb_bypass is flushed into the ->cblist every jiffy or when
the number of callbacks on ->nocb_bypass exceeds qhimark, whichever
happens first.  During call_rcu() floods, this flushing is carried out
by the CPU during the course of its call_rcu() invocations.  However,
a CPU could simply stop invoking call_rcu() at any time.  The no-CBs
grace-period kthread therefore carries out less-aggressive flushing
(every few jiffies or when the number of callbacks on ->nocb_bypass
exceeds (2 * qhimark), whichever comes first).  This means that the
no-CBs grace-period kthread cannot be permitted to do unbounded waits
while there are callbacks on ->nocb_bypass.  A ->nocb_bypass_timer is
used to provide the needed wakeups.

[ paulmck: Apply Coverity feedback reported by Colin Ian King. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:37:32 -07:00
Paul E. McKenney
eda669a6a2 rcu/nocb: Atomic ->len field in rcu_segcblist structure
Upcoming ->nocb_lock contention-reduction work requires that the
rcu_segcblist structure's ->len field be concurrently manipulated,
but only if there are no-CBs CPUs in the kernel.  This commit
therefore makes this ->len field be an atomic_long_t, but only
in CONFIG_RCU_NOCB_CPU=y kernels.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
faca5c2509 rcu/nocb: Unconditionally advance and wake for excessive CBs
When there are excessive numbers of callbacks, and when either the
corresponding no-CBs callback kthread is asleep or there is no more
ready-to-invoke callbacks, and when least one callback is pending,
__call_rcu_nocb_wake() will advance the callbacks, but refrain from
awakening the corresponding no-CBs grace-period kthread.  However,
because rcu_advance_cbs_nowake() is used, it is possible (if a bit
unlikely) that the needed advancement could not happen due to a grace
period not being in progress.  Plus there will always be at least one
pending callback due to one having just now been enqueued.

This commit therefore attempts to advance callbacks and awakens the
no-CBs grace-period kthread when there are excessive numbers of callbacks
posted and when the no-CBs callback kthread is not in a position to do
anything helpful.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
4fd8c5f153 rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
The sleep/wakeup of the no-CBs grace-period kthreads is synchronized
using the ->nocb_lock of the first CPU corresponding to that kthread.
This commit provides a separate ->nocb_gp_lock for this purpose, thus
reducing contention on ->nocb_lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
523bddd553 rcu/nocb: Reduce contention at no-CBs invocation-done time
Currently, nocb_cb_wait() unconditionally acquires the leaf rcu_node
->lock to advance callbacks when done invoking the previous batch.
It does this while holding ->nocb_lock, which means that contention on
the leaf rcu_node ->lock visits itself on the ->nocb_lock.  This commit
therefore makes this lock acquisition conditional, forgoing callback
advancement when the leaf rcu_node ->lock is not immediately available.
(In this case, the no-CBs grace-period kthread will eventually do any
needed callback advancement.)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
6608c3a027 rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
Currently, __call_rcu_nocb_wake() conditionally acquires the leaf rcu_node
structure's ->lock, and only afterwards does rcu_advance_cbs_nowake()
check to see if it is possible to advance callbacks without potentially
needing to awaken the grace-period kthread.  Given that the no-awaken
check can be done locklessly, this commit reverses the order, so that
rcu_advance_cbs_nowake() is invoked without holding the leaf rcu_node
structure's ->lock and rcu_advance_cbs_nowake() checks the grace-period
state before conditionally acquiring that lock, thus reducing the number
of needless acquistions of the leaf rcu_node structure's ->lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
9fcb09bddd rcu/nocb: Round down for number of no-CBs grace-period kthreads
Currently, when the square root of the number of CPUs is rounded down
by int_sqrt(), this round-down is applied to the number of callback
kthreads per grace-period kthreads.  This makes almost no difference
for large systems, but results in oddities such as three no-CBs
grace-period kthreads for a five-CPU system, which is a bit excessive.
This commit therefore causes the round-down to apply to the number of
no-CBs grace-period kthreads, so that systems with from four to eight
CPUs have only two no-CBs grace period kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
81c0b3d724 rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
A given rcu_data structure's ->nocb_lock can be acquired very frequently
by the corresponding CPU and occasionally by the corresponding no-CBs
grace-period and callbacks kthreads.  In particular, these two kthreads
will have frequent gaps between ->nocb_lock acquisitions that are roughly
a grace period in duration.  This means that any excessive ->nocb_lock
contention will be due to the CPU's acquisitions, and this in turn
enables a very naive contention-avoidance strategy to be quite effective.

This commit therefore modifies rcu_nocb_lock() to first
attempt a raw_spin_trylock(), and to atomically increment a
separate ->nocb_lock_contended across a raw_spin_lock().  This new
->nocb_lock_contended field is checked in __call_rcu_nocb_wake() when
interrupts are enabled, with a spin-wait for contending acquisitions
to complete, thus allowing the kthreads a chance to acquire the lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
7f36ef82e5 rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
Currently, the code provides an extra wakeup for the no-CBs grace-period
kthread if one of its CPUs is generating excessive numbers of callbacks.
But satisfying though it is to wake something up when things are going
south, unless the thing being awakened can actually help solve the
problem, that extra wakeup does nothing but consume additional CPU time,
which is exactly what you don't want during a call_rcu() flood.

This commit therefore avoids doing anything if the corresponding
no-CBs callback kthread is going full tilt.  Otherwise, if advancing
callbacks immediately might help and if the leaf rcu_node structure's
lock is immediately available, this commit invokes a new variant of
rcu_advance_cbs() that advances callbacks only if doing so won't require
awakening the grace-period kthread (not to be confused with any of the
no-CBs grace-period kthreads).

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ce0a825e40 rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
It might be hard to imagine having more than two billion callbacks
queued on a single CPU's ->cblist, but someone will do it sometime.
This commit therefore makes __call_rcu_nocb_wake() handle this situation
by upgrading local variable "len" from "int" to "long".

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
383e133283 rcu/nocb: Never downgrade ->nocb_defer_wakeup in wake_nocb_gp_defer()
Currently, wake_nocb_gp_defer() simply stores whatever waketype was
passed in, which can result in a RCU_NOCB_WAKE_FORCE being downgraded
to RCU_NOCB_WAKE, which could in turn delay callback processing.
This commit therefore adds a check so that wake_nocb_gp_defer() only
updates ->nocb_defer_wakeup when the update increases the forcefulness,
thus avoiding downgrades.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
aeeacd9d84 rcu/nocb: Enable re-awakening under high callback load
The __call_rcu_nocb_wake() function and its predecessors set
->qlen_last_fqs_check to zero for the first callback and to LONG_MAX / 2
for forced reawakenings.  The former can result in a too-quick reawakening
when there are many callbacks ready to invoke and the latter prevents a
second reawakening.  This commit therefore sets ->qlen_last_fqs_check
to the current number of callbacks in both cases.  While in the area,
this commit also moves both assignments under ->nocb_lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
0bd55c6936 rcu/nohz: Turn off tick for offloaded CPUs
Historically, no-CBs CPUs allowed the scheduler-clock tick to be
unconditionally disabled on any transition to idle or nohz_full userspace
execution (see the rcu_needs_cpu() implementations).  Unfortunately,
the checks used by rcu_needs_cpu() are defeated now that no-CBs CPUs
use ->cblist, which might make users of battery-powered devices rather
unhappy.  This commit therefore adds explicit rcu_segcblist_is_offloaded()
checks to return to the historical energy-efficient semantics.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
969974e5c5 rcu/nocb: Suppress uninitialized false-positive in nocb_gp_wait()
Some compilers complain that wait_gp_seq might be used uninitialized
in nocb_gp_wait().  This cannot actually happen because when wait_gp_seq
is uninitialized, needwait_gp must be false, which prevents wait_gp_seq
from being used.  But this analysis is apparently beyond some compilers,
so this commit adds a bogus initialization of wait_gp_seq for the sole
purpose of suppressing the false-positive warning.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
921bb5fad1 rcu/nocb: Use build-time no-CBs check in rcu_pending()
Currently, rcu_pending() invokes rcu_segcblist_is_offloaded() even
in CONFIG_RCU_NOCB_CPU=n kernels, which cannot possibly be offloaded.
Given that rcu_pending() is on a fastpath, it makes sense to check for
CONFIG_RCU_NOCB_CPU=y before invoking rcu_segcblist_is_offloaded().
This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
c1ab99d66e rcu/nocb: Use build-time no-CBs check in rcu_core()
Currently, rcu_core() invokes rcu_segcblist_is_offloaded() each time it
needs to know whether the current CPU is a no-CBs CPU.  Given that it is
not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this repeated runtime invocation wastes CPU.  This commit
therefore created a const on-stack variable to allow this check to be
done only once per rcu_core() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ec5ef87bac rcu/nocb: Use build-time no-CBs check in rcu_do_batch()
Currently, rcu_do_batch() invokes rcu_segcblist_is_offloaded() each time
it needs to know whether the current CPU is a no-CBs CPU.  Given that it
is not possible to change the no-CBs status of a CPU after boot, and given
that it is not possible to even have no-CBs CPUs in CONFIG_RCU_NOCB_CPU=n
kernels, this per-callback invocation wastes CPU.  This commit therefore
created a const on-stack variable to allow this check to be done only
once per rcu_do_batch() invocation.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
4f9c1bc727 rcu/nocb: Remove obsolete nocb_gp_head and nocb_gp_tail fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
2a777de757 rcu/nocb: Remove obsolete nocb_cb_tail and nocb_cb_head fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
c035280f17 rcu/nocb: Remove obsolete nocb_q_count and nocb_q_count_lazy fields
This commit removes the obsolete nocb_q_count and nocb_q_count_lazy
fields, also removing rcu_get_n_cbs_nocb_cpu(), adjusting
rcu_get_n_cbs_cpu(), and making rcutree_migrate_callbacks() once again
disable the ->cblist fields of offline CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
e7f4c5b399 rcu/nocb: Remove obsolete nocb_head and nocb_tail fields
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
5d6742b377 rcu/nocb: Use rcu_segcblist for no-CBs CPUs
Currently the RCU callbacks for no-CBs CPUs are queued on a series of
ad-hoc linked lists, which means that these callbacks cannot benefit
from "drive-by" grace periods, thus suffering needless delays prior
to invocation.  In addition, the no-CBs grace-period kthreads first
wait for callbacks to appear and later wait for a new grace period,
which means that callbacks appearing during a grace-period wait can
be delayed.  These delays increase memory footprint, and could even
result in an out-of-memory condition.

This commit therefore enqueues RCU callbacks from no-CBs CPUs on the
rcu_segcblist structure that is already used by non-no-CBs CPUs.  It also
restructures the no-CBs grace-period kthread to be checking for incoming
callbacks while waiting for grace periods.  Also, instead of waiting
for a new grace period, it waits for the closest grace period that will
cause some of the callbacks to be safe to invoke.  All of these changes
reduce callback latency and thus the number of outstanding callbacks,
in turn reducing the probability of an out-of-memory condition.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
e83e73f5b0 rcu/nocb: Leave ->cblist enabled for no-CBs CPUs
As a first step towards making no-CBs CPUs use the ->cblist, this commit
leaves the ->cblist enabled for these CPUs.  The main reason to make
no-CBs CPUs use ->cblist is to take advantage of callback numbering,
which will reduce the effects of missed grace periods which in turn will
reduce forward-progress problems for no-CBs CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
e6060b41c9 rcu/nocb: Allow lockless use of rcu_segcblist_empty()
Currently, rcu_segcblist_empty() assumes that the callback list is not
being changed by other CPUs, but upcoming changes will require it to
operate locklessly.  This commit therefore adds the needed READ_ONCE()
call, along with the WRITE_ONCE() calls when updating the callback list's
->head field.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
76c6927c3e rcu/nocb: Allow lockless use of rcu_segcblist_restempty()
Currently, rcu_segcblist_restempty() assumes that the callback list
is not being changed by other CPUs, but upcoming changes will require
it to operate locklessly.  This commit therefore adds the needed
READ_ONCE() calls, along with the WRITE_ONCE() calls when updating
the callback list.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ca5c825808 rcu/nocb: Remove deferred wakeup checks for extended quiescent states
The idea behind the checks for extended quiescent states at the end of
__call_rcu_nocb() is to handle cases where call_rcu() is invoked directly
from within an extended quiescent state, for example, from the idle loop.
However, this will result in a timer-mediated deferred wakeup, which
will cause the needed wakeup to happen within a jiffy or thereabouts.
There should be no forward-progress concerns, and if there are, the proper
response is to exit the extended quiescent state while executing the
endless blast of call_rcu() invocations, for example, using RCU_NONIDLE().
Given the more realistic case of an isolated call_rcu() invocation, there
should be no problem.

This commit therefore removes the checks for invoking call_rcu() within
an extended quiescent state for on no-CBs CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
85f69b3212 rcu/nocb: Check for deferred nocb wakeups before nohz_full early exit
In theory, a timer is used to defer wakeups of no-CBs grace-period
kthreads when the wakeup cannot be done safely directly from the
call_rcu().  In practice, the one-jiffy delay is not always consistent
with timely callback invocation under heavy call_rcu() loads.  Therefore,
there are a number of checks for a pending deferred wakeup, including
from the scheduling-clock interrupt.  Unfortunately, this check follows
the rcu_nohz_full_cpu() early exit, which renders it useless on such CPUs.

This commit therefore moves the check for the pending deferred no-CB
wakeup to precede the rcu_nohz_full_cpu() early exit.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
c00045be32 rcu/nocb: Make rcutree_migrate_callbacks() start at leaf rcu_node structure
Because rcutree_migrate_callbacks() is invoked infrequently and because
an exact snapshot of the grace-period state might save some callbacks a
second trip through a grace period, this function has used the root
rcu_node structure.  However, this safe-second-trip optimization
happens only if rcutree_migrate_callbacks() races with grace-period
initialization, so it is not worth the added mental load.  This commit
therefore makes rcutree_migrate_callbacks() start with the leaf rcu_node
structures, as is done elsewhere.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
750d7f6a43 rcu/nocb: Add checks for offloaded callback processing
This commit is a preparatory patch for offloaded callbacks using the
same ->cblist structure used by non-offloaded callbacks.  It therefore
adds rcu_segcblist_is_offloaded() calls where they will be needed when
!rcu_segcblist_is_enabled() no longer flags the offloaded case.  It also
adds checks in rcu_do_batch() to ensure that there are no missed checks:
Currently, it should not be possible for offloaded execution to reach
rcu_do_batch(), though this will change later in this series.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ce5215c134 rcu/nocb: Use separate flag to indicate offloaded ->cblist
RCU callback processing currently uses rcu_is_nocb_cpu() to determine
whether or not the current CPU's callbacks are to be offloaded.
This works, but it is not so good for cache locality.  Plus use of
->cblist for offloaded callbacks will greatly increase the frequency
of these checks.  This commit therefore adds a ->offloaded flag to the
rcu_segcblist structure to provide a more flexible and cache-friendly
means of checking for callback offloading.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
1bb5f9b95a rcu/nocb: Use separate flag to indicate disabled ->cblist
NULLing the RCU_NEXT_TAIL pointer was a clever way to save a byte, but
forward-progress considerations would require that this pointer be both
NULL and non-NULL, which, absent a quantum-computer port of the Linux
kernel, simply won't happen.  This commit therefore creates as separate
->enabled flag to replace the current NULL checks.

[ paulmck: Add include files per 0day test robot and -next. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:34:50 -07:00
Paul E. McKenney
18cd8c93e6 rcu/nocb: Print gp/cb kthread hierarchy if dump_tree
This commit causes the no-CBs grace-period/callback hierarchy to be
printed to the console when the dump_tree kernel boot parameter is set.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
f7c612b000 rcu/nocb: Rename rcu_nocb_leader_stride kernel boot parameter
This commit changes the name of the rcu_nocb_leader_stride kernel
boot parameter to rcu_nocb_gp_stride in order to account for the new
distinction between callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
f7c9a9b664 rcu/nocb: Rename and document no-CB CB kthread sleep trace event
The nocb_cb_wait() function traces a "FollowerSleep" trace_rcu_nocb_wake()
event, which never was documented and is now misleading.  This commit
therefore changes "FollowerSleep" to "CBSleep", documents this, and
updates the documentation for "Sleep" as well.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
0bdc33daef rcu/nocb: Rename rcu_organize_nocb_kthreads() local variable
This commit renames rdp_leader to rdp_gp in order to account for the
new distinction between callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
0d52a6652f rcu/nocb: Rename wake_nocb_leader_defer() to wake_nocb_gp_defer()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
5f675ba6eb rcu/nocb: Rename __wake_nocb_leader() to __wake_nocb_gp()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.  While in the area, it also
updates local variables.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
5d62c08c5f rcu/nocb: Rename wake_nocb_leader() to wake_nocb_gp()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
9fa471a881 rcu/nocb: Rename nocb_follower_wait() to nocb_cb_wait()
This commit adjusts naming to account for the new distinction between
callback and grace-period no-CBs kthreads.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
12f54c3a84 rcu/nocb: Provide separate no-CBs grace-period kthreads
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads
are divided into groups.  The first rcuo kthread to come online in a
given group is that group's leader, and the leader both waits for grace
periods and invokes its CPU's callbacks.  The non-leader rcuo kthreads
only invoke callbacks.

This works well in the real-time/embedded environments for which it was
intended because such environments tend not to generate all that many
callbacks.  However, given huge floods of callbacks, it is possible for
the leader kthread to be stuck invoking callbacks while its followers
wait helplessly while their callbacks pile up.  This is a good recipe
for an OOM, and rcutorture's new callback-flood capability does generate
such OOMs.

One strategy would be to wait until such OOMs start happening in
production, but similar OOMs have in fact happened starting in 2018.
It would therefore be wise to take a more proactive approach.

This commit therefore features per-CPU rcuo kthreads that do nothing
but invoke callbacks.  Instead of having one of these kthreads act as
leader, each group has a separate rcog kthread that handles grace periods
for its group.  Because these rcuog kthreads do not invoke callbacks,
callback floods on one CPU no longer block callbacks from reaching the
rcuc callback-invocation kthreads on other CPUs.

This change does introduce additional kthreads, however:

1.	The number of additional kthreads is about the square root of
	the number of CPUs, so that a 4096-CPU system would have only
	about 64 additional kthreads.  Note that recent changes
	decreased the number of rcuo kthreads by a factor of two
	(CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so
	this still represents a significant improvement on most systems.

2.	The leading "rcuo" of the rcuog kthreads should allow existing
	scripting to affinity these additional kthreads as needed, the
	same as for the rcuop and rcuos kthreads.  (There are no longer
	any rcuob kthreads.)

3.	A state-machine approach was considered and rejected.  Although
	this would allow the rcuo kthreads to continue their dual
	leader/follower roles, it complicates callback invocation
	and makes it more difficult to consolidate rcuo callback
	invocation with existing softirq callback invocation.

The introduction of rcuog kthreads should thus be acceptable.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
6484fe54b5 rcu/nocb: Update comments to prepare for forward-progress work
This commit simply rewords comments to prepare for leader nocb kthreads
doing only grace-period work and callback shuffling.  This will mean
the addition of replacement kthreads to invoke callbacks.  The "leader"
and "follower" thus become less meaningful, so the commit changes no-CB
comments with these strings to "GP" and "CB", respectively.  (Give or
take the usual grammatical transformations.)

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
58bf6f77c6 rcu/nocb: Rename rcu_data fields to prepare for forward-progress work
This commit simply renames rcu_data fields to prepare for leader
nocb kthreads doing only grace-period work and callback shuffling.
This will mean the addition of replacement kthreads to invoke callbacks.
The "leader" and "follower" thus become less meaningful, so the commit
changes no-CB fields with these strings to "gp" and "cb", respectively.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:32:39 -07:00
Paul E. McKenney
31da067023 Merge branches 'consolidate.2019.08.01b', 'fixes.2019.08.12a', 'lists.2019.08.13a' and 'torture.2019.08.01b' into HEAD
consolidate.2019.08.01b: Further consolidation cleanups
fixes.2019.08.12a: Miscellaneous fixes
lists.2019.08.13a: Optional lockdep arguments for RCU list macros
torture.2019.08.01b: Torture-test updates
2019-08-13 14:30:30 -07:00
Andrii Nakryiko
7fd785685e btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinux
Expose kernel's BTF under the name vmlinux to be more uniform with using
kernel module names as file names in the future.

Fixes: 341dfcf8d7 ("btf: expose BTF info through sysfs")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-13 23:19:42 +02:00
Andrii Nakryiko
341dfcf8d7 btf: expose BTF info through sysfs
Make .BTF section allocated and expose its contents through sysfs.

/sys/kernel/btf directory is created to contain all the BTFs present
inside kernel. Currently there is only kernel's main BTF, represented as
/sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
each module will expose its BTF as /sys/kernel/btf/<module-name> file.

Current approach relies on a few pieces coming together:
1. pahole is used to take almost final vmlinux image (modulo .BTF and
   kallsyms) and generate .BTF section by converting DWARF info into
   BTF. This section is not allocated and not mapped to any segment,
   though, so is not yet accessible from inside kernel at runtime.
2. objcopy dumps .BTF contents into binary file and subsequently
   convert binary file into linkable object file with automatically
   generated symbols _binary__btf_kernel_bin_start and
   _binary__btf_kernel_bin_end, pointing to start and end, respectively,
   of BTF raw data.
3. final vmlinux image is generated by linking this object file (and
   kallsyms, if necessary). sysfs_btf.c then creates
   /sys/kernel/btf/kernel file and exposes embedded BTF contents through
   it. This allows, e.g., libbpf and bpftool access BTF info at
   well-known location, without resorting to searching for vmlinux image
   on disk (location of which is not standardized and vmlinux image
   might not be even available in some scenarios, e.g., inside qemu
   during testing).

Alternative approach using .incbin assembler directive to embed BTF
contents directly was attempted but didn't work, because sysfs_proc.o is
not re-compiled during link-vmlinux.sh stage. This is required, though,
to update embedded BTF data (initially empty data is embedded, then
pahole generates BTF info and we need to regenerate sysfs_btf.o with
updated contents, but it's too late at that point).

If BTF couldn't be generated due to missing or too old pahole,
sysfs_btf.c handles that gracefully by detecting that
_binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
/sys/kernel/btf at all.

v2->v3:
- added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H);
- created proper kobject (btf_kobj) for btf directory (Greg K-H);
- undo v2 change of reusing vmlinux, as it causes extra kallsyms pass
  due to initially missing  __binary__btf_kernel_bin_{start/end} symbols;

v1->v2:
- allow kallsyms stage to re-use vmlinux generated by gen_btf();

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-13 16:14:15 +02:00
Mukesh Ojha
511b44f759 rcu: Fix spelling mistake "greate"->"great"
This commit fixes a spelling mistake in file tree_exp.h.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-12 11:25:06 -07:00
Paul E. McKenney
b823cafa75 rcu: Remove redundant "if" condition from rcu_gp_is_expedited()
Because rcu_expedited_nesting is initialized to 1 and not decremented
until just before init is spawned, rcu_expedited_nesting is guaranteed
to be non-zero whenever rcu_scheduler_active == RCU_SCHEDULER_INIT.
This commit therefore removes this redundant "if" equality test.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-08-12 11:25:06 -07:00
Peter Zijlstra
e78a7614f3 idle: Prevent late-arriving interrupts from disrupting offline
Scheduling-clock interrupts can arrive late in the CPU-offline process,
after idle entry and the subsequent call to cpuhp_report_idle_dead().
Once execution passes the call to rcu_report_dead(), RCU is ignoring
the CPU, which results in lockdep complaints when the interrupt handler
uses RCU:

------------------------------------------------------------------------

=============================
WARNING: suspicious RCU usage
5.2.0-rc1+ #681 Not tainted
-----------------------------
kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 2, debug_locks = 1
no locks held by swapper/5/0.

stack backtrace:
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
 <IRQ>
 dump_stack+0x5e/0x8b
 trigger_load_balance+0xa8/0x390
 ? tick_sched_do_timer+0x60/0x60
 update_process_times+0x3b/0x50
 tick_sched_handle+0x2f/0x40
 tick_sched_timer+0x32/0x70
 __hrtimer_run_queues+0xd3/0x3b0
 hrtimer_interrupt+0x11d/0x270
 ? sched_clock_local+0xc/0x74
 smp_apic_timer_interrupt+0x79/0x200
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:delay_tsc+0x22/0x50
Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29
RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64
RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15
RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000
 cpuhp_report_idle_dead+0x31/0x60
 do_idle+0x1d5/0x200
 ? _raw_spin_unlock_irqrestore+0x2d/0x40
 cpu_startup_entry+0x14/0x20
 start_secondary+0x151/0x170
 secondary_startup_64+0xa4/0xb0

------------------------------------------------------------------------

This happens rarely, but can be forced by happen more often by
placing delays in cpuhp_report_idle_dead() following the call to
rcu_report_dead().  With this in place, the following rcutorture
scenario reproduces the problem within a few minutes:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04"

This commit uses the crude but effective expedient of moving the disabling
of interrupts within the idle loop to precede the cpu_is_offline()
check.  It also invokes tick_nohz_idle_stop_tick() instead of
tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock
interrupt.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
[ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-12 11:23:56 -07:00
Christoph Hellwig
4189ff2348 kernel: only define task_struct_whitelist conditionally
If CONFIG_ARCH_TASK_STRUCT_ALLOCATOR is set task_struct_whitelist is
never called, and thus generates a compiler warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20190812065524.19959-5-hch@lst.de
Signed-off-by: Tony Luck <tony.luck@intel.com>
2019-08-12 09:53:28 -07:00
Phil Auld
a46d14eca7 sched/fair: Use rq_lock/unlock in online_fair_sched_group
Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
warning to fire in update_rq_clock. This seems to be caused by onlining
a new fair sched group not using the rq lock wrappers.

  [] rq->clock_update_flags & RQCF_UPDATED
  [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150

  [] Call Trace:
  []  online_fair_sched_group+0x53/0x100
  []  cpu_cgroup_css_online+0x16/0x20
  []  online_css+0x1c/0x60
  []  cgroup_apply_control_enable+0x231/0x3b0
  []  cgroup_mkdir+0x41b/0x530
  []  kernfs_iop_mkdir+0x61/0xa0
  []  vfs_mkdir+0x108/0x1a0
  []  do_mkdirat+0x77/0xe0
  []  do_syscall_64+0x55/0x1d0
  []  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using the wrappers in online_fair_sched_group instead of the raw locking
removes this warning.

[ tglx: Use rq_*lock_irq() ]

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
2019-08-12 14:45:34 +02:00
Linus Torvalds
dcbb4a1539 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Three fixlets for the scheduler:

   - Avoid double bandwidth accounting in the push & pull code

   - Use a sane FIFO priority for the Pressure Stall Information (PSI)
     thread.

   - Avoid permission checks when setting the scheduler params for the
     PSI thread"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Do not require setsched permission from the trigger creator
  sched/psi: Reduce psimon FIFO priority
  sched/deadline: Fix double accounting of rq/running bw in push & pull
2019-08-10 15:48:02 -07:00
Christoph Hellwig
33dcb37cef dma-mapping: fix page attributes for dma_mmap_*
All the way back to introducing dma_common_mmap we've defaulted to mark
the pages as uncached.  But this is wrong for DMA coherent devices.
Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that
flag is only treated special on the alloc side for non-coherent devices.

Introduce a new dma_pgprot helper that deals with the check for coherent
devices so that only the remapping cases ever reach arch_dma_mmap_pgprot
and we thus ensure no aliasing of page attributes happens, which makes
the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the
remaining ones.

Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but
we'll phase it out soon.

Fixes: 64ccc9c033 ("common: dma-mapping: add support for generic dma_mmap_* calls")
Reported-by: Shawn Anastasio <shawn@anastas.io>
Reported-by: Gavin Li <git@thegavinli.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> # arm64
2019-08-10 19:52:45 +02:00
Lucas Stach
d8ad55538a dma-direct: don't truncate dma_required_mask to bus addressing capabilities
The dma required_mask needs to reflect the actual addressing capabilities
needed to handle the whole system RAM. When truncated down to the bus
addressing capabilities dma_addressing_limited() will incorrectly signal
no limitations for devices which are restricted by the bus_dma_mask.

Fixes: b4ebe60632 (dma-direct: implement complete bus_dma_mask handling)
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Tested-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-08-10 19:52:45 +02:00
Christoph Hellwig
cf14be0b41 dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
The new DMA_ATTR_NO_KERNEL_MAPPING needs to actually assign
a dma_addr to work.  Also skip it if the architecture needs
forced decryption handling, as that needs a kernel virtual
address.

Fixes: d98849aff8 (dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
2019-08-10 19:52:45 +02:00
Viresh Kumar
600f5badb7 cpufreq: schedutil: Don't skip freq update when limits change
To avoid reducing the frequency of a CPU prematurely, we skip reducing
the frequency if the CPU had been busy recently.

This should not be done when the limits of the policy are changed, for
example due to thermal throttling. We should always get the frequency
within the new limits as soon as possible.

Trying to fix this by using only one flag, i.e. need_freq_update, can
lead to a race condition where the flag gets cleared without forcing us
to change the frequency at least once. And so this patch introduces
another flag to avoid that race condition.

Fixes: ecd2884291 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
Cc: v4.18+ <stable@vger.kernel.org> # v4.18+
Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-10 13:53:19 +02:00
Rafael J. Wysocki
11f26633cc PM: suspend: Fix platform_suspend_prepare_noirq()
After commit ac9eafbe93 ("ACPI: PM: s2idle: Execute LPS0 _DSM
functions with suspended devices"), a NULL pointer may be dereferenced
if suspend-to-idle is attempted on a platform without "traditional"
suspend support due to invalid fall-through in
platform_suspend_prepare_noirq().

Fix that and while at it add missing braces in platform_resume_noirq().

Fixes: ac9eafbe93 ("ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-10 13:18:06 +02:00
Joel Fernandes (Google)
28875945ba rcu: Add support for consolidated-RCU reader checking
This commit adds RCU-reader checks to list_for_each_entry_rcu() and
hlist_for_each_entry_rcu().  These checks are optional, and are indicated
by a lockdep expression passed to a new optional argument to these two
macros.  If this optional lockdep expression is omitted, these two macros
act as before, checking for an RCU read-side critical section.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Update to eliminate return within macro and update comment. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-09 11:00:35 -07:00
Thiago Jung Bauermann
e740815a97 dma-mapping: Remove dma_check_mask()
sme_active() is an x86-specific function so it's better not to call it from
generic code. Christoph Hellwig mentioned that "There is no reason why we
should have a special debug printk just for one specific reason why there
is a requirement for a large DMA mask.", so just remove dma_check_mask().

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806044919.10622-4-bauerman@linux.ibm.com
2019-08-09 22:52:07 +10:00
Thiago Jung Bauermann
47e5d8f9ed swiotlb: Remove call to sme_active()
sme_active() is an x86-specific function so it's better not to call it from
generic code.

There's no need to mention which memory encryption feature is active, so
just use a more generic message. Besides, other architectures will have
different names for similar technology.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806044919.10622-3-bauerman@linux.ibm.com
2019-08-09 22:52:06 +10:00
Daniel Jordan
ec9c7d1933 padata: initialize pd->cpu with effective cpumask
Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from
cryptodev-2.6.git in an 8-CPU kvm guest...

    # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
    # echo 0 > /sys/devices/system/cpu/cpu1/online
    # echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
    # modprobe tcrypt mode=215

...caused the following crash:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-<snip>
    Workqueue: pencrypt padata_parallel_worker
    RIP: 0010:padata_reorder+0xcb/0x180
    ...
    Call Trace:
     padata_do_serial+0x57/0x60
     pcrypt_aead_enc+0x3a/0x50 [pcrypt]
     padata_parallel_worker+0x9b/0xe0
     process_one_work+0x1b5/0x3f0
     worker_thread+0x4a/0x3c0
     ...

In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask
instead of the effective cpumask, and in this case cpumask_first picked
an offline CPU.

The offline CPU's reorder->list.next is NULL in padata_reorder because
the list wasn't initialized in padata_init_pqueues, which only operates
on CPUs in the effective mask.

Fix by using the effective mask in padata_alloc_pd.

Fixes: 6fc4dbcf02 ("padata: Replace delayed timer with immediate workqueue in padata_reorder")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-08-09 15:13:52 +10:00
Rafael J. Wysocki
ac9eafbe93 ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices
According to Section 3.5 of the "Intel Low Power S0 Idle" document [1],
Function 5 of the LPS0 _DSM is expected to be invoked when the system
configuration matches the criteria for entering the target low-power
state of the platform.  In particular, this means that all devices
should be suspended and in low-power states already when that function
is invoked.

This is not the case currently, however, because Function 5 of the
LPS0 _DSM is invoked by it before the "noirq" phase of device suspend,
which means that some devices may not have been put into low-power
states yet at that point.  That is a consequence of the previous
design of the suspend-to-idle flow that allowed the "noirq" phase of
device suspend and the "noirq" phase of device resume to be carried
out for multiple times while "suspended" (if any spurious wakeup
events were detected) and the point of the LPS0 _DSM Function 5
invocation was chosen so as to call it (and LPS0 _DSM Function 6
analogously) once per suspend-resume cycle (regardless of how many
times the "noirq" phases of device suspend and resume were carried
out while "suspended").

Now that the suspend-to-idle flow has been redesigned to carry out
the "noirq" phases of device suspend and resume once in each cycle,
the code can be reordered to follow the specification that it is
based on more closely.

For this purpose, add ->prepare_late and ->restore_early platform
callbacks for suspend-to-idle, to be executed, respectively, after
the "noirq" phase of suspending devices and before the "noirq"
phase of resuming them and make ACPI use them for the invocation
of LPS0 _DSM functions as appropriate.

While at it, move the LPS0 entry requirements check to be made
before invoking Functions 3 and 5 of the LPS0 _DSM (also once
per cycle) as follows from the specification [1].

Link: https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf # [1]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
2019-08-08 11:26:01 +02:00
Qais Yousef
5c3ceef9ad cpufreq: schedutil: fix equation in comment
scale_irq_capacity() call in schedutil_cpu_util() does

	util *= (max - irq)
	util /= max

But the comment says

	util *= (1 - irq)
	util /= max

Fix the comment to match what the scaling function does.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Rafael J . Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20190802104628.8410-1-qais.yousef@arm.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
67692435c4 sched: Rework pick_next_task() slow-path
Avoid the RETRY_TASK case in the pick_next_task() slow path.

By doing the put_prev_task() early, we get the rt/deadline pull done,
and by testing rq->nr_running we know if we need newidle_balance().

This then gives a stable state to pick a task from.

Since the fast-path is fair only; it means the other classes will
always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
5f2a45fc9e sched: Allow put_prev_task() to drop rq->lock
Currently the pick_next_task() loop is convoluted and ugly because of
how it can drop the rq->lock and needs to restart the picking.

For the RT/Deadline classes, it is put_prev_task() where we do
balancing, and we could do this before the picking loop. Make this
possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
5ba553eff0 sched/fair: Expose newidle_balance()
For pick_next_task_fair() it is the newidle balance that requires
dropping the rq->lock; provided we do put_prev_task() early, we can
also detect the condition for doing newidle early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/9e3eb1859b946f03d7e500453a885725b68957ba.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
03b7fad167 sched: Add task_struct pointer to sched_class::set_curr_task
In preparation of further separating pick_next_task() and
set_curr_task() we have to pass the actual task into it, while there,
rename the thing to better pair with put_prev_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:31 +02:00
Peter Zijlstra
10e7071b2f sched: Rework CPU hotplug task selection
The CPU hotplug task selection is the only place where we used
put_prev_task() on a task that is not current. While looking at that,
it occured to me that we can simplify all that by by using a custom
pick loop.

Since we don't need to put current, we can do away with the fake task
too.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
2019-08-08 09:09:30 +02:00
Peter Zijlstra
f95d4eaee6 sched/{rt,deadline}: Fix set_next_task vs pick_next_task
Because pick_next_task() implies set_curr_task() and some of the
details haven't mattered too much, some of what _should_ be in
set_curr_task() ended up in pick_next_task, correct this.

This prepares the way for a pick_next_task() variant that does not
affect the current state; allowing remote picking.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:30 +02:00
Peter Zijlstra
5feeb7837a sched: Fix kerneldoc comment for ia64_set_curr_task
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/fde3a65ea3091ec6b84dac3c19639f85f452c5d1.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:30 +02:00
Peter Zijlstra
99d84bf8c6 stop_machine: Fix stop_cpus_in_progress ordering
Make sure the entire for loop has stop_cpus_in_progress set.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lwe@gmail.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: mingo@kernel.org
Cc: Phil Auld <pauld@redhat.com>
Cc: Julien Desfossez <jdesfossez@digitalocean.com>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>
Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
2019-08-08 09:09:30 +02:00
Dave Chiluk
de53fd7aed sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74 ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueueu. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d2 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
2019-08-08 09:09:30 +02:00
Peter Zijlstra
139d025cda sched: Clean up active_mm reference counting
The current active_mm reference counting is confusing and sub-optimal.

Rewrite the code to explicitly consider the 4 separate cases:

    user -> user

	When switching between two user tasks, all we need to consider
	is switch_mm().

    user -> kernel

	When switching from a user task to a kernel task (which
	doesn't have an associated mm) we retain the last mm in our
	active_mm. Increment a reference count on active_mm.

  kernel -> kernel

	When switching between kernel threads, all we need to do is
	pass along the active_mm reference.

  kernel -> user

	When switching between a kernel and user task, we must switch
	from the last active_mm to the next mm, hoping of course that
	these are the same. Decrement a reference on the active_mm.

The code keeps a different order, because as you'll note, both 'to
user' cases require switch_mm().

And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: luto@kernel.org
2019-08-08 09:09:30 +02:00
Peter Zijlstra
130d9c331b rcu/tree: Fix SCHED_FIFO params
A rather embarrasing mistake had us call sched_setscheduler() before
initializing the parameters passed to it.

Fixes: 1a763fd7c6 ("rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
2019-08-08 09:09:30 +02:00
Peter Zijlstra
e57d143091 mutex: Fix up mutex_waiter usage
The patch moving bits into mutex.c was a little too much; by also
moving struct mutex_waiter a few less common CONFIGs would no longer
build.

Fixes: 5f35d5a66b ("locking/mutex: Make __mutex_owner static to mutex.c")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-08-08 09:09:25 +02:00
Ming Lei
491beed3b1 genirq/affinity: Create affinity mask for single vector
Since commit c66d4bd110 ("genirq/affinity: Add new callback for
(re)calculating interrupt sets"), irq_create_affinity_masks() returns
NULL in case of single vector. This change has caused regression on some
drivers, such as lpfc.

The problem is that single vector requests can happen in some generic cases:

  1) kdump kernel

  2) irq vectors resource is close to exhaustion.

If in that situation the affinity mask for a single vector is not created,
every caller has to handle the special case.

There is no reason why the mask cannot be created, so remove the check for
a single vector and create the mask.

Fixes: c66d4bd110 ("genirq/affinity: Add new callback for (re)calculating interrupt sets")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190805011906.5020-1-ming.lei@redhat.com
2019-08-08 08:47:55 +02:00
Marc Koderer
653a23ca7e Use kvmalloc in cgroups-v1
Instead of using its own logic for k-/vmalloc rely on
kvmalloc which is actually doing quite the same.

Signed-off-by: Marc Koderer <marc@koderer.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-08-07 11:37:58 -07:00
Marc Zyngier
b977fcf477 irqdomain/debugfs: Use PAs to generate fwnode names
Booting a large arm64 server (HiSi D05) leads to the following
shouting at boot time:

[   20.722132] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.730851] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.739560] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.748267] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.756975] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.765683] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!
[   20.774391] debugfs: File 'irqchip@(____ptrval____)-3' in directory 'domains' already present!

and many more... Evidently, we expect something a bit more informative
than ____ptrval____, and certainly we want all of our domains, not just
the first one.

For that, turn the %p used to generate the fwnode name into something
that won't be repainted (%pa). Given that we've now fixed all users to
pass a pointer to a PA, it will actually do the right thing.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2019-08-07 14:24:54 +01:00
Linus Torvalds
33920f1ec5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Yeah I should have sent a pull request last week, so there is a lot
  more here than usual:

   1) Fix memory leak in ebtables compat code, from Wenwen Wang.

   2) Several kTLS bug fixes from Jakub Kicinski (circular close on
      disconnect etc.)

   3) Force slave speed check on link state recovery in bonding 802.3ad
      mode, from Thomas Falcon.

   4) Clear RX descriptor bits before assigning buffers to them in
      stmmac, from Jose Abreu.

   5) Several missing of_node_put() calls, mostly wrt. for_each_*() OF
      loops, from Nishka Dasgupta.

   6) Double kfree_skb() in peak_usb can driver, from Stephane Grosjean.

   7) Need to hold sock across skb->destructor invocation, from Cong
      Wang.

   8) IP header length needs to be validated in ipip tunnel xmit, from
      Haishuang Yan.

   9) Use after free in ip6 tunnel driver, also from Haishuang Yan.

  10) Do not use MSI interrupts on r8169 chips before RTL8168d, from
      Heiner Kallweit.

  11) Upon bridge device init failure, we need to delete the local fdb.
      From Nikolay Aleksandrov.

  12) Handle erros from of_get_mac_address() properly in stmmac, from
      Martin Blumenstingl.

  13) Handle concurrent rename vs. dump in netfilter ipset, from Jozsef
      Kadlecsik.

  14) Setting NETIF_F_LLTX on mac80211 causes complete breakage with
      some devices, so revert. From Johannes Berg.

  15) Fix deadlock in rxrpc, from David Howells.

  16) Fix Kconfig deps of enetc driver, we must have PHYLIB. From Yue
      Haibing.

  17) Fix mvpp2 crash on module removal, from Matteo Croce.

  18) Fix race in genphy_update_link, from Heiner Kallweit.

  19) bpf_xdp_adjust_head() stopped working with generic XDP when we
      fixes generic XDP to support stacked devices properly, fix from
      Jesper Dangaard Brouer.

  20) Unbalanced RCU locking in rt6_update_exception_stamp_rt(), from
      David Ahern.

  21) Several memory leaks in new sja1105 driver, from Vladimir Oltean"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (214 commits)
  net: dsa: sja1105: Fix memory leak on meta state machine error path
  net: dsa: sja1105: Fix memory leak on meta state machine normal path
  net: dsa: sja1105: Really fix panic on unregistering PTP clock
  net: dsa: sja1105: Use the LOCKEDS bit for SJA1105 E/T as well
  net: dsa: sja1105: Fix broken learning with vlan_filtering disabled
  net: dsa: qca8k: Add of_node_put() in qca8k_setup_mdio_bus()
  net: sched: sample: allow accessing psample_group with rtnl
  net: sched: police: allow accessing police->params with rtnl
  net: hisilicon: Fix dma_map_single failed on arm64
  net: hisilicon: fix hip04-xmit never return TX_BUSY
  net: hisilicon: make hip04_tx_reclaim non-reentrant
  tc-testing: updated vlan action tests with batch create/delete
  net sched: update vlan action for batched events operations
  net: stmmac: tc: Do not return a fragment entry
  net: stmmac: Fix issues when number of Queues >= 4
  net: stmmac: xgmac: Fix XGMAC selftests
  be2net: disable bh with spin_lock in be_process_mcc
  net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
  net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs
  net: bridge: move default pvid init/deinit to NETDEV_REGISTER/UNREGISTER
  ...
2019-08-06 17:11:59 -07:00
Catalin Marinas
63f0c60379 arm64: Introduce prctl() options to control the tagged user addresses ABI
It is not desirable to relax the ABI to allow tagged user addresses into
the kernel indiscriminately. This patch introduces a prctl() interface
for enabling or disabling the tagged ABI with a global sysctl control
for preventing applications from enabling the relaxed ABI (meant for
testing user-space prctl() return error checking without reconfiguring
the kernel). The ABI properties are inherited by threads of the same
application and fork()'ed children but cleared on execve(). A Kconfig
option allows the overall disabling of the relaxed ABI.

The PR_SET_TAGGED_ADDR_CTRL will be expanded in the future to handle
MTE-specific settings like imprecise vs precise exceptions.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-06 18:08:45 +01:00
Suren Baghdasaryan
04e048cf09 sched/psi: Do not require setsched permission from the trigger creator
When a process creates a new trigger by writing into /proc/pressure/*
files, permissions to write such a file should be used to determine whether
the process is allowed to do so or not. Current implementation would also
require such a process to have setsched capability. Setting of psi trigger
thread's scheduling policy is an implementation detail and should not be
exposed to the user level. Remove the permission check by using _nocheck
version of the function.

Suggested-by: Nick Kralevich <nnk@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: lizefan@huawei.com
Cc: mingo@redhat.com
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Cc: dennisszhou@gmail.com
Cc: dennis@kernel.org
Cc: hannes@cmpxchg.org
Cc: axboe@kernel.dk
Link: https://lkml.kernel.org/r/20190730013310.162367-1-surenb@google.com
2019-08-06 12:49:18 +02:00
Peter Zijlstra
14f5c7b46a sched/psi: Reduce psimon FIFO priority
PSI defaults to a FIFO-99 thread, reduce this to FIFO-1.

FIFO-99 is the very highest priority available to SCHED_FIFO and
it not a suitable default; it would indicate the psi work is the
most important work on the machine.

Since Real-Time tasks will have pre-allocated memory and locked it in
place, Real-Time tasks do not care about PSI. All it needs is to be
above OTHER.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2019-08-06 12:49:18 +02:00
Dietmar Eggemann
f4904815f9 sched/deadline: Fix double accounting of rq/running bw in push & pull
{push,pull}_dl_task() always calls {de,}activate_task() with .flags=0
which sets p->on_rq=TASK_ON_RQ_MIGRATING.

{push,pull}_dl_task()->{de,}activate_task()->{de,en}queue_task()->
{de,en}queue_task_dl() calls {sub,add}_{running,rq}_bw() since
p->on_rq==TASK_ON_RQ_MIGRATING.
So {sub,add}_{running,rq}_bw() in {push,pull}_dl_task() is
double-accounting for that task.

Fix it by removing rq/running bw accounting in [push/pull]_dl_task().

Fixes: 7dd7788411 ("sched/core: Unify p->on_rq updates")
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Qais Yousef <qais.yousef@arm.com>
Link: https://lkml.kernel.org/r/20190802145945.18702-2-dietmar.eggemann@arm.com
2019-08-06 12:49:18 +02:00
Mukesh Ojha
a037d26922 locking/mutex: Use mutex flags macro instead of hard code
Use the mutex flag macro instead of hard code value inside
__mutex_owner().

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@redhat.com
Cc: will@kernel.org
Link: https://lkml.kernel.org/r/1564585504-3543-2-git-send-email-mojha@codeaurora.org
2019-08-06 12:49:16 +02:00
Mukesh Ojha
5f35d5a66b locking/mutex: Make __mutex_owner static to mutex.c
__mutex_owner() should only be used by the mutex api's.
So, to put this restiction let's move the __mutex_owner()
function definition from linux/mutex.h to mutex.c file.

There exist functions that uses __mutex_owner() like
mutex_is_locked() and mutex_trylock_recursive(), So
to keep legacy thing intact move them as well and
export them.

Move mutex_waiter structure also to keep it private to the
file.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@redhat.com
Cc: will@kernel.org
Link: https://lkml.kernel.org/r/1564585504-3543-1-git-send-email-mojha@codeaurora.org
2019-08-06 12:49:16 +02:00
Davidlohr Bueso
fce45cd411 locking/rwsem: Check for operations on an uninitialized rwsem
Currently rwsems is the only locking primitive that lacks this
debug feature. Add it under CONFIG_DEBUG_RWSEMS and do the magic
checking in the locking fastpath (trylock) operation such that
we cover all cases. The unlocking part is pretty straightforward.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: mingo@kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: https://lkml.kernel.org/r/20190729044735.9632-1-dave@stgolabs.net
2019-08-06 12:49:15 +02:00
Waiman Long
91d2a812df locking/rwsem: Make handoff writer optimistically spin on owner
When the handoff bit is set by a writer, no other tasks other than
the setting writer itself is allowed to acquire the lock. If the
to-be-handoff'ed writer goes to sleep, there will be a wakeup latency
period where the lock is free, but no one can acquire it. That is less
than ideal.

To reduce that latency, the handoff writer will now optimistically spin
on the owner if it happens to be a on-cpu writer. It will spin until
it releases the lock and the to-be-handoff'ed writer can then acquire
the lock immediately without any delay. Of course, if the owner is not
a on-cpu writer, the to-be-handoff'ed writer will have to sleep anyway.

The optimistic spinning code is also modified to not stop spinning
when the handoff bit is set. This will prevent an occasional setting of
handoff bit from causing a bunch of optimistic spinners from entering
into the wait queue causing significant reduction in throughput.

On a 1-socket 22-core 44-thread Skylake system, the AIM7 shared_memory
workload was run with 7000 users. The throughput (jobs/min) of the
following kernels were as follows:

 1) 5.2-rc6
    - 8,092,486
 2) 5.2-rc6 + tip's rwsem patches
    - 7,567,568
 3) 5.2-rc6 + tip's rwsem patches + this patch
    - 7,954,545

Using perf-record(1), the %cpu time used by rwsem_down_write_slowpath(),
rwsem_down_write_failed() and their callees for the 3 kernels were 1.70%,
5.46% and 2.08% respectively.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: x86@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20190625143913.24154-1-longman@redhat.com
2019-08-06 12:49:15 +02:00
Thiago Jung Bauermann
c8424e776b MODSIGN: Export module signature definitions
IMA will use the module_signature format for append signatures, so export
the relevant definitions and factor out the code which verifies that the
appended signature trailer is valid.

Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
and be able to use mod_check_sig() without having to depend on either
CONFIG_MODULE_SIG or CONFIG_MODULES.

s390 duplicated the definition of struct module_signature so now they can
use the new <linux/module_signature.h> header instead.

Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Acked-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Philipp Rudo <prudo@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-08-05 18:39:56 -04:00
Kalesh Singh
2c8db5bef9 PM/sleep: Expose suspend stats in sysfs
Userspace can get suspend stats from the suspend stats debugfs node.
Since debugfs doesn't have stable ABI, expose suspend stats in
sysfs under /sys/power/suspend_stats.

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-08-05 12:03:18 +02:00
Christoph Hellwig
14c5cebad5 memremap: move from kernel/ to mm/
memremap.c implements MM functionality for ZONE_DEVICE, so it really
should be in the mm/ directory, not the kernel/ one.

Link: http://lkml.kernel.org/r/20190722094143.18387-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03 07:02:01 -07:00
Mauro Carvalho Chehab
68d8681e97 kernel/signal.c: fix a kernel-doc markup
The kernel-doc parser doesn't handle expressions with %foo*.  Instead,
when an asterisk should be part of a constant, it uses an alternative
notation: `foo*`.

Link: http://lkml.kernel.org/r/7f18c2e0b5e39e6b7eb55ddeb043b8b260b49f2d.1563361575.git.mchehab+samsung@kernel.org
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-03 07:02:00 -07:00
Linus Torvalds
234172f6bb add swiotlb support to arm
This fixes a cascade of regressions that originally started with
 the addition of the ia64 port, but only got fatal once we removed
 most uses of block layer bounce buffering in Linux 4.18.
 
 The reason is that while the original i386/PAE code that was the first
 architecture that supported > 4GB of memory without an iommu decided to
 leave bounce buffering to the subsystems, which in those days just mean
 block and networking as no one else consumer arbitrary userspace memory.
 
 Later with ia64, x86_64 and other ports we assumed that either an iommu
 or something that fakes it up ("software IOTLB" in beautiful Intel
 speak) is present and that subsystems can rely on that for dealing with
 addressing limitations in devices.   Except that the ARM LPAE scheme
 that added larger physical address to 32-bit ARM did not follow that
 scheme and thus only worked by chance and only for block and networking
 I/O directly to highmem.
 
 Long story, short fix - add swiotlb support to arm when build for LPAE
 platforms, which actuallys turns out to be pretty trivial with the
 modern dma-direct / swiotlb code to fix the Linux 4.18-ish regression.
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1DFj8LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPFqg/+Oh62VCFCkIK07NAeTq6EmrfHI8I1Wm/SFWPOOB+a
 vm7nMcSG3C8K8PRHzGc6Zk3SC1+RrHghcyKw54yLT1Mhroakv6Um7p2y8S3M4tmZ
 uEg8yYbtzxvuaY9T42s2msZURbBCEELzA2bYbQzgQ1zczRI1zuMI07ssMr91IQ91
 HC1OjAUoxUkp/+2uU/X2k6DvPQLSJSyWvKgbi1bjNpE+FRCKJP+2a2K3psBQuDBe
 aJXiz/kD2L/JNvF/e4c414d5GnGXwtIYs1kbskmnj3LeToS+JjX+6ZcENorpScIP
 c20s/3H6nsb14TFy548rJUlAHdcd9kOdeTw+0oPUliNLCogGs6FKNU4N5gVAo+bC
 AWDP0wMHMWkrVz6lQL9PR78IHrHOxFYS5/uHsqqdKo5YTsgaHnwKEiPxX1aiKQ67
 ovUrOnGRo4R9Y4YwD+BbHY9qw9jFMqazBdLWMivK5NxqltsahOug8w2emTFfXzQn
 m4APJYa0RVJA4mkh3ejcci5qHyyzPOjslyIJn7eaJPV2rknkxRn9UngkgJLnzHfc
 +lKiD1zaRy82nV4auPjYRiOdAoQN40YFB/RT16OVkjkT+jJEE2UAMjqh2SRlRusp
 Ce8vK7pw6VpDNGJRQveQA+1n9OR/jl0Jf8R7GFRrf9c/bM1J8GErJ6xS/EwNPrgI
 5dE=
 =D6Uy
 -----END PGP SIGNATURE-----

Merge tag 'arm-swiotlb-5.3' of git://git.infradead.org/users/hch/dma-mapping

Pull arm swiotlb support from Christoph Hellwig:
 "This fixes a cascade of regressions that originally started with the
  addition of the ia64 port, but only got fatal once we removed most
  uses of block layer bounce buffering in Linux 4.18.

  The reason is that while the original i386/PAE code that was the first
  architecture that supported > 4GB of memory without an iommu decided
  to leave bounce buffering to the subsystems, which in those days just
  mean block and networking as no one else consumed arbitrary userspace
  memory.

  Later with ia64, x86_64 and other ports we assumed that either an
  iommu or something that fakes it up ("software IOTLB" in beautiful
  Intel speak) is present and that subsystems can rely on that for
  dealing with addressing limitations in devices. Except that the ARM
  LPAE scheme that added larger physical address to 32-bit ARM did not
  follow that scheme and thus only worked by chance and only for block
  and networking I/O directly to highmem.

  Long story, short fix - add swiotlb support to arm when build for LPAE
  platforms, which actuallys turns out to be pretty trivial with the
  modern dma-direct / swiotlb code to fix the Linux 4.18-ish regression"

* tag 'arm-swiotlb-5.3' of git://git.infradead.org/users/hch/dma-mapping:
  arm: use swiotlb for bounce buffering on LPAE configs
  dma-mapping: check pfn validity in dma_common_{mmap,get_sgtable}
2019-08-02 08:44:33 -07:00
Linus Torvalds
35fca9f8a9 dma-mapping regression fixes for 5.3
- fix alignment issues introduced in the CMA allocation rework
    (Nicolin Chen)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1DE+YLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOrkA//Y7YxsJ9MaI0DNu9gYbYHg9u4ORBWXJ4fN67g2AUe
 rdrHdEHyd4uYduy4Ggi65oknfMZms6xYHGaFr1iDforBmOk0CXfvovocHzJkcV/R
 gtHSkp0L+SVmsLrWFXjd7pbkVhBziLEVrFlw07FbtBlNIeS2VvcAnU+CUcTKpxNi
 7MHhCqxTjVuUZ/qRKyyAnKHGoLCdLqTvpaf1uJq9ca838I/9E5UqitaxL/72G7ab
 q/fe93d94Xj3QNk+ekim6xBSD82VPU+OnFUf+f5dELDwyhgI0LAtz6iL8gH+NnK1
 P9cIIs2sFyBLXRQEaRXF5KhA97sjlWLioXYWs/AxvCphDeb1Zk4u3uGn0bGt90fQ
 g8DryY++nVo6sKpFsaNN7RQ9w/LfxejIcf0hVbNfH6tP8KDO19ds/05kE4O2LUC2
 gLOmPMt+dIOJlBQY0fUNrZN/IH6u60LnULmCWDiy7iY7VBJOf+H3zXM0UAJ+XEbs
 l2OG5vxkQ4hnFZVD1csNRd9gKYyjhrqOA0VssopgdBS53/seYMNopSbQsMTdp8J7
 V3c7Rozz3f62pwxJ7Jd7AwCgpvw8zHOESb5WOzi5DEmxAqRaJQ80H2DdAvQc3orL
 x0SummHKX2mY0cJdrFFbXkGoYt3sjJ+J6P+0CP11UIEIqtw4Zfa2hyWmbs8Q9SFt
 KYg=
 =rM5q
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-3' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping regression fixes from Christoph Hellwig:
 "Two related regression fixes for changes from this merge window to fix
  alignment issues introduced in the CMA allocation rework (Nicolin
  Chen)"

* tag 'dma-mapping-5.3-3' of git://git.infradead.org/users/hch/dma-mapping:
  dma-contiguous: page-align the size in dma_free_contiguous()
  dma-contiguous: do not overwrite align in dma_alloc_contiguous()
2019-08-02 08:41:11 -07:00
Paul E. McKenney
60013d5d2b rcutorture: Aggressive forward-progress tests shouldn't block shutdown
The more aggressive forward-progress tests can interfere with rcutorture
shutdown, resulting in false-positive diagnostics.  This commit therefore
ends any such tests 30 seconds prior to shutdown.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Joel Fernandes (Google)
77e9752ce6 rcuperf: Make rcuperf kernel test more robust for !expedited mode
It is possible that the rcuperf kernel test runs concurrently with init
starting up.  During this time, the system is running all grace periods
as expedited.  However, rcuperf can also be run for normal GP tests.
Right now, it depends on a holdoff time before starting the test to
ensure grace periods start later. This works fine with the default
holdoff time however it is not robust in situations where init takes
greater than the holdoff time to finish running. Or, as in my case:

I modified the rcuperf test locally to also run a thread that did
preempt disable/enable in a loop. This had the effect of slowing down
init. The end result was that the "batches:" counter in rcuperf was 0
causing a division by 0 error in the results. This counter was 0 because
only expedited GPs seem to happen, not normal ones which led to the
rcu_state.gp_seq counter remaining constant across grace periods which
unexpectedly happen to be expedited. The system was running expedited
RCU all the time because rcu_unexpedited_gp() would not have run yet
from init.  In other words, the test would concurrently with init
booting in expedited GP mode.

To fix this properly, this commit waits until system_state is set to
SYSTEM_RUNNING before starting the test.  This change is made just
before kernel_init() invokes rcu_end_inkernel_boot(), and this latter
is what turns off boot-time expediting of RCU grace periods.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Denis Efremov
21f57546ce torture: Remove exporting of internal functions
The functions torture_onoff_cleanup() and torture_shuffle_cleanup()
are declared static and marked EXPORT_SYMBOL_GPL(), which is at best an
odd combination.  Because these functions are not used outside of the
kernel/torture.c file they are defined in, this commit removes their
EXPORT_SYMBOL_GPL() marking.

Fixes: cc47ae0830 ("rcutorture: Abstract torture-test cleanup")
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Paul E. McKenney
bd1bfc51a3 rcutorture: Emulate userspace sojourn during call_rcu() floods
During an actual call_rcu() flood, there would be frequent trips to
userspace (in-kernel call_rcu() floods must be otherwise housebroken).
Userspace execution allows a great many things to interrupt execution,
and rcutorture needs to also allow such interruptions.  This commit
therefore causes call_rcu() floods to occasionally invoke schedule(),
thus preventing spurious rcutorture failures due to other parts of the
kernel becoming irate at the call_rcu() flood events.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Xiao Yang
b3f3886c59 rcuperf: Fix perf_type module-parameter description
The rcu_bh rcuperf type was removed by commit 620d246065cd("rcuperf:
Remove the "rcu_bh" and "sched" torture types"), but it lives on in the
MODULE_PARM_DESC() of perf_type.  This commit therefore changes that
module-parameter description to substitute srcu for rcu_bh.

Signed-off-by: Xiao Yang <ice_yangxiao@163.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:30:22 -07:00
Joel Fernandes (Google)
9147089bee rcu: Remove redundant debug_locks check in rcu_read_lock_sched_held()
The debug_locks flag can never be true at the end of
rcu_read_lock_sched_held() because it is already checked by the earlier
call todebug_lockdep_rcu_enabled().   This commit therefore removes this
redundant check.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:17:01 -07:00
Joel Fernandes (Google)
0a5b99f578 treewide: Rename rcu_dereference_raw_notrace() to _check()
The rcu_dereference_raw_notrace() API name is confusing.  It is equivalent
to rcu_dereference_raw() except that it also does sparse pointer checking.

There are only a few users of rcu_dereference_raw_notrace(). This patches
renames all of them to be rcu_dereference_raw_check() with the "_check()"
indicating sparse checking.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Fix checkpatch warnings about parentheses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:16:21 -07:00
Byungchul Park
3545832fc2 rcu: Change return type of rcu_spawn_one_boost_kthread()
The return value of rcu_spawn_one_boost_kthread() is not used any longer.
This commit therefore changes its return type from int to void, and
removes the cast to void from its callers.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
7e210a653e srcu: Avoid srcutorture security-based pointer obfuscation
Because pointer output is now obfuscated, and because what you really
want to know is whether or not the callback lists are empty, this commit
replaces the srcu_data structure's head callback pointer printout with
a single character that is "." is the callback list is empty or "C"
otherwise.

This is the only remaining user of rcu_segcblist_head(), so this
commit also removes this function's definition.  It also turns out that
rcu_segcblist_tail() no longer has any callers, so this commit removes
that function's definition while in the area.  They were both marked
"Interim", and their end has come.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
fbad01af8c rcu: Add destroy_work_on_stack() to match INIT_WORK_ONSTACK()
The synchronize_rcu_expedited() function has an INIT_WORK_ONSTACK(),
but lacks the corresponding destroy_work_on_stack().  This commit
therefore adds destroy_work_on_stack().

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
cdc694b235 rcu: Add kernel parameter to dump trace after RCU CPU stall warning
This commit adds a rcu_cpu_stall_ftrace_dump kernel boot parameter, that,
when set, causes the trace buffer to be dumped after an RCU CPU stall
warning is printed.  This kernel boot parameter is disabled by default,
maintaining compatibility with previous behavior.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
1f3ebc8253 rcu: Restore barrier() to rcu_read_lock() and rcu_read_unlock()
Commit bb73c52bad ("rcu: Don't disable preemption for Tiny and Tree
RCU readers") removed the barrier() calls from rcu_read_lock() and
rcu_write_lock() in CONFIG_PREEMPT=n&&CONFIG_PREEMPT_COUNT=n kernels.
Within RCU, this commit was OK, but it failed to account for things like
get_user() that can pagefault and that can be reordered by the compiler.
Lack of the barrier() calls in rcu_read_lock() and rcu_read_unlock()
can cause these page faults to migrate into RCU read-side critical
sections, which in CONFIG_PREEMPT=n kernels could result in too-short
grace periods and arbitrary misbehavior.  Please see commit 386afc9114
("spinlocks and preemption points need to be at least compiler barriers")
and Linus's commit 66be4e66a7 ("rcu: locking and unlocking need to
always be at least barriers"), this last of which restores the barrier()
call to both rcu_read_lock() and rcu_read_unlock().

This commit removes barrier() calls that are no longer needed given that
the addition of them in Linus's commit noted above.  The combination of
this commit and Linus's commit effectively reverts commit bb73c52bad
("rcu: Don't disable preemption for Tiny and Tree RCU readers").

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Fix embarrassing typo located by Alan Stern. ]
2019-08-01 14:05:51 -07:00
Paul E. McKenney
b55bd58555 time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
The TASKS03 and TREE04 rcutorture scenarios produce the following
lockdep complaint:

------------------------------------------------------------------------

================================
WARNING: inconsistent lock state
5.2.0-rc1+ #513 Not tainted
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
{IN-HARDIRQ-W} state was registered at:
  lock_acquire+0xb0/0x1c0
  _raw_spin_lock_irqsave+0x3c/0x50
  tick_broadcast_switch_to_oneshot+0xd/0x40
  tick_switch_to_oneshot+0x4f/0xd0
  hrtimer_run_queues+0xf3/0x130
  run_local_timers+0x1c/0x50
  update_process_times+0x1c/0x50
  tick_periodic+0x26/0xc0
  tick_handle_periodic+0x1a/0x60
  smp_apic_timer_interrupt+0x80/0x2a0
  apic_timer_interrupt+0xf/0x20
  _raw_spin_unlock_irqrestore+0x4e/0x60
  rcu_nocb_gp_kthread+0x15d/0x590
  kthread+0xf3/0x130
  ret_from_fork+0x3a/0x50
irq event stamp: 171
hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
softirqs last disabled at (0): [<0000000000000000>] 0x0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(tick_broadcast_lock);
  <Interrupt>
    lock(tick_broadcast_lock);

 *** DEADLOCK ***

1 lock held by migration/1/14:
 #0: (____ptrval____) (clockevents_lock){+.+.}, at: tick_offline_cpu+0xf/0x30

stack backtrace:
CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.2.0-rc1+ #513
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
Call Trace:
 dump_stack+0x5e/0x8b
 print_usage_bug+0x1fc/0x216
 ? print_shortest_lock_dependencies+0x1b0/0x1b0
 mark_lock+0x1f2/0x280
 __lock_acquire+0x1e0/0x18f0
 ? __lock_acquire+0x21b/0x18f0
 ? _raw_spin_unlock_irqrestore+0x4e/0x60
 lock_acquire+0xb0/0x1c0
 ? tick_broadcast_offline+0xf/0x70
 _raw_spin_lock+0x33/0x40
 ? tick_broadcast_offline+0xf/0x70
 tick_broadcast_offline+0xf/0x70
 tick_offline_cpu+0x16/0x30
 take_cpu_down+0x7d/0xa0
 multi_cpu_stop+0xa2/0xe0
 ? cpu_stop_queue_work+0xc0/0xc0
 cpu_stopper_thread+0x6d/0x100
 smpboot_thread_fn+0x169/0x240
 kthread+0xf3/0x130
 ? sort_range+0x20/0x20
 ? kthread_cancel_delayed_work_sync+0x10/0x10
 ret_from_fork+0x3a/0x50

------------------------------------------------------------------------

To reproduce, run the following rcutorture test:

        tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"

It turns out that tick_broadcast_offline() was an innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
and leaving them enabled on return.

Some debugging code showed that the culprit was sched_cpu_dying().
It had irqs enabled after return from sched_tick_stop().  Which in turn
had irqs enabled after return from cancel_delayed_work_sync().  Which is a
wrapper around __cancel_work_timer().  Which can sleep in the case where
something else is concurrently trying to cancel the same delayed work,
and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
idea when you are invoked from take_cpu_down(), regardless of the state
you leave interrupts in upon return.

Code inspection located no reason why the delayed work absolutely
needed to be canceled from sched_tick_stop():  The work is not
bound to the outgoing CPU by design, given that the whole point is
to collect statistics without disturbing the outgoing CPU.

This commit therefore simply drops the cancel_delayed_work_sync() from
sched_tick_stop().  Instead, a new ->state field is added to the tick_work
structure so that the delayed-work handler function sched_tick_remote()
can avoid reposting itself.  A cpu_is_offline() check is also added to
sched_tick_remote() to avoid mucking with the state of an offlined CPU
(though it does appear safe to do so).  The sched_tick_start() and
sched_tick_stop() functions also update ->state, and sched_tick_start()
also schedules the delayed work if ->state indicates that it is not
already in flight.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2019-08-01 14:05:51 -07:00
Paul E. McKenney
519248f36d lockdep: Make print_lock() address visible
Security is a wonderful thing, but so is the ability to debug based on
lockdep warnings.  This commit therefore makes lockdep lock addresses
visible in the clear.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:05:51 -07:00
Joel Fernandes (Google)
cb4dbbfaa1 rcu: Simplify rcu_note_context_switch exit from critical section
Because __rcu_read_unlock() can be preempted just before the call to
rcu_read_unlock_special(), it is possible for a task to be preempted just
before it would have fully exited its RCU read-side critical section.
This would result in a needless extension of that critical section until
that task was resumed, which might in turn result in a needlessly
long grace period, needless RCU priority boosting, and needless
force-quiescent-state actions.  Therefore, rcu_note_context_switch()
invokes __rcu_read_unlock() followed by rcu_preempt_deferred_qs() when
it detects this situation.  This action by rcu_note_context_switch()
ends the RCU read-side critical section immediately.

Of course, once the task resumes, it will invoke rcu_read_unlock_special()
redundantly.  This is harmless because the fact that a preemption
happened means that interrupts, preemption, and softirqs cannot
have been disabled, so there would be no deferred quiescent state.
While ->rcu_read_lock_nesting remains less than zero, none of the
->rcu_read_unlock_special.b bits can be set, and they were all zeroed by
the call to rcu_note_context_switch() at task-preemption time.  Therefore,
setting ->rcu_read_unlock_special.b.exp_hint to false has no effect.

Therefore, the extra call to rcu_preempt_deferred_qs_irqrestore()
would return immediately.  With one possible exception, which is
if an expedited grace period started just as the task was being
resumed, which could leave ->exp_deferred_qs set.  This will cause
rcu_preempt_deferred_qs_irqrestore() to invoke rcu_report_exp_rdp(),
reporting the quiescent state, just as it should.  (Such an expedited
grace period won't affect the preemption code path due to interrupts
having already been disabled.)

But when rcu_note_context_switch() invokes __rcu_read_unlock(), it
is doing so with preemption disabled, hence __rcu_read_unlock() will
unconditionally defer the quiescent state, only to immediately invoke
rcu_preempt_deferred_qs(), thus immediately reporting the deferred
quiescent state.  It turns out to be safe (and faster) to instead
just invoke rcu_preempt_deferred_qs() without the __rcu_read_unlock()
middleman.

Because this is the invocation during the preemption (as opposed to
the invocation just after the resume), at least one of the bits in
->rcu_read_unlock_special.b must be set and ->rcu_read_lock_nesting
must be negative.  This means that rcu_preempt_need_deferred_qs() must
return true, avoiding the early exit from rcu_preempt_deferred_qs().
Thus, rcu_preempt_deferred_qs_irqrestore() will be invoked immediately,
as required.

This commit therefore simplifies the CONFIG_PREEMPT=y version of
rcu_note_context_switch() by removing the "else if" branch of its
"if" statement.  This change means that all callers that would have
invoked rcu_read_unlock_special() followed by rcu_preempt_deferred_qs()
will now simply invoke rcu_preempt_deferred_qs(), thus avoiding the
rcu_read_unlock_special() middleman when __rcu_read_unlock() is preempted.

Cc: rcu@vger.kernel.org
Cc: kernel-team@android.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Paul E. McKenney
87446b4874 rcu: Make rcu_read_unlock_special() checks match raise_softirq_irqoff()
Threaded interrupts provide additional interesting interactions between
RCU and raise_softirq() that can result in self-deadlocks in v5.0-2 of
the Linux kernel.  These self-deadlocks can be provoked in susceptible
kernels within a few minutes using the following rcutorture command on
an 8-CPU system:

tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --configs "TREE03" --bootargs "threadirqs"

Although post-v5.2 RCU commits have at least greatly reduced the
probability of these self-deadlocks, this was entirely by accident.
Although this sort of accident should be rowdily celebrated on those
rare occasions when it does occur, such celebrations should be quickly
followed by a principled patch, which is what this patch purports to be.

The key point behind this patch is that when in_interrupt() returns
true, __raise_softirq_irqoff() will never attempt a wakeup.  Therefore,
if in_interrupt(), calls to raise_softirq*() are both safe and
extremely cheap.

This commit therefore replaces the in_irq() calls in the "if" statement
in rcu_read_unlock_special() with in_interrupt() and simplifies the
"if" condition to the following:

	if (irqs_were_disabled && use_softirq &&
	    (in_interrupt() ||
	     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
		raise_softirq_irqoff(RCU_SOFTIRQ);
	} else {
		/* Appeal to the scheduler. */
	}

The rationale behind the "if" condition is as follows:

1.	irqs_were_disabled:  If interrupts are enabled, we should
	instead appeal to the scheduler so as to let the upcoming
	irq_enable()/local_bh_enable() do the rescheduling for us.
2.	use_softirq: If this kernel isn't using softirq, then
	raise_softirq_irqoff() will be unhelpful.
3.	a.	in_interrupt(): If this returns true, the subsequent
		call to raise_softirq_irqoff() is guaranteed not to
		do a wakeup, so that call will be both very cheap and
		quite safe.
	b.	Otherwise, if !in_interrupt() the raise_softirq_irqoff()
		might do a wakeup, which is expensive and, in some
		contexts, unsafe.
		i.	The "exp" (an expedited RCU grace period is being
			blocked) says that the wakeup is worthwhile, and:
		ii.	The !.deferred_qs says that scheduler locks
			cannot be held, so the wakeup will be safe.

Backporting this requires considerable care, so no auto-backport, please!

Fixes: 05f415715c ("rcu: Speed up expedited GPs when interrupting RCU reader")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Paul E. McKenney
d143b3d1cd rcu: Simplify rcu_read_unlock_special() deferred wakeups
In !use_softirq runs, we clearly cannot rely on raise_softirq() and
its lightweight bit setting, so we must instead do some form of wakeup.
In the absence of a self-IPI when interrupts are disabled, these wakeups
can be delayed until the next interrupt occurs.  This means that calling
invoke_rcu_core() doesn't actually do any expediting.

In this case, it is better to take the "else" clause, which sets the
current CPU's resched bits and, if there is an expedited grace period
in flight, uses IRQ-work to force the needed self-IPI.  This commit
therefore removes the "else if" clause that calls invoke_rcu_core().

Reported-by: Scott Wood <swood@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-01 14:04:20 -07:00
Stanislav Fomichev
9babe825da bpf: always allocate at least 16 bytes for setsockopt hook
Since we always allocate memory, allocate just a little bit more
for the BPF program in case it need to override user input with
bigger value. The canonical example is TCP_CONGESTION where
input string might be too small to override (nv -> bbr or cubic).

16 bytes are chosen to match the size of TCP_CA_NAME_MAX and can
be extended in the future if needed.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-08-01 13:55:52 -07:00
Christian Brauner
3695eae5fe
pidfd: add P_PIDFD to waitid()
This adds the P_PIDFD type to waitid().
One of the last remaining bits for the pidfd api is to make it possible
to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
that want to use the pidfd api to exclusively manage processes can do so
now.

One of the things this will unblock in the future is the ability to make
it possible to retrieve the exit status via waitid(P_PIDFD) for
non-parent processes if handed a _suitable_ pidfd that has this feature
set. This is similar to what you can do on FreeBSD with kqueue(). It
might even end up being possible to wait on a process as a non-parent if
an appropriate property is enabled on the pidfd.

With P_PIDFD no scoping of the process identified by the pidfd is
possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
rely on pid-based wait*() syscalls for now.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io
2019-08-01 21:49:46 +02:00
Sebastian Andrzej Siewior
5d99b32a00 posix-timers: Move rcu_head out of it union
Timer deletion on PREEMPT_RT is prone to priority inversion and live
locks. The hrtimer code has a synchronization mechanism for this. Posix CPU
timers will grow one.

But that mechanism cannot be invoked while holding the k_itimer lock
because that can deadlock against the running timer callback. So the lock
must be dropped which allows the timer to be freed.

The timer free can be prevented by taking RCU readlock before dropping the
lock, but because the rcu_head is part of the 'it' union a concurrent free
will overwrite the hrtimer on which the task is trying to synchronize.

Move the rcu_head out of the union to prevent this.

[ tglx: Fixed up kernel-doc. Rewrote changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.965541887@linutronix.de
2019-08-01 20:51:25 +02:00
Thomas Gleixner
6945e5c2ab posix-timers: Rework cancel retry loops
As a preparatory step for adding the PREEMPT RT specific synchronization
mechanism to wait for a running timer callback, rework the timer cancel
retry loops so they call a common function. This allows trivial
substitution in one place.

Originally-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.874901027@linutronix.de
2019-08-01 20:51:24 +02:00
Thomas Gleixner
21670ee44f posix-timers: Cleanup the flag/flags confusion
do_timer_settime() has a 'flags' argument and uses 'flag' for the interrupt
flags, which is confusing at best.

Rename the argument so 'flags' can be used for interrupt flags as usual.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.782664411@linutronix.de
2019-08-01 20:51:24 +02:00
Anna-Maria Gleixner
c7e6d704a0 itimers: Prepare for PREEMPT_RT
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.

As a benefit the retry loop gains the missing cpu_relax() on !RT.

[ tglx: Split out of combo patch ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.690771827@linutronix.de
2019-08-01 20:51:24 +02:00
Anna-Maria Gleixner
51ae33092b alarmtimer: Prepare for PREEMPT_RT
Use the hrtimer_cancel_wait_running() synchronization mechanism to prevent
priority inversion and live locks on PREEMPT_RT.

[ tglx: Split out of combo patch ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190730223828.508744705@linutronix.de
2019-08-01 20:51:23 +02:00
Juri Lelli
850377a875 sched/deadline: Ensure inactive_timer runs in hardirq context
SCHED_DEADLINE inactive timer needs to run in hardirq context (as
dl_task_timer already does) on PREEMPT_RT

Change the mode to HRTIMER_MODE_REL_HARD.

[ tglx: Fixed up the start site, so mode debugging works ]

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190731103715.4047-1-juri.lelli@redhat.com
2019-08-01 20:51:22 +02:00
Anna-Maria Gleixner
030dcdd197 timers: Prepare support for PREEMPT_RT
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted.  If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling del_timer_sync() can lead to two issues:

  - If the caller is on a remote CPU then it has to spin wait for the timer
    handler to complete. This can result in unbound priority inversion.

  - If the caller originates from the task which preempted the timer
    handler on the same CPU, then spin waiting for the timer handler to
    complete is never going to end.

To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If del_timer_sync() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.

This addresses both the priority inversion and the life lock issues.

This mechanism is not used for timers which are marked IRQSAFE as for those
preemption is disabled accross the callback and therefore this situation
cannot happen. The callbacks for such timers need to be individually
audited for RT compliance.

The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
del_timer_sync() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.

As the softirq thread can be preempted with PREEMPT_RT=y, the SMP variant
of del_timer_sync() needs to be used on UP as well.

[ tglx: Refactored it for mainline ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.832418500@linutronix.de
2019-08-01 20:51:22 +02:00
Anna-Maria Gleixner
f61eff83ce hrtimer: Prepare support for PREEMPT_RT
When PREEMPT_RT is enabled, the soft interrupt thread can be preempted.  If
the soft interrupt thread is preempted in the middle of a timer callback,
then calling hrtimer_cancel() can lead to two issues:

  - If the caller is on a remote CPU then it has to spin wait for the timer
    handler to complete. This can result in unbound priority inversion.

  - If the caller originates from the task which preempted the timer
    handler on the same CPU, then spin waiting for the timer handler to
    complete is never going to end.

To avoid these issues, add a new lock to the timer base which is held
around the execution of the timer callbacks. If hrtimer_cancel() detects
that the timer callback is currently running, it blocks on the expiry
lock. When the callback is finished, the expiry lock is dropped by the
softirq thread which wakes up the waiter and the system makes progress.

This addresses both the priority inversion and the life lock issues.

The same issue can happen in virtual machines when the vCPU which runs a
timer callback is scheduled out. If a second vCPU of the same guest calls
hrtimer_cancel() it will spin wait for the other vCPU to be scheduled back
in. The expiry lock mechanism would avoid that. It'd be trivial to enable
this when paravirt spinlocks are enabled in a guest, but it's not clear
whether this is an actual problem in the wild, so for now it's an RT only
mechanism.

[ tglx: Refactored it for mainline ]

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.737767218@linutronix.de
2019-08-01 20:51:22 +02:00
Sebastian Andrzej Siewior
1842f5a427 hrtimer: Determine hard/soft expiry mode for hrtimer sleepers on RT
On PREEMPT_RT enabled kernels hrtimers which are not explicitely marked for
hard interrupt expiry mode are moved into soft interrupt context either for
latency reasons or because the hrtimer callback takes regular spinlocks or
invokes other functions which are not suitable for hard interrupt context
on PREEMPT_RT.

The hrtimer_sleeper callback is RT compatible in hard interrupt context,
but there is a latency concern: Untrusted userspace can spawn many threads
which arm timers for the same expiry time on the same CPU. On expiry that
causes a latency spike due to the wakeup of a gazillion threads.

OTOH, priviledged real-time user space applications rely on the low latency
of hard interrupt wakeups. These syscall related wakeups are all based on
hrtimer sleepers.

If the current task is in a real-time scheduling class, mark the mode for
hard interrupt expiry.

[ tglx: Split out of a larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.645792403@linutronix.de
2019-08-01 20:51:22 +02:00
Sebastian Andrzej Siewior
f5c2f0215e hrtimer: Move unmarked hrtimers to soft interrupt expiry on RT
On PREEMPT_RT not all hrtimers can be expired in hard interrupt context
even if that is perfectly fine on a PREEMPT_RT=n kernel, e.g. because they
take regular spinlocks. Also for latency reasons PREEMPT_RT tries to defer
most hrtimers' expiry into softirq context.

hrtimers marked with HRTIMER_MODE_HARD must be kept in hard interrupt
context expiry mode. Add the required logic.

No functional change for PREEMPT_RT=n kernels.

[ tglx: Split out of a larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.551967692@linutronix.de
2019-08-01 20:51:21 +02:00
Sebastian Andrzej Siewior
902a9f9c50 tick: Mark tick related hrtimers to expiry in hard interrupt context
The tick related hrtimers, which drive the scheduler tick and hrtimer based
broadcasting are required to expire in hard interrupt context for obvious
reasons.

Mark them so PREEMPT_RT kernels wont move them to soft interrupt expiry.

Make the horribly formatted RCU_NONIDLE bracket maze readable while at it.

No functional change, 

[ tglx: Split out from larger combo patch. Add changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.459144407@linutronix.de
2019-08-01 20:51:21 +02:00
Sebastian Andrzej Siewior
d2ab4cf494 watchdog: Mark watchdog_hrtimer to expire in hard interrupt context
The watchdog hrtimer must expire in hard interrupt context even on
PREEMPT_RT=y kernels as otherwise the hard/softlockup detection logic would
not work.

No functional change.

[ tglx: Split out from larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.262895510@linutronix.de
2019-08-01 20:51:20 +02:00
Sebastian Andrzej Siewior
30f9028b6c perf/core: Mark hrtimers to expire in hard interrupt context
To guarantee that the multiplexing mechanism and the hrtimer driven events
work on PREEMPT_RT enabled kernels it's required that the related hrtimers
expire in hard interrupt context. Mark them so PREEMPT_RT kernels wont
defer them to soft interrupt context.

No functional change.

[ tglx: Split out of larger combo patch. Added changelog ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.169509224@linutronix.de
2019-08-01 20:51:20 +02:00
Sebastian Andrzej Siewior
d5096aa65a sched: Mark hrtimers to expire in hard interrupt context
The scheduler related hrtimers need to expire in hard interrupt context
even on PREEMPT_RT enabled kernels. Mark then as such.

No functional change.

[ tglx: Split out from larger combo patch. Add changelog. ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185753.077004842@linutronix.de
2019-08-01 20:51:19 +02:00
Thomas Gleixner
0ab6a3ddba hrtimer: Make enqueue mode check work on RT
hrtimer_start_range_ns() has a WARN_ONCE() which verifies that a timer
which is marker for softirq expiry is not queued in the hard interrupt base
and vice versa.

When PREEMPT_RT is enabled, timers which are not explicitely marked to
expire in hard interrupt context are deferrred to the soft interrupt. So
the regular check would trigger.

Change the check, so when PREEMPT_RT is enabled, it is verified that the
timers marked for hard interrupt expiry are not tried to be queued for soft
interrupt expiry or any of the unmarked and softirq marked is tried to be
expired in hard interrupt context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-08-01 20:51:19 +02:00
Thomas Gleixner
9dd8813ed9 hrtimer/treewide: Use hrtimer_sleeper_start_expires()
hrtimer_sleepers will gain a scheduling class dependent treatment on
PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
that possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-08-01 17:43:16 +02:00
Thomas Gleixner
01656464fc hrtimer: Provide hrtimer_sleeper_start_expires()
hrtimer_sleepers will gain a scheduling class dependent treatment on
PREEMPT_RT. Create a wrapper around hrtimer_start_expires() to make that
possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-08-01 17:43:15 +02:00
Sebastian Andrzej Siewior
dbc1625fc9 hrtimer: Consolidate hrtimer_init() + hrtimer_init_sleeper() calls
hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
object which is embedded into the hrtimer_sleeper.

Combine the initialization and spare a function call. Fixup all call sites.

This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
specific initializations of the embedded hrtimer without modifying any of
the call sites.

No functional change.

[ anna-maria: Minor cleanups ]
[ tglx: Adopted to the removal of the task argument of
  	hrtimer_init_sleeper() and trivial polishing.
	Folded a fix from Stephen Rothwell for the vsoc code ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de
2019-08-01 17:43:15 +02:00
Linus Torvalds
d2eee9fca1 Two minor fixes:
- Fix trace event header include guards, as several did not match
    the #define to the #ifdef
 
  - Remove a redundant test to ftrace_graph_notrace_addr() that
    was accidentally added.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXUFythQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsjjAP41jDPQogZMaQZrPW1qyp69DHhuZXI2
 j/d7A2LG76mCggD/WHPBk8P98IHk7pO5Ndl4yLKS3plMCYqTcgylpJLMXQI=
 =yEpO
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two minor fixes:

   - Fix trace event header include guards, as several did not match the
     #define to the #ifdef

   - Remove a redundant test to ftrace_graph_notrace_addr() that was
     accidentally added"

* tag 'trace-v5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  fgraph: Remove redundant ftrace_graph_notrace_addr() test
  tracing: Fix header include guards in trace event headers
2019-07-31 10:26:59 -07:00
Thomas Gleixner
9261660636 kprobes: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch kprobes conditional over to CONFIG_PREEMPTION.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.516286187@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:35 +02:00
Thomas Gleixner
30c937043b tracing: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch the conditionals in the tracer over to CONFIG_PREEMPTION.

This is the first step to make the tracer work on RT. The other small
tweaks are submitted separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.409766323@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:35 +02:00
Thomas Gleixner
01b1d88b09 rcu: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch the conditionals in RCU to use CONFIG_PREEMPTION.

That's the first step towards RCU on RT. The further tweaks are work in
progress. This neither touches the selftest bits which need a closer look
by Paul.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.210156346@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:35 +02:00
Thomas Gleixner
c1a280b68d sched/preempt: Use CONFIG_PREEMPTION where appropriate
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
functionality which today depends on CONFIG_PREEMPT.

Switch the preemption code, scheduler and init task over to use
CONFIG_PREEMPTION.

That's the first step towards RT in that area. The more complex changes are
coming separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20190726212124.117528401@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-31 19:03:34 +02:00
Changbin Du
6c77221df9 fgraph: Remove redundant ftrace_graph_notrace_addr() test
We already have tested it before. The second one should be removed.
With this change, the performance should have little improvement.

Link: http://lkml.kernel.org/r/20190730140850.7927-1-changbin.du@gmail.com

Cc: stable@vger.kernel.org
Fixes: 9cd2992f2d ("fgraph: Have set_graph_notrace only affect function_graph tracer")
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-30 21:50:03 -04:00
Thomas Gleixner
b744948725 hrtimer: Remove task argument from hrtimer_init_sleeper()
All callers hand in 'current' and that's the only task pointer which
actually makes sense. Remove the task argument and set current in the
function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726185752.791885290@linutronix.de
2019-07-30 23:57:51 +02:00
Christian Brauner
30b692d3b3
exit: make setting exit_state consistent
Since commit b191d6491b ("pidfd: fix a poll race when setting exit_state")
we unconditionally set exit_state to EXIT_ZOMBIE before calling into
do_notify_parent(). This was done to eliminate a race when querying
exit_state in do_notify_pidfd().
Back then we decided to do the absolute minimal thing to fix this and
not touch the rest of the exit_notify() function where exit_state is
set.
Since this fix has not caused any issues change the setting of
exit_state to EXIT_DEAD in the autoreap case to account for the fact hat
exit_state is set to EXIT_ZOMBIE unconditionally. This fix was planned
but also explicitly requested in [1] and makes the whole code more
consistent.

/* References */
[1]: https://lore.kernel.org/lkml/CAHk-=wigcxGFR2szue4wavJtH5cYTTeNES=toUBVGsmX0rzX+g@mail.gmail.com

Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-30 19:57:14 +02:00
Thomas Huth
2b089bf8d1 kernel/configs: Replace GPL boilerplate code with SPDX identifier
The FSF does not reside in "675 Mass Ave, Cambridge" anymore...
let's replace the old GPL boilerplate code with a proper SPDX
identifier instead.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-30 18:34:15 +02:00
Jessica Yu
38f054d549 modules: always page-align module section allocations
Some arches (e.g., arm64, x86) have moved towards non-executable
module_alloc() allocations for security hardening reasons. That means
that the module loader will need to set the text section of a module to
executable, regardless of whether or not CONFIG_STRICT_MODULE_RWX is set.

When CONFIG_STRICT_MODULE_RWX=y, module section allocations are always
page-aligned to handle memory rwx permissions. On some arches with
CONFIG_STRICT_MODULE_RWX=n however, when setting the module text to
executable, the BUG_ON() in frob_text() gets triggered since module
section allocations are not page-aligned when CONFIG_STRICT_MODULE_RWX=n.
Since the set_memory_* API works with pages, and since we need to call
set_memory_x() regardless of whether CONFIG_STRICT_MODULE_RWX is set, we
might as well page-align all module section allocations for ease of
managing rwx permissions of module sections (text, rodata, etc).

Fixes: 2eef1399a8 ("modules: fix BUG when load module with rodata=n")
Reported-by: Martin Kaiser <lists@kaiser.cx>
Reported-by: Bartosz Golaszewski <brgl@bgdev.pl>
Tested-by: David Lechner <david@lechnology.com>
Tested-by: Martin Kaiser <martin@kaiser.cx>
Tested-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-07-30 10:35:23 +02:00
Toke Høiland-Jørgensen
6f9d451ab1 xdp: Add devmap_hash map type for looking up devices by hashed index
A common pattern when using xdp_redirect_map() is to create a device map
where the lookup key is simply ifindex. Because device maps are arrays,
this leaves holes in the map, and the map has to be sized to fit the
largest ifindex, regardless of how many devices actually are actually
needed in the map.

This patch adds a second type of device map where the key is looked up
using a hashmap, instead of being used as an array index. This allows maps
to be densely packed, so they can be smaller.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-29 13:50:48 -07:00
Toke Høiland-Jørgensen
fca16e5107 xdp: Refactor devmap allocation code for reuse
The subsequent patch to add a new devmap sub-type can re-use much of the
initialisation and allocation code, so refactor it into separate functions.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-29 13:50:48 -07:00
Joel Fernandes (Google)
1caf7d50f4 pidfd: Add warning if exit_state is 0 during notification
Previously a condition got missed where the pidfd waiters are awakened
before the exit_state gets set. This can result in a missed notification
[1] and the polling thread waiting forever.

It is fixed now, however it would be nice to avoid this kind of issue
going unnoticed in the future. So just add a warning to catch it in the
future.

/* References */
[1]: https://lore.kernel.org/lkml/20190717172100.261204-1-joel@joelfernandes.org/

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190724164816.201099-1-joel@joelfernandes.org
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-29 17:20:19 +02:00
Nicolin Chen
f46cc01525 dma-contiguous: page-align the size in dma_free_contiguous()
According to the original dma_direct_alloc_pages() code:
{
	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;

	if (!dma_release_from_contiguous(dev, page, count))
		__free_pages(page, get_order(size));
}

The count parameter for dma_release_from_contiguous() was page
aligned before the right-shifting operation, while the new API
dma_free_contiguous() forgets to have PAGE_ALIGN() at the size.

So this patch simply adds it to prevent any corner case.

Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-29 09:50:04 +03:00
Nicolin Chen
c6622a425a dma-contiguous: do not overwrite align in dma_alloc_contiguous()
The dma_alloc_contiguous() limits align at CONFIG_CMA_ALIGNMENT for
cma_alloc() however it does not restore it for the fallback routine.
This will result in a size mismatch between the allocation and free
when running into the fallback routines after cma_alloc() fails, if
the align is larger than CONFIG_CMA_ALIGNMENT.

This patch adds a cma_align to take care of cma_alloc() and prevent
the align from being overwritten.

Fixes: fdaeec198ada ("dma-contiguous: add dma_{alloc,free}_contiguous() helpers")
Reported-by: Dafna Hirschfeld <dafna.hirschfeld@collabora.com>
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-29 09:50:04 +03:00
Linus Torvalds
e24ce84e85 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two fixes for the fair scheduling class:

   - Prevent freeing memory which is accessible by concurrent readers

   - Make the RCU annotations for numa groups consistent"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Use RCU accessors consistently for ->numa_group
  sched/fair: Don't free p->numa_faults with concurrent readers
2019-07-27 21:22:33 -07:00
Linus Torvalds
750991f9af Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A pile of perf related fixes:

  Kernel:
   - Fix SLOTS PEBS event constraints for Icelake CPUs

   - Add the missing mask bit to allow counting hardware generated
     prefetches on L3 for Icelake CPUs

   - Make the test for hypervisor platforms more accurate (as far as
     possible)

   - Handle PMUs correctly which override event->cpu

   - Yet another missing fallthrough annotation

  Tools:
     perf.data:
        - Fix loading of compressed data split across adjacent records
        - Fix buffer size setting for processing CPU topology perf.data
          header.

     perf stat:
        - Fix segfault for event group in repeat mode
        - Always separate "stalled cycles per insn" line, it was being
          appended to the "instructions" line.

     perf script:
        - Fix --max-blocks man page description.
        - Improve man page description of metrics.
        - Fix off by one in brstackinsn IPC computation.

     perf probe:
        - Avoid calling freeing routine multiple times for same pointer.

     perf build:
        - Do not use -Wshadow on gcc < 4.8, avoiding too strict warnings
          treated as errors, breaking the build"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Mark expected switch fall-throughs
  perf/core: Fix creating kernel counters for PMUs that override event->cpu
  perf/x86: Apply more accurate check on hypervisor platform
  perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register
  perf/x86/intel: Fix SLOTS PEBS event constraint
  perf build: Do not use -Wshadow on gcc < 4.8
  perf probe: Avoid calling freeing routine multiple times for same pointer
  perf probe: Set pev->nargs to zero after freeing pev->args entries
  perf session: Fix loading of compressed data split across adjacent records
  perf stat: Always separate stalled cycles per insn
  perf stat: Fix segfault for event group in repeat mode
  perf tools: Fix proper buffer size for feature processing
  perf script: Fix off by one in brstackinsn IPC computation
  perf script: Improve man page description of metrics
  perf script: Fix --max-blocks man page description
2019-07-27 21:17:56 -07:00
Linus Torvalds
431f288ed7 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "A set of locking fixes:

   - Address the fallout of the rwsem rework. Missing ACQUIREs and a
     sanity check to prevent a use-after-free

   - Add missing checks for unitialized mutexes when mutex debugging is
     enabled.

   - Remove the bogus code in the generic SMP variant of
     arch_futex_atomic_op_inuser()

   - Fixup the #ifdeffery in lockdep to prevent compile warnings"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/mutex: Test for initialized mutex
  locking/lockdep: Clean up #ifdef checks
  locking/lockdep: Hide unused 'class' variable
  locking/rwsem: Add ACQUIRE comments
  tty/ldsem, locking/rwsem: Add missing ACQUIRE to read_failed sleep loop
  lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loop
  locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty
  locking/rwsem: Don't call owner_on_cpu() on read-owner
  futex: Cleanup generic SMP variant of arch_futex_atomic_op_inuser()
2019-07-27 21:10:26 -07:00
Daniel Jordan
065cf57713 padata: purge get_cpu and reorder_via_wq from padata_do_serial
With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-27 21:08:37 +10:00
Herbert Xu
6fc4dbcf02 padata: Replace delayed timer with immediate workqueue in padata_reorder
The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0).  This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.

In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress.  All we have to
do is look at the next pqueue list.

This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.

A work queue is used instead of the original try_again to avoid
hogging the CPU.

Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken.  You
cannot flush async crypto requests so it makes no sense to even
try.  A subsequent patch will fix it by replacing it with a ref
counting scheme.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-27 21:08:36 +10:00
David S. Miller
28ba934d28 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-07-25

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix segfault in libbpf, from Andrii.

2) fix gso_segs access, from Eric.

3) tls/sockmap fixes, from Jakub and John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-25 17:35:03 -07:00
Linus Torvalds
a29a0a467e Merge branch 'access-creds'
The access() (and faccessat()) credentials change can cause an
unnecessary load on the RCU machinery because every access() call ends
up freeing the temporary access credential using RCU.

This isn't really noticeable on small machines, but if you have hundreds
of cores you can cause huge slowdowns due to RCU storms.

It's easy to avoid: the temporary access crededntials aren't actually
normally accessed using RCU at all, so we can avoid the whole issue by
just marking them as such.

* access-creds:
  access: avoid the RCU grace period for the temporary subjective credentials
2019-07-25 08:36:29 -07:00
Qian Cai
a1dc0446d6 sched/core: Silence a warning in sched_init()
Compiling a kernel with both FAIR_GROUP_SCHED=n and RT_GROUP_SCHED=n
will generate a compiler warning:

  kernel/sched/core.c: In function 'sched_init':
  kernel/sched/core.c:5906:32: warning: variable 'ptr' set but not used

It is unnecessary to have both "alloc_size" and "ptr", so just combine
them.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: valentin.schneider@arm.com
Link: https://lkml.kernel.org/r/20190720012319.884-1-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:05 +02:00
Juri Lelli
a07db5c086 sched/core: Fix CPU controller for !RT_GROUP_SCHED
On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible to
move RT tasks between cgroups to which CPU controller has been attached;
but it is oddly possible to first move tasks around and then make them
RT (setschedule to FIFO/RR).

E.g.:

  # mkdir /sys/fs/cgroup/cpu,cpuacct/group1
  # chrt -fp 10 $$
  # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
  bash: echo: write error: Invalid argument
  # chrt -op 0 $$
  # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
  # chrt -fp 10 $$
  # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks
  2345
  2598
  # chrt -p 2345
  pid 2345's current scheduling policy: SCHED_FIFO
  pid 2345's current scheduling priority: 10

Also, as Michal noted, it is currently not possible to enable CPU
controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there
are any kernel RT threads in root cgroup, they can't be migrated to the
newly created CPU controller's root in cgroup_update_dfl_csses()).

Existing code comes with a comment saying the "we don't support RT-tasks
being in separate groups". Such comment is however stale and belongs to
pre-RT_GROUP_SCHED times. Also, it doesn't make much sense for
!RT_GROUP_ SCHED configurations, since checks related to RT bandwidth
are not performed at all in these cases.

Make moving RT tasks between CPU controller groups viable by removing
special case check for RT (and DEADLINE) tasks.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:05 +02:00
Juri Lelli
710da3c8ea sched/core: Prevent race condition between cpuset and __sched_setscheduler()
No synchronisation mechanism exists between the cpuset subsystem and
calls to function __sched_setscheduler(). As such, it is possible that
new root domains are created on the cpuset side while a deadline
acceptance test is carried out in __sched_setscheduler(), leading to a
potential oversell of CPU bandwidth.

Grab cpuset_rwsem read lock from core scheduler, so to prevent
situations such as the one described above from happening.

The only exception is normalize_rt_tasks() which needs to work under
tasklist_lock and can't therefore grab cpuset_rwsem. We are fine with
this, as this function is only called by sysrq and, if that gets
triggered, DEADLINE guarantees are already gone out of the window
anyway.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-9-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:04 +02:00
Juri Lelli
1a763fd7c6 rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region
sched_setscheduler() needs to acquire cpuset_rwsem, but it is currently
called from an invalid (atomic) context by rcu_spawn_gp_kthread().

Fix that by simply moving sched_setscheduler_nocheck() call outside of
the atomic region, as it doesn't actually require to be guarded by
rcu_node lock.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-8-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:03 +02:00
Juri Lelli
d74b27d63a cgroup/cpuset: Change cpuset_rwsem and hotplug lock order
cpuset_rwsem is going to be acquired from sched_setscheduler() with a
following patch. There are however paths (e.g., spawn_ksoftirqd) in
which sched_scheduler() is eventually called while holding hotplug lock;
this creates a dependecy between hotplug lock (to be always acquired
first) and cpuset_rwsem (to be always acquired after hotplug lock).

Fix paths which currently take the two locks in the wrong order (after
a following patch is applied).

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-7-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:03 +02:00
Juri Lelli
1243dc518c cgroup/cpuset: Convert cpuset_mutex to percpu_rwsem
Holding cpuset_mutex means that cpusets are stable (only the holder can
make changes) and this is required for fixing a synchronization issue
between cpusets and scheduler core.  However, grabbing cpuset_mutex from
setscheduler() hotpath (as implemented in a later patch) is a no-go, as
it would create a bottleneck for tasks concurrently calling
setscheduler().

Convert cpuset_mutex to be a percpu_rwsem (cpuset_rwsem), so that
setscheduler() will then be able to read lock it and avoid concurrency
issues.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-6-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:02 +02:00
Juri Lelli
59d06cea11 sched/deadline: Fix bandwidth accounting at all levels after offline migration
If a task happens to be throttled while the CPU it was running on gets
hotplugged off, the bandwidth associated with the task is not correctly
migrated with it when the replenishment timer fires (offline_migration).

Fix things up, for this_bw, running_bw and total_bw, when replenishment
timer fires and task is migrated (dl_task_offline_migration()).

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: mathieu.poirier@linaro.org
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:02 +02:00
Mathieu Poirier
f9a25f776d cpusets: Rebuild root domain deadline accounting information
When the topology of root domains is modified by CPUset or CPUhotplug
operations information about the current deadline bandwidth held in the
root domain is lost.

This patch addresses the issue by recalculating the lost deadline
bandwidth information by circling through the deadline tasks held in
CPUsets and adding their current load to the root domain they are
associated with.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
[ Various additional modifications. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tj@kernel.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:55:01 +02:00
Mathieu Poirier
4b211f2b12 sched/core: Streamle calls to task_rq_unlock()
Calls to task_rq_unlock() are done several times in the
__sched_setscheduler() function.  This is fine when only the rq lock needs to be
handled but not so much when other locks come into play.

This patch streamlines the release of the rq lock so that only one
location need to be modified when dealing with more than one lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-3-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:57 +02:00
Mathieu Poirier
c22645f4c8 sched/topology: Add partition_sched_domains_locked()
Introduce the partition_sched_domains_locked() function by taking
the mutex locking code out of the original function.  That way
the work done by partition_sched_domains_locked() can be reused
without dropping the mutex lock.

No change of functionality is introduced by this patch.

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bristot@redhat.com
Cc: claudio@evidence.eu.com
Cc: lizefan@huawei.com
Cc: longman@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@santannapisa.it
Link: https://lkml.kernel.org/r/20190719140000.31694-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:57 +02:00
Viresh Kumar
60e17f5cef sched/fair: Introduce fits_capacity()
The same formula to check utilization against capacity (after
considering capacity_margin) is already used at 5 different locations.

This patch creates a new macro, fits_capacity(), which can be used from
all these locations without exposing the details of it and hence
simplify code.

All the 5 code locations are updated as well to use it..

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/b477ac75a2b163048bdaeb37f57b4c3f04f75a31.1559631700.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:56 +02:00
Wanpeng Li
e0e8d4911e sched/isolation: Prefer housekeeping CPU in local node
In real product setup, there will be houseeking CPUs in each nodes, it
is prefer to do housekeeping from local node, fallback to global online
cpumask if failed to find houseeking CPU from local node.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1561711901-4755-2-git-send-email-wanpengli@tencent.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:55 +02:00
Yi Wang
65d74e9169 sched/stats: Fix unlikely() use of sched_info_on()
sched_info_on() is called with unlikely hint, however, the test
is to be a constant(1) on which compiler will do nothing when
make defconfig, so remove the hint.

Also, fix a lack of {}.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: up2wing@gmail.com
Cc: wang.liang82@zte.com.cn
Cc: xue.zhihong@zte.com.cn
Link: https://lkml.kernel.org/r/1562301307-43002-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:55 +02:00
Matthew Wilcox (Oracle)
7b3c92b85a sched/core: Convert get_task_struct() to return the task
Returning the pointer that was passed in allows us to write
slightly more idiomatic code.  Convert a few users.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190704221323.24290-1-willy@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:54 +02:00
Viresh Kumar
3c29e651e1 sched/fair: Fall back to sched-idle CPU if idle CPU isn't found
We try to find an idle CPU to run the next task, but in case we don't
find an idle CPU it is better to pick a CPU which will run the task the
soonest, for performance reason.

A CPU which isn't idle but has only SCHED_IDLE activity queued on it
should be a good target based on this criteria as any normal fair task
will most likely preempt the currently running SCHED_IDLE task
immediately. In fact, choosing a SCHED_IDLE CPU over a fully idle one
shall give better results as it should be able to run the task sooner
than an idle CPU (which requires to be woken up from an idle state).

This patch updates both fast and slow paths with this optimization.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/eeafa25fdeb6f6edd5b2da716bc8f0ba7708cbcf.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:54 +02:00
Viresh Kumar
43e9f7f231 sched/fair: Start tracking SCHED_IDLE tasks count in cfs_rq
Track how many tasks are present with SCHED_IDLE policy in each cfs_rq.
This will be used by later commits.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: chris.redpath@arm.com
Cc: quentin.perret@linaro.org
Cc: songliubraving@fb.com
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Cc: tkjos@google.com
Link: https://lkml.kernel.org/r/0d3cdc427fc68808ad5bccc40e86ed0bf9da8bb4.1561523542.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:53 +02:00
Paul E. McKenney
84ec3a0787 time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint
time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint

The TASKS03 and TREE04 rcutorture scenarios produce the following
lockdep complaint:

	WARNING: inconsistent lock state
	5.2.0-rc1+ #513 Not tainted
	--------------------------------
	inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
	migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
	(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
	{IN-HARDIRQ-W} state was registered at:
	  lock_acquire+0xb0/0x1c0
	  _raw_spin_lock_irqsave+0x3c/0x50
	  tick_broadcast_switch_to_oneshot+0xd/0x40
	  tick_switch_to_oneshot+0x4f/0xd0
	  hrtimer_run_queues+0xf3/0x130
	  run_local_timers+0x1c/0x50
	  update_process_times+0x1c/0x50
	  tick_periodic+0x26/0xc0
	  tick_handle_periodic+0x1a/0x60
	  smp_apic_timer_interrupt+0x80/0x2a0
	  apic_timer_interrupt+0xf/0x20
	  _raw_spin_unlock_irqrestore+0x4e/0x60
	  rcu_nocb_gp_kthread+0x15d/0x590
	  kthread+0xf3/0x130
	  ret_from_fork+0x3a/0x50
	irq event stamp: 171
	hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
	hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
	softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
	softirqs last disabled at (0): [<0000000000000000>] 0x0

        [...]

To reproduce, run the following rcutorture test:

 $ tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"

It turns out that tick_broadcast_offline() was an innocent bystander.
After all, interrupts are supposed to be disabled throughout
take_cpu_down(), and therefore should have been disabled upon entry to
tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
and leaving them enabled on return.

Some debugging code showed that the culprit was sched_cpu_dying().
It had irqs enabled after return from sched_tick_stop().  Which in turn
had irqs enabled after return from cancel_delayed_work_sync().  Which is a
wrapper around __cancel_work_timer().  Which can sleep in the case where
something else is concurrently trying to cancel the same delayed work,
and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
idea when you are invoked from take_cpu_down(), regardless of the state
you leave interrupts in upon return.

Code inspection located no reason why the delayed work absolutely
needed to be canceled from sched_tick_stop():  The work is not
bound to the outgoing CPU by design, given that the whole point is
to collect statistics without disturbing the outgoing CPU.

This commit therefore simply drops the cancel_delayed_work_sync() from
sched_tick_stop().  Instead, a new ->state field is added to the tick_work
structure so that the delayed-work handler function sched_tick_remote()
can avoid reposting itself.  A cpu_is_offline() check is also added to
sched_tick_remote() to avoid mucking with the state of an offlined CPU
(though it does appear safe to do so).  The sched_tick_start() and
sched_tick_stop() functions also update ->state, and sched_tick_start()
also schedules the delayed work if ->state indicates that it is not
already in flight.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190625165238.GJ26519@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:53 +02:00
Vincent Guittot
f6cad8df6b sched/fair: Fix imbalance due to CPU affinity
The load_balance() has a dedicated mecanism to detect when an imbalance
is due to CPU affinity and must be handled at parent level. In this case,
the imbalance field of the parent's sched_group is set.

The description of sg_imbalanced() gives a typical example of two groups
of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
group and 3 CPUs of the second group. Something like:

	{ 0 1 2 3 } { 4 5 6 7 }
	        *     * * *

But the load_balance fails to fix this UC on my octo cores system
made of 2 clusters of quad cores.

Whereas the load_balance is able to detect that the imbalanced is due to
CPU affinity, it fails to fix it because the imbalance field is cleared
before letting parent level a chance to run. In fact, when the imbalance is
detected, the load_balance reruns without the CPU with pinned tasks. But
there is no other running tasks in the situation described above and
everything looks balanced this time so the imbalance field is immediately
cleared.

The imbalance field should not be cleared if there is no other task to move
when the imbalance is detected.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:52 +02:00
Valentin Schneider
9434f9f5d1 sched/fair: Change task_numa_work() storage to static
There are no callers outside of fair.c.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-4-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:52 +02:00
Valentin Schneider
b34920d4ce sched/fair: Move task_numa_work() init to init_numa_balancing()
We only need to set the callback_head worker function once, do it
during sched_fork().

While at it, move the comment regarding double task_work addition to
init_numa_balancing(), since the double add sentinel is first set there.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:51 +02:00
Valentin Schneider
d35927a144 sched/fair: Move init_numa_balancing() below task_numa_work()
To reference task_numa_work() from within init_numa_balancing(), we
need the former to be declared before the latter. Do just that.

This is a pure code movement.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mgorman@suse.de
Cc: riel@surriel.com
Link: https://lkml.kernel.org/r/20190715102508.32434-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:51:51 +02:00
Thomas Gleixner
0c09ab96fc cpu/hotplug: Cache number of online CPUs
Re-evaluating the bitmap wheight of the online cpus bitmap in every
invocation of num_online_cpus() over and over is a pretty useless
exercise. Especially when num_online_cpus() is used in code paths
like the IPI delivery of x86 or the membarrier code.

Cache the number of online CPUs in the core and just return the cached
variable. The accessor function provides only a snapshot when used without
protection against concurrent CPU hotplug.

The storage needs to use an atomic_t because the kexec and reboot code
(ab)use set_cpu_online() in their 'shutdown' handlers without any form of
serialization as pointed out by Mathieu. Regular CPU hotplug usage is
properly serialized.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907091622590.1634@nanos.tec.linutronix.de
2019-07-25 15:48:01 +02:00
Thomas Gleixner
e797bda3fd smp/hotplug: Track booted once CPUs in a cpumask
The booted once information which is required to deal with the MCE
broadcast issue on X86 correctly is stored in the per cpu hotplug state,
which is perfectly fine for the intended purpose.

X86 needs that information for supporting NMI broadcasting via shortcuts,
but retrieving it from per cpu data is cumbersome.

Move it to a cpumask so the information can be checked against the
cpu_present_mask quickly.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190722105219.818822855@linutronix.de
2019-07-25 15:47:37 +02:00
Bart Van Assche
8c779229d0 locking/lockdep: Report more stack trace statistics
Report the number of stack traces and the number of stack trace hash
chains. These two numbers are useful because these allow to estimate
the number of stack trace hash collisions.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-5-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:28 +02:00
Bart Van Assche
12593b7467 locking/lockdep: Reduce space occupied by stack traces
Although commit 669de8bda8 ("kernel/workqueue: Use dynamic lockdep keys
for workqueues") unregisters dynamic lockdep keys when a workqueue is
destroyed, a side effect of that commit is that all stack traces
associated with the lockdep key are leaked when a workqueue is destroyed.
Fix this by storing each unique stack trace once. Other changes in this
patch are:

- Use NULL instead of { .nr_entries = 0 } to represent 'no trace'.
- Store a pointer to a stack trace in struct lock_class and struct
  lock_list instead of storing 'nr_entries' and 'offset'.

This patch avoids that the following program triggers the "BUG:
MAX_STACK_TRACE_ENTRIES too low!" complaint:

	#include <fcntl.h>
	#include <unistd.h>

	int main()
	{
		for (;;) {
			int fd = open("/dev/infiniband/rdma_cm", O_RDWR);
			close(fd);
		}
	}

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-4-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:27 +02:00
Bart Van Assche
a297042164 stacktrace: Constify 'entries' arguments
Make it clear to humans and to the compiler that the stack trace
('entries') arguments are not modified.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-3-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:26 +02:00
Bart Van Assche
364f6afc4f locking/lockdep: Make it clear that what lock_class::key points at is not modified
This patch does not change the behavior of the lockdep code.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190722182443.216015-2-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:43:26 +02:00
Leonard Crestez
4ce54af8b3 perf/core: Fix creating kernel counters for PMUs that override event->cpu
Some hardware PMU drivers will override perf_event.cpu inside their
event_init callback. This causes a lockdep splat when initialized through
the kernel API:

 WARNING: CPU: 0 PID: 250 at kernel/events/core.c:2917 ctx_sched_out+0x78/0x208
 pc : ctx_sched_out+0x78/0x208
 Call trace:
  ctx_sched_out+0x78/0x208
  __perf_install_in_context+0x160/0x248
  remote_function+0x58/0x68
  generic_exec_single+0x100/0x180
  smp_call_function_single+0x174/0x1b8
  perf_install_in_context+0x178/0x188
  perf_event_create_kernel_counter+0x118/0x160

Fix this by calling perf_install_in_context with event->cpu, just like
perf_event_open

Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Frank Li <Frank.li@nxp.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/c4ebe0503623066896d7046def4d6b1e06e0eb2e.1563972056.git.leonard.crestez@nxp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:41:31 +02:00
Sebastian Andrzej Siewior
6c11c6e3d5 locking/mutex: Test for initialized mutex
An uninitialized/ zeroed mutex will go unnoticed because there is no
check for it. There is a magic check in the unlock's slowpath path which
might go unnoticed if the unlock happens in the fastpath.

Add a ->magic check early in the mutex_lock() and mutex_trylock() path.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190703092125.lsdf4gpsh2plhavb@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:27 +02:00
Arnd Bergmann
30a35f79fa locking/lockdep: Clean up #ifdef checks
As Will Deacon points out, CONFIG_PROVE_LOCKING implies TRACE_IRQFLAGS,
so the conditions I added in the previous patch, and some others in the
same file can be simplified by only checking for the former.

No functional change.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Fixes: 886532aee3 ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
Link: https://lkml.kernel.org/r/20190628102919.2345242-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:26 +02:00
Arnd Bergmann
68037aa782 locking/lockdep: Hide unused 'class' variable
The usage is now hidden in an #ifdef, so we need to move
the variable itself in there as well to avoid this warning:

  kernel/locking/lockdep_proc.c:203:21: error: unused variable 'class' [-Werror,-Wunused-variable]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yuyang Du <duyuyang@gmail.com>
Cc: frederic@kernel.org
Fixes: 68d41d8c94 ("locking/lockdep: Fix lock used or unused stats error")
Link: https://lkml.kernel.org/r/20190715092809.736834-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:25 +02:00
Peter Zijlstra
6ffddfb9e1 locking/rwsem: Add ACQUIRE comments
Since we just reviewed read_slowpath for ACQUIRE correctness, add a
few coments to retain our findings.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:25 +02:00
Peter Zijlstra
99143f82a2 lcoking/rwsem: Add missing ACQUIRE to read_slowpath sleep loop
While reviewing another read_slowpath patch, both Will and I noticed
another missing ACQUIRE, namely:

  X = 0;

  CPU0			CPU1

  rwsem_down_read()
    for (;;) {
      set_current_state(TASK_UNINTERRUPTIBLE);

                        X = 1;
                        rwsem_up_write();
                          rwsem_mark_wake()
                            atomic_long_add(adjustment, &sem->count);
                            smp_store_release(&waiter->task, NULL);

      if (!waiter.task)
        break;

      ...
    }

  r = X;

Allows 'r == 0'.

Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:24 +02:00
Jan Stancek
e1b98fa316 locking/rwsem: Add missing ACQUIRE to read_slowpath exit when queue is empty
LTP mtest06 has been observed to occasionally hit "still mapped when
deleted" and following BUG_ON on arm64.

The extra mapcount originated from pagefault handler, which handled
pagefault for vma that has already been detached. vma is detached
under mmap_sem write lock by detach_vmas_to_be_unmapped(), which
also invalidates vmacache.

When the pagefault handler (under mmap_sem read lock) calls
find_vma(), vmacache_valid() wrongly reports vmacache as valid.

After rwsem down_read() returns via 'queue empty' path (as of v5.2),
it does so without an ACQUIRE on sem->count:

  down_read()
    __down_read()
      rwsem_down_read_failed()
        __rwsem_down_read_failed_common()
          raw_spin_lock_irq(&sem->wait_lock);
          if (list_empty(&sem->wait_list)) {
            if (atomic_long_read(&sem->count) >= 0) {
              raw_spin_unlock_irq(&sem->wait_lock);
              return sem;

The problem can be reproduced by running LTP mtest06 in a loop and
building the kernel (-j $NCPUS) in parallel. It does reproduces since
v4.20 on arm64 HPE Apollo 70 (224 CPUs, 256GB RAM, 2 nodes). It
triggers reliably in about an hour.

The patched kernel ran fine for 10+ hours.

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dbueso@suse.de
Fixes: 4b486b535c ("locking/rwsem: Exit read lock slowpath if queue empty & no writer")
Link: https://lkml.kernel.org/r/50b8914e20d1d62bb2dee42d342836c2c16ebee7.1563438048.git.jstancek@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:23 +02:00
Waiman Long
7813430057 locking/rwsem: Don't call owner_on_cpu() on read-owner
For writer, the owner value is cleared on unlock. For reader, it is
left intact on unlock for providing better debugging aid on crash dump
and the unlock of one reader may not mean the lock is free.

As a result, the owner_on_cpu() shouldn't be used on read-owner
as the task pointer value may not be valid and it might have
been freed. That is the case in rwsem_spin_on_owner(), but not in
rwsem_can_spin_on_owner(). This can lead to use-after-free error from
KASAN. For example,

  BUG: KASAN: use-after-free in rwsem_down_write_slowpath
  (/home/miguel/kernel/linux/kernel/locking/rwsem.c:669
  /home/miguel/kernel/linux/kernel/locking/rwsem.c:1125)

Fix this by checking for RWSEM_READER_OWNED flag before calling
owner_on_cpu().

Reported-by: Luis Henriques <lhenriques@suse.com>
Tested-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Fixes: 94a9717b3c ("locking/rwsem: Make rwsem->owner an atomic_long_t")
Link: https://lkml.kernel.org/r/81e82d5b-5074-77e8-7204-28479bbe0df0@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:39:22 +02:00
Jann Horn
cb361d8cde sched/fair: Use RCU accessors consistently for ->numa_group
The old code used RCU annotations and accessors inconsistently for
->numa_group, which can lead to use-after-frees and NULL dereferences.

Let all accesses to ->numa_group use proper RCU helpers to prevent such
issues.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 8c8a743c50 ("sched/numa: Use {cpu, pid} to create task groups for shared faults")
Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:37:05 +02:00
Jann Horn
16d51a590a sched/fair: Don't free p->numa_faults with concurrent readers
When going through execve(), zero out the NUMA fault statistics instead of
freeing them.

During execve, the task is reachable through procfs and the scheduler. A
concurrent /proc/*/sched reader can read data from a freed ->numa_faults
allocation (confirmed by KASAN) and write it back to userspace.
I believe that it would also be possible for a use-after-free read to occur
through a race between a NUMA fault and execve(): task_numa_fault() can
lead to task_numa_compare(), which invokes task_weight() on the currently
running task of a different CPU.

Another way to fix this would be to make ->numa_faults RCU-managed or add
extra locking, but it seems easier to wipe the NUMA fault statistics on
execve.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Fixes: 82727018b0 ("sched/numa: Call task_numa_free() from do_execve()")
Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-25 15:37:04 +02:00
Linus Torvalds
d7852fbd0f access: avoid the RCU grace period for the temporary subjective credentials
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.

The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing
involves a RCU grace period.

Which is not a huge deal normally, but if you have a lot of access()
calls, this causes a fair amount of seconday damage: instead of having a
nice alloc/free patterns that hits in hot per-CPU slab caches, you have
all those delayed free's, and on big machines with hundreds of cores,
the RCU overhead can end up being enormous.

But it turns out that all of this is entirely unnecessary.  Exactly
because access() only installs the credential as the thread-local
subjective credential, the temporary cred pointer doesn't actually need
to be RCU free'd at all.  Once we're done using it, we can just free it
synchronously and avoid all the RCU overhead.

So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential
users for this).  We can make it a union with the rcu freeing list head
that we need for the RCU case, so this doesn't need any extra storage.

Note that this also makes 'get_current_cred()' clear the new non_rcu
flag, in case we have filesystems that take a long-term reference to the
cred and then expect the RCU delayed freeing afterwards.  It's not
entirely clear that this is required, but it makes for clear semantics:
the subjective cred remains non-RCU as long as you only access it
synchronously using the thread-local accessors, but you _can_ use it as
a generic cred if you want to.

It is possible that we should just remove the whole RCU markings for
->cred entirely.  Only ->real_cred is really supposed to be accessed
through RCU, and the long-term cred copies that nfs uses might want to
explicitly re-enable RCU freeing if required, rather than have
get_current_cred() do it implicitly.

But this is a "minimal semantic changes" change for the immediate
problem.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Glauber <jglauber@marvell.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-24 10:12:09 -07:00
Christoph Hellwig
66d7780f18 dma-mapping: check pfn validity in dma_common_{mmap,get_sgtable}
Check that the pfn returned from arch_dma_coherent_to_pfn refers to
a valid page and reject the mmap / get_sgtable requests otherwise.

Based on the arm implementation of the mmap and get_sgtable methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Vignesh Raghavendra <vigneshr@ti.com>
2019-07-24 17:28:54 +02:00
Peng Wang
a581563f1b cgroup: minor tweak for logic to get cgroup css
We could only handle the case that css exists
and css_try_get_online() fails.

Signed-off-by: Peng Wang <rocking@whu.edu.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-07-23 15:46:33 -07:00
Markus Elfring
85db002337 cgroup: Replace a seq_printf() call by seq_puts() in cgroup_print_ss_mask()
A string which did not contain a data format specification should be put
into a sequence. Thus use the corresponding function “seq_puts”.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-07-23 15:44:00 -07:00
Ilya Leoshkevich
d9b8aadaff bpf: fix narrower loads on s390
The very first check in test_pkt_md_access is failing on s390, which
happens because loading a part of a struct __sk_buff field produces
an incorrect result.

The preprocessed code of the check is:

{
	__u8 tmp = *((volatile __u8 *)&skb->len +
		((sizeof(skb->len) - sizeof(__u8)) / sizeof(__u8)));
	if (tmp != ((*(volatile __u32 *)&skb->len) & 0xFF)) return 2;
};

clang generates the following code for it:

      0:	71 21 00 03 00 00 00 00	r2 = *(u8 *)(r1 + 3)
      1:	61 31 00 00 00 00 00 00	r3 = *(u32 *)(r1 + 0)
      2:	57 30 00 00 00 00 00 ff	r3 &= 255
      3:	5d 23 00 1d 00 00 00 00	if r2 != r3 goto +29 <LBB0_10>

Finally, verifier transforms it to:

  0: (61) r2 = *(u32 *)(r1 +104)
  1: (bc) w2 = w2
  2: (74) w2 >>= 24
  3: (bc) w2 = w2
  4: (54) w2 &= 255
  5: (bc) w2 = w2

The problem is that when verifier emits the code to replace a partial
load of a struct __sk_buff field (*(u8 *)(r1 + 3)) with a full load of
struct sk_buff field (*(u32 *)(r1 + 104)), an optional shift and a
bitwise AND, it assumes that the machine is little endian and
incorrectly decides to use a shift.

Adjust shift count calculation to account for endianness.

Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-23 13:59:33 -07:00
Rafael J. Wysocki
8eb0fd3b55 PM: sleep: Integrate suspend-to-idle with generig suspend flow
After previous changes the suspend-to-idle code flow can be
integrated more tightly with the generic system suspend code flow
by making suspend_enter() call s2idle_loop() later and removing
the direct invocations of dpm_noirq_begin(),
dpm_noirq_suspend_devices(), dpm_noirq_end(), and
dpm_noirq_resume_devices() from the latter, so do that.

This change is not expected to alter functionality.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-23 09:46:50 +02:00
Rafael J. Wysocki
56b9918490 PM: sleep: Simplify suspend-to-idle control flow
After commit 33e4f80ee6 ("ACPI / PM: Ignore spurious SCI wakeups
from suspend-to-idle") the "noirq" phases of device suspend and
resume may run for multiple times during suspend-to-idle, if there
are spurious system wakeup events while suspended.  However, this
is complicated and fragile and actually unnecessary.

The main reason for doing this is that on some systems the EC may
signal system wakeup events (power button events, for example) as
well as events that should not cause the system to resume (spurious
system wakeup events).  Thus, in order to determine whether or not
a given event signaled by the EC while suspended is a proper system
wakeup one, the EC GPE needs to be dispatched and to start with that
was achieved by allowing the ACPI SCI action handler to run, which
was only possible after calling resume_device_irqs().

However, dispatching the EC GPE this way turned out to take too much
time in some cases and some EC events might be missed due to that, so
commit 68e2201185 ("ACPI: EC: Dispatch the EC GPE directly on
s2idle wake") started to dispatch the EC GPE right after a wakeup
event has been detected, so in fact the full ACPI SCI action handler
doesn't need to run any more to deal with the wakeups coming from the
EC.

Use this observation to simplify the suspend-to-idle control flow
so that the "noirq" phases of device suspend and resume are each
run only once in every suspend-to-idle cycle, which is reported to
significantly reduce power drawn by some systems when suspended to
idle (by allowing them to reach a deep platform-wide low-power state
through the suspend-to-idle flow).  [What appears to happen is that
the "noirq" resume of devices after a spurious EC wakeup brings some
devices into a state in which they prevent the platform from reaching
the deep low-power state going forward, even after a subsequent
"noirq" suspend phase, and on some systems the EC triggers such
wakeups already when the "noirq" suspend of devices is running for
the first time in the given suspend/resume cycle, so the platform
cannot reach the deep low-power state at all.]

First, make acpi_s2idle_wake() use the acpi_ec_dispatch_gpe() return
value to determine whether or not the wakeup may have been triggered
by the EC (in which case the system wakeup is canceled and ACPI
events are processed in order to determine whether or not the event
is a proper system wakeup one) and use rearm_wake_irq() (introduced
by a previous change) in it to rearm the ACPI SCI for system wakeup
detection in case the system will remain suspended.

Second, drop acpi_s2idle_sync(), which is not needed any more, and
the corresponding global platform suspend-to-idle callback.

Next, drop the pm_wakeup_pending() check (which is an optimization
only) from __device_suspend_noirq() to prevent it from returning
errors on system wakeups occurring before the "noirq" phase of
device suspend is complete (as in the case of suspend-to-idle it is
not known whether or not these wakeups are suprious at that point),
in order to avoid having to carry out a "noirq" resume of devices
on a spurious system wakeup.

Finally, change the code flow in s2idle_loop() to (1) run the
"noirq" suspend of devices once before starting the loop, (2) check
for spurious EC wakeups (via the platform ->wake callback) for the
first time before calling s2idle_enter(), and (3) run the "noirq"
resume of devices once after leaving the loop.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-23 09:46:40 +02:00
Rafael J. Wysocki
3a79bc63d9 PCI: irq: Introduce rearm_wake_irq()
Introduce a new function, rearm_wake_irq(), allowing a wakeup IRQ
to be armed for systen wakeup detection again without running any
action handlers associated with it after it has been armed for
wakeup detection and triggered.

That is useful for IRQs, like ACPI SCI, that may deliver wakeup
as well as non-wakeup interrupts when armed for systen wakeup
detection.  In those cases, it may be possible to determine whether
or not the delivered interrupt is a systen wakeup one without
running the entire action handler (or handlers, if the IRQ is
shared) for the IRQ, and if the interrupt turns out to be a
non-wakeup one, the IRQ can be rearmed with the help of the
new function.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-23 09:37:51 +02:00
Linus Torvalds
7b5cf701ea Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull preemption Kconfig fix from Thomas Gleixner:
 "The PREEMPT_RT stub config renamed PREEMPT to PREEMPT_LL and defined
  PREEMPT outside of the menu and made it selectable by both PREEMPT_LL
  and PREEMPT_RT.

  Stupid me missed that 114 defconfigs select CONFIG_PREEMPT which
  obviously can't work anymore. oldconfig builds are affected as well,
  but it's more obvious as the user gets asked. [old]defconfig silently
  fixes it up and selects PREEMPT_NONE.

  Unbreak it by undoing the rename and adding a intermediate config
  symbol which is selected by both PREEMPT and PREEMPT_RT. That requires
  to chase down a few #ifdefs, but it's better than tweaking 114
  defconfigs and annoying users"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y
2019-07-22 09:30:34 -07:00
Thomas Gleixner
b8d3349803 sched/rt, Kconfig: Unbreak def/oldconfig with CONFIG_PREEMPT=y
The merge of the CONFIG_PREEMPT_RT stub renamed CONFIG_PREEMPT to
CONFIG_PREEMPT_LL which causes all defconfigs which have CONFIG_PREEMPT=y
set to fall back to CONFIG_PREEMPT_NONE because CONFIG_PREEMPT depends on
the preemption mode choice wich defaults to NONE. This also affects
oldconfig builds.

So rather than changing 114 defconfig files and being an annoyance to
users, revert the rename and select a new config symbol PREEMPTION. That
keeps everything working smoothly and the revelant ifdef's are going to be
fixed up step by step.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: a50a3f4b6a ("sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-07-22 18:05:11 +02:00
Suren Baghdasaryan
b191d6491b
pidfd: fix a poll race when setting exit_state
There is a race between reading task->exit_state in pidfd_poll and
writing it after do_notify_parent calls do_notify_pidfd. Expected
sequence of events is:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
  tsk->exit_state = EXIT_DEAD
                                  pidfd_poll
                                     if (tsk->exit_state)

However nothing prevents the following sequence:

CPU 0                            CPU 1
------------------------------------------------
exit_notify
  do_notify_parent
    do_notify_pidfd
                                   pidfd_poll
                                      if (tsk->exit_state)
  tsk->exit_state = EXIT_DEAD

This causes a polling task to wait forever, since poll blocks because
exit_state is 0 and the waiting task is not notified again. A stress
test continuously doing pidfd poll and process exits uncovered this bug.

To fix it, we make sure that the task's exit_state is always set before
calling do_notify_pidfd.

Fixes: b53b0b9d9a ("pidfd: add polling support")
Cc: kernel-team@android.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
[christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-22 16:02:03 +02:00
Dave Hansen
f240652b60 x86/mpx: Remove MPX APIs
MPX is being removed from the kernel due to a lack of support in the
toolchain going forward (gcc).

The first step is to remove the userspace-visible ABIs so that applications
will stop using it.  The most visible one are the enable/disable prctl()s.
Remove them first.

This is the most minimal and least invasive change needed to ensure that
apps stop using MPX with new kernels.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190705175321.DB42F0AD@viggo.jf.intel.com
2019-07-22 11:54:57 +02:00
Linus Torvalds
ac60602a6d dma-mapping fixes for 5.3-rc1
Fix various regressions:
 
  - force unencrypted dma-coherent buffers if encryption bit can't fit
    into the dma coherent mask (Tom Lendacky)
  - avoid limiting request size if swiotlb is not used (me)
  - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device
    (Fugang Duan)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl0zTvELHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMAsQ/6AleklMsMbc1xPsYYMukmjAOUNf+nvsFG4PRs/KVn
 1/Yohkxx/FN3oXZ+zZEnyd8a5u0ghwkN1WDivEhpclzbDuQP+Z+jEDmb37Oea4aJ
 L6XRLQJYiFwwEA6oJ87FNVMZXK/QUo+/lnDvJg0xNW6+HiR4GAUmnqy+/KyEIRSf
 SX+aiUOX4tUkwHPWyMaWvTlZ4hZgSovXwkUnR08jCwyJFezUwJBr/Yf5G6M1C10B
 hPFTrREhaekXgFd5E1dwKNk5omvfihxGyVUujFZhtMvs//LP8GcFLcVtYRWM/SUZ
 XpKkXxnaRC0gEm2P4/tSEGL3xl1CST/oYde74KNBQDIe0svGFS0QrP68+4zu/1ih
 vaf2gHoCoJciFY2DHglw1OG/gMWW06OtdseOKe9LZXtsGA6HCVBZW4c01V5YHVQT
 TMQMr0UyxJzmrxCo+LafAf9DoQxIii8WapewomwceL0TUtIDIujirzC/ieLhNPKL
 L2Fk+zPtFL24IpVe52S1PngatlW4MioiyiJji1QM0RK1V68+r/nSKPBxeq9s+jR3
 CfGvfhfRDd/NbZ9m66YFUaRzHL6Fpi2hMvJc9O6dgcVEYEBrL0d8J9nH42cqOlfe
 OBGeCxnFNQMuBp4Tw1OZO9PjzR3+pQOb32pOWLDUUs9ed3gtdMrJYTKhw9/cLpyp
 838=
 =Bv+Q
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Fix various regressions:

   - force unencrypted dma-coherent buffers if encryption bit can't fit
     into the dma coherent mask (Tom Lendacky)

   - avoid limiting request size if swiotlb is not used (me)

   - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device (Fugang
     Duan)"

* tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device
  dma-direct: only limit the mapping size if swiotlb could be used
  dma-mapping: add a dma_addressing_limited helper
  dma-direct: Force unencrypted DMA under SME for certain DMA masks
2019-07-20 12:09:52 -07:00
Linus Torvalds
e6023adc5c Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:

 - A collection of objtool fixes which address recent fallout partially
   exposed by newer toolchains, clang, BPF and general code changes.

 - Force USER_DS for user stack traces

[ Note: the "objtool fixes" are not all to objtool itself, but for
  kernel code that triggers objtool warnings.

  Things like missing function size annotations, or code that confuses
  the unwinder etc.   - Linus]

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  objtool: Support conditional retpolines
  objtool: Convert insn type to enum
  objtool: Fix seg fault on bad switch table entry
  objtool: Support repeated uses of the same C jump table
  objtool: Refactor jump table code
  objtool: Refactor sibling call detection logic
  objtool: Do frame pointer check before dead end check
  objtool: Change dead_end_function() to return boolean
  objtool: Warn on zero-length functions
  objtool: Refactor function alias logic
  objtool: Track original function across branches
  objtool: Add mcsafe_handle_tail() to the uaccess safe list
  bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
  x86/uaccess: Remove redundant CLACs in getuser/putuser error paths
  x86/uaccess: Don't leak AC flag into fentry from mcsafe_handle_tail()
  x86/uaccess: Remove ELF function annotation from copy_user_handle_tail()
  x86/head/64: Annotate start_cpu0() as non-callable
  x86/entry: Fix thunk function ELF sizes
  x86/kvm: Don't call kvm_spurious_fault() from .fixup
  x86/kvm: Replace vmx_vmenter()'s call to kvm_spurious_fault() with UD2
  ...
2019-07-20 10:45:15 -07:00
Linus Torvalds
4b01f5a4c9 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp fix from Thomas Gleixner:
 "Add warnings to the smp function calls so callers from wrong contexts
  get detected"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Warn on function calls from softirq context
2019-07-20 10:43:03 -07:00
Linus Torvalds
70e6e1b971 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CONFIG_PREEMPT_RT stub config from Thomas Gleixner:
 "The real-time preemption patch set exists for almost 15 years now and
  while the vast majority of infrastructure and enhancements have found
  their way into the mainline kernel, the final integration of RT is
  still missing.

  Over the course of the last few years, we have worked on reducing the
  intrusivenness of the RT patches by refactoring kernel infrastructure
  to be more real-time friendly. Almost all of these changes were
  benefitial to the mainline kernel on their own, so there was no
  objection to integrate them.

  Though except for the still ongoing printk refactoring, the remaining
  changes which are required to make RT a first class mainline citizen
  are not longer arguable as immediately beneficial for the mainline
  kernel. Most of them are either reordering code flows or adding RT
  specific functionality.

  But this now has hit a wall and turned into a classic hen and egg
  problem:

     Maintainers are rightfully wary vs. these changes as they make only
     sense if the final integration of RT into the mainline kernel takes
     place.

  Adding CONFIG_PREEMPT_RT aims to solve this as a clear sign that RT
  will be fully integrated into the mainline kernel. The final
  integration of the missing bits and pieces will be of course done with
  the same careful approach as we have used in the past.

  While I'm aware that you are not entirely enthusiastic about that, I
  think that RT should receive the same treatment as any other widely
  used out of tree functionality, which we have accepted into mainline
  over the years.

  RT has become the de-facto standard real-time enhancement and is
  shipped by enterprise, embedded and community distros. It's in use
  throughout a wide range of industries: telecommunications, industrial
  automation, professional audio, medical devices, data acquisition,
  automotive - just to name a few major use cases.

  RT development is backed by a Linuxfoundation project which is
  supported by major stakeholders of this technology. The funding will
  continue over the actual inclusion into mainline to make sure that the
  functionality is neither introducing regressions, regressing itself,
  nor becomes subject to bitrot. There is also a lifely user community
  around RT as well, so contrary to the grim situation 5 years ago, it's
  a healthy project.

  As RT is still a good vehicle to exercise rarely used code paths and
  to detect hard to trigger issues, you could at least view it as a QA
  tool if nothing else"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT
2019-07-20 10:33:44 -07:00
Linus Torvalds
07ab9d5bc5 Mostly bugfixes, but also:
- s390 support for KVM selftests
 - LAPIC timer offloading to housekeeping CPUs
 - Extend an s390 optimization for overcommitted hosts to all architectures
 - Debugging cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJdMr1FAAoJEL/70l94x66DvIkH/iVuUX9jO1NoQ7qhxeo04MnT
 GP9mX3XnWoI/iN0zAIRfQSP2/9a6+KblgdiziABhju58j5dCfAZGb5793TQppweb
 3ubl11vy7YkzaXJ0b35K7CFhOU9oSlHHGyi5Uh+yyje5qWNxwmHpizxjynbFTKb6
 +/S7O2Ua1VrAVvx0i0IRtwanIK/jF4dStVButgVaVdUva3zLaQmeI71iaJl9ddXY
 bh50xoYua5Ek6+ENi+nwCNVy4OF152AwDbXlxrU0QbeA1B888Qio7nIqb3bwwPpZ
 /8wMVvPzQgL7RmgtY5E5Z4cCYuu7mK8wgGxhuk3oszlVwZJ5rmnaYwGEl4x1s7o=
 =giag
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "Mostly bugfixes, but also:

   - s390 support for KVM selftests

   - LAPIC timer offloading to housekeeping CPUs

   - Extend an s390 optimization for overcommitted hosts to all
     architectures

   - Debugging cleanups and improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: x86: Add fixed counters to PMU filter
  KVM: nVMX: do not use dangling shadow VMCS after guest reset
  KVM: VMX: dump VMCS on failed entry
  KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
  KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup
  KVM: Boost vCPUs that are delivering interrupts
  KVM: selftests: Remove superfluous define from vmx.c
  KVM: SVM: Fix detection of AMD Errata 1096
  KVM: LAPIC: Inject timer interrupt via posted interrupt
  KVM: LAPIC: Make lapic timer unpinned
  KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters
  KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS
  kvm: x86: ioapic and apic debug macros cleanup
  kvm: x86: some tsc debug cleanup
  kvm: vmx: fix coccinelle warnings
  x86: kvm: avoid constant-conversion warning
  x86: kvm: avoid -Wsometimes-uninitized warning
  KVM: x86: expose AVX512_BF16 feature to guest
  KVM: selftests: enable pgste option for the linker on s390
  KVM: selftests: Move kvm_create_max_vcpus test to generic code
  ...
2019-07-20 10:20:27 -07:00
Peter Zijlstra
19dbdcb803 smp: Warn on function calls from softirq context
It's clearly documented that smp function calls cannot be invoked from
softirq handling context. Unfortunately nothing enforces that or emits a
warning.

A single function call can be invoked from softirq context only via
smp_call_function_single_async().

The only legit context is task context, so add a warning to that effect.

Reported-by: luferry <luferry@163.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190718160601.GP3402@hirez.programming.kicks-ass.net
2019-07-20 11:27:16 +02:00
Wanpeng Li
0c5f81dad4 KVM: LAPIC: Inject timer interrupt via posted interrupt
Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside.  There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits.  This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.

In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.

The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially.  Without patch

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )

While with patch:

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-20 09:00:40 +02:00
Linus Torvalds
dd4542d282 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:

 - Fix missed wake-up race in padata

 - Use crypto_memneq in ccp

 - Fix version check in ccp

 - Fix fuzz test failure in ccp

 - Fix potential double free in crypto4xx

 - Fix compile warning in stm32

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
  crypto: ccp - Fix SEV_VERSION_GREATER_OR_EQUAL
  crypto: ccp/gcm - use const time tag comparison.
  crypto: ccp - memset structure fields to zero before reuse
  crypto: crypto4xx - fix a potential double free in ppc4xx_trng_probe
  crypto: stm32/hash - Fix incorrect printk modifier for size_t
2019-07-19 12:23:37 -07:00
Linus Torvalds
41ba485ef1 Eiichi Tsukata found a small bug from the fixup of the stack code
Removing ULONG_MAX as the marker for the user stack trace end,
 made the tracing code not know where the end is. The end is now
 marked with a zero (NULL) pointer. Eiichi fixed this in the tracing
 code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXTHsuRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgETAQDqRtu1KhJM6ujNlPY1aw6e9ncDAqWn
 6GaumMgAdBqEcAEAxJSjr5UlzXuJsCjUjwE0txLfTscyNwljKW77h4/WNwA=
 =bwtH
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Eiichi Tsukata found a small bug from the fixup of the stack code

  Removing ULONG_MAX as the marker for the user stack trace end, made
  the tracing code not know where the end is. The end is now marked with
  a zero (NULL) pointer. Eiichi fixed this in the tracing code"

* tag 'trace-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix user stack trace "??" output
2019-07-19 12:18:46 -07:00
Linus Torvalds
4f5ed1318c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  perf_event_get(): don't bother with fget_raw()
  vfs: update d_make_root() description
2019-07-19 11:35:08 -07:00
Linus Torvalds
933a90bf4f Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount updates from Al Viro:
 "The first part of mount updates.

  Convert filesystems to use the new mount API"

* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  mnt_init(): call shmem_init() unconditionally
  constify ksys_mount() string arguments
  don't bother with registering rootfs
  init_rootfs(): don't bother with init_ramfs_fs()
  vfs: Convert smackfs to use the new mount API
  vfs: Convert selinuxfs to use the new mount API
  vfs: Convert securityfs to use the new mount API
  vfs: Convert apparmorfs to use the new mount API
  vfs: Convert openpromfs to use the new mount API
  vfs: Convert xenfs to use the new mount API
  vfs: Convert gadgetfs to use the new mount API
  vfs: Convert oprofilefs to use the new mount API
  vfs: Convert ibmasmfs to use the new mount API
  vfs: Convert qib_fs/ipathfs to use the new mount API
  vfs: Convert efivarfs to use the new mount API
  vfs: Convert configfs to use the new mount API
  vfs: Convert binfmt_misc to use the new mount API
  convenience helper: get_tree_single()
  convenience helper get_tree_nodev()
  vfs: Kill sget_userns()
  ...
2019-07-19 10:42:02 -07:00
Linus Torvalds
5f4fc6d440 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix AF_XDP cq entry leak, from Ilya Maximets.

 2) Fix handling of PHY power-down on RTL8411B, from Heiner Kallweit.

 3) Add some new PCI IDs to iwlwifi, from Ihab Zhaika.

 4) Fix handling of neigh timers wrt. entries added by userspace, from
    Lorenzo Bianconi.

 5) Various cases of missing of_node_put(), from Nishka Dasgupta.

 6) The new NET_ACT_CT needs to depend upon NF_NAT, from Yue Haibing.

 7) Various RDS layer fixes, from Gerd Rausch.

 8) Fix some more fallout from TCQ_F_CAN_BYPASS generalization, from
    Cong Wang.

 9) Fix FIB source validation checks over loopback, also from Cong Wang.

10) Use promisc for unsupported number of filters, from Justin Chen.

11) Missing sibling route unlink on failure in ipv6, from Ido Schimmel.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
  tcp: fix tcp_set_congestion_control() use from bpf hook
  ag71xx: fix return value check in ag71xx_probe()
  ag71xx: fix error return code in ag71xx_probe()
  usb: qmi_wwan: add D-Link DWM-222 A2 device ID
  bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
  net: dsa: sja1105: Fix missing unlock on error in sk_buff()
  gve: replace kfree with kvfree
  selftests/bpf: fix test_xdp_noinline on s390
  selftests/bpf: fix "valid read map access into a read-only array 1" on s390
  net/mlx5: Replace kfree with kvfree
  MAINTAINERS: update netsec driver
  ipv6: Unlink sibling route in case of failure
  liquidio: Replace vmalloc + memset with vzalloc
  udp: Fix typo in net/ipv4/udp.c
  net: bcmgenet: use promisc for unsupported filters
  ipv6: rt6_check should return NULL if 'from' is NULL
  tipc: initialize 'validated' field of received packets
  selftests: add a test case for rp_filter
  fib: relax source validation check for loopback packets
  mlxsw: spectrum: Do not process learned records with a dummy FID
  ...
2019-07-19 10:06:06 -07:00
Linus Torvalds
249be8511b Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
 "The rest of MM and a kernel-wide procfs cleanup.

  Summary of the more significant patches:

   - Patch series "mm/memory_hotplug: Factor out memory block
     devicehandling", v3. David Hildenbrand.

     Some spring-cleaning of the memory hotplug code, notably in
     drivers/base/memory.c

   - "mm: thp: fix false negative of shmem vma's THP eligibility". Yang
     Shi.

     Fix /proc/pid/smaps output for THP pages used in shmem.

   - "resource: fix locking in find_next_iomem_res()" + 1. Nadav Amit.

     Bugfix and speedup for kernel/resource.c

   - Patch series "mm: Further memory block device cleanups", David
     Hildenbrand.

     More spring-cleaning of the memory hotplug code.

   - Patch series "mm: Sub-section memory hotplug support". Dan
     Williams.

     Generalise the memory hotplug code so that pmem can use it more
     completely. Then remove the hacks from the libnvdimm code which
     were there to work around the memory-hotplug code's constraints.

   - "proc/sysctl: add shared variables for range check", Matteo Croce.

     We have about 250 instances of

          int zero;
          ...
                  .extra1 = &zero,

     in the tree. This is a tree-wide sweep to make all those private
     "zero"s and "one"s use global variables.

     Alas, it isn't practical to make those two global integers const"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (38 commits)
  proc/sysctl: add shared variables for range check
  mm: migrate: remove unused mode argument
  mm/sparsemem: cleanup 'section number' data types
  libnvdimm/pfn: stop padding pmem namespaces to section alignment
  libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields
  mm/devm_memremap_pages: enable sub-section remap
  mm: document ZONE_DEVICE memory-model implications
  mm/sparsemem: support sub-section hotplug
  mm/sparsemem: prepare for sub-section ranges
  mm: kill is_dev_zone() helper
  mm/hotplug: kill is_dev_zone() usage in __remove_pages()
  mm/sparsemem: convert kmalloc_section_memmap() to populate_section_memmap()
  mm/hotplug: prepare shrink_{zone, pgdat}_span for sub-section removal
  mm/sparsemem: add helpers track active portions of a section at boot
  mm/sparsemem: introduce a SECTION_IS_EARLY flag
  mm/sparsemem: introduce struct mem_section_usage
  drivers/base/memory.c: get rid of find_memory_block_hinted()
  mm/memory_hotplug: move and simplify walk_memory_blocks()
  mm/memory_hotplug: rename walk_memory_range() and pass start+size instead of pfns
  mm: make register_mem_sect_under_node() static
  ...
2019-07-19 09:45:58 -07:00
Eiichi Tsukata
6d54ceb539 tracing: Fix user stack trace "??" output
Commit c5c27a0a58 ("x86/stacktrace: Remove the pointless ULONG_MAX
marker") removes ULONG_MAX marker from user stack trace entries but
trace_user_stack_print() still uses the marker and it outputs unnecessary
"??".

For example:

            less-1911  [001] d..2    34.758944: <user stack trace>
   =>  <00007f16f2295910>
   => ??
   => ??
   => ??
   => ??
   => ??
   => ??
   => ??

The user stack trace code zeroes the storage before saving the stack, so if
the trace is shorter than the maximum number of entries it can terminate
the print loop if a zero entry is detected.

Link: http://lkml.kernel.org/r/20190630085438.25545-1-devel@etsukata.com

Cc: stable@vger.kernel.org
Fixes: 4285f2fcef ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-19 12:12:39 -04:00
Fugang Duan
449fa54d68 dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device
dma_map_sg() may use swiotlb buffer when the kernel command line includes
"swiotlb=force" or the dma_addr is out of dev->dma_mask range.  After
DMA complete the memory moving from device to memory, then user call
dma_sync_sg_for_cpu() to sync with DMA buffer, and copy the original
virtual buffer to other space.

So dma_direct_sync_sg_for_cpu() should use swiotlb physical addr, not
the original physical addr from sg_phys(sg).

dma_direct_sync_sg_for_device() also has the same issue, correct it as
well.

Fixes: 55897af63091("dma-direct: merge swiotlb_dma_ops into the dma_direct code")
Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-19 14:09:40 +02:00
Matteo Croce
eec4844fae proc/sysctl: add shared variables for range check
In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range.  This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
  Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
  Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Dan Williams
7cc7867fb0 mm/devm_memremap_pages: enable sub-section remap
Teach devm_memremap_pages() about the new sub-section capabilities of
arch_{add,remove}_memory().  Effectively, just replace all usage of
align_start, align_end, and align_size with res->start, res->end, and
resource_size(res).  The existing sanity check will still make sure that
the two separate remap attempts do not collide within a sub-section (2MB
on x86).

Link: http://lkml.kernel.org/r/156092355542.979959.10060071713397030576.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
Cc: Michal Hocko <mhocko@suse.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:07 -07:00
Nadav Amit
756398750e resource: avoid unnecessary lookups in find_next_iomem_res()
find_next_iomem_res() shows up to be a source for overhead in dax
benchmarks.

Improve performance by not considering children of the tree if the top
level does not match.  Since the range of the parents should include the
range of the children such check is redundant.

Running sysbench on dax (pmem emulation, with write_cache disabled):

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Provides the following results:

		events (avg/stddev)
		-------------------
  5.2-rc3:	1247669.0000/16075.39
  w/patch:	1286320.5000/16402.72	(+3%)

Link: http://lkml.kernel.org/r/20190613045903.4922-3-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:06 -07:00
Nadav Amit
49f17c26c1 resource: fix locking in find_next_iomem_res()
Since resources can be removed, locking should ensure that the resource
is not removed while accessing it.  However, find_next_iomem_res() does
not hold the lock while copying the data of the resource.

Keep holding the lock while the data is copied.  While at it, change the
return value to a more informative value.  It is disregarded by the
callers.

[akpm@linux-foundation.org: fix find_next_iomem_res() documentation]
Link: http://lkml.kernel.org/r/20190613045903.4922-2-namit@vmware.com
Fixes: ff3cc952d3 ("resource: Add remove_resource interface")
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:06 -07:00
Thomas Gleixner
a50a3f4b6a sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT
Add a new entry to the preemption menu which enables the real-time support
for the kernel. The choice is only enabled when an architecture supports
it.

It selects PREEMPT as the RT features depend on it. To achieve that the
existing PREEMPT choice is renamed to PREEMPT_LL which select PREEMPT as
well.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Clark Williams <williams@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Daniel Wagner <wagi@monom.org>
Acked-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Acked-by: Julia Cartwright <julia@ni.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Acked-by: Gratian Crisan <gratian.crisan@ni.com>
Acked-by: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Andrew Morton <akpm@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1907172200190.1778@nanos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-18 23:10:57 +02:00
David S. Miller
bb74523167 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-07-18

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) verifier precision propagation fix, from Andrii.

2) BTF size fix for typedefs, from Andrii.

3) a bunch of big endian fixes, from Ilya.

4) wide load from bpf_sock_addr fixes, from Stanislav.

5) a bunch of misc fixes from a number of developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-18 14:04:45 -07:00
Linus Torvalds
da0acd7c65 Modules updates for v5.3
Summary of modules changes for the 5.3 merge window:
 
 - Code fixes and cleanups
 
 - Fix bug where set_memory_x() wasn't being called when rodata=n
 
 - Fix bug where -EEXIST was being returned for going modules
 
 - Allow arches to override module_exit_section()
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJdMET+AAoJEMBFfjjOO8FyWg4P/RFUwHUxdpouLlVjlJVx40ko
 WWFMjQ7zDjhWJVvqataQJS3L3yqhZEC+KrSetqKgP0NxKQ2ev4BwgGF822yorQRw
 Aorw/aHk9xIyfv7tUEjQgLpa7j1zv15uLx+Y4mkRSgc1uZve0MYKgMsU4db97pTk
 XUyronakoM9h4z7NVyV21F9dgprnytzdWwgYdjEx2aMPzF/+3MAPN6H7d8Cb8v4X
 85UwPb8SB5F3U/wzYowrHmlFTlAOkroUrZRErwUOlCn8rvZefkt4NnJwShBC7YPa
 N0CVemw6sgy5KcDUczL7oQJ+kyYwtHN0i0EhvTMixPFKeeCI96Y+UNPZxxQANd6f
 0djYqvRtfqAp7b2bRSfSiApRdxMO3rl17tY/c1MiIb4/Yk5BtofX+8edW3eMX3Xc
 eL5EDL69idgjx1llXmScmjbWiy7Mg61mmo6PN8T/wyIqtDQ/CXqovHdh8Ehbubuf
 5cC7fQiVSRhqvg0N4w2FEr+hV+VpEjdnfxypd46ABDVhYYir7ybMLoHYwBimwKSD
 70qOJWfiRjnEES0qaO3JH/nlXhQGSnPpfe7kwx0LAIyIdTTfarPjGbK/5wcNcReA
 f3A7R2p//G7sNg3DWYSG1cKEjgOAwMjD00SOBov82K9VQ8spwMexIo93Hj6BEbY5
 5cex2njf5G2YNKDfAvoi
 =/sk3
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Summary of modules changes for the 5.3 merge window:

   - Code fixes and cleanups

   - Fix bug where set_memory_x() wasn't being called when rodata=n

   - Fix bug where -EEXIST was being returned for going modules

   - Allow arches to override module_exit_section()"

* tag 'modules-for-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  modules: fix compile error if don't have strict module rwx
  ARM: module: recognize unwind exit sections
  module: allow arch overrides for .exit section names
  modules: fix BUG when load module with rodata=n
  kernel/module: Fix mem leak in module_add_modinfo_attrs
  kernel: module: Use struct_size() helper
  kernel/module.c: Only return -EEXIST for modules that have finished loading
2019-07-18 12:06:57 -07:00
Josh Poimboeuf
3193c0836f bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
On x86-64, with CONFIG_RETPOLINE=n, GCC's "global common subexpression
elimination" optimization results in ___bpf_prog_run()'s jumptable code
changing from this:

	select_insn:
		jmp *jumptable(, %rax, 8)
		...
	ALU64_ADD_X:
		...
		jmp *jumptable(, %rax, 8)
	ALU_ADD_X:
		...
		jmp *jumptable(, %rax, 8)

to this:

	select_insn:
		mov jumptable, %r12
		jmp *(%r12, %rax, 8)
		...
	ALU64_ADD_X:
		...
		jmp *(%r12, %rax, 8)
	ALU_ADD_X:
		...
		jmp *(%r12, %rax, 8)

The jumptable address is placed in a register once, at the beginning of
the function.  The function execution can then go through multiple
indirect jumps which rely on that same register value.  This has a few
issues:

1) Objtool isn't smart enough to be able to track such a register value
   across multiple recursive indirect jumps through the jump table.

2) With CONFIG_RETPOLINE enabled, this optimization actually results in
   a small slowdown.  I measured a ~4.7% slowdown in the test_bpf
   "tcpdump port 22" selftest.

   This slowdown is actually predicted by the GCC manual:

     Note: When compiling a program using computed gotos, a GCC
     extension, you may get better run-time performance if you
     disable the global common subexpression elimination pass by
     adding -fno-gcse to the command line.

So just disable the optimization for this function.

Fixes: e55a73251d ("bpf: Fix ORC unwinding in non-JIT BPF code")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/30c3ca29ba037afcbd860a8672eef0021addf9fe.1563413318.git.jpoimboe@redhat.com
2019-07-18 21:01:06 +02:00
Linus Torvalds
818e95c768 The main changes in this release include:
- Add user space specific memory reading for kprobes
  - Allow kprobes to be executed earlier in boot
 
 The rest are mostly just various clean ups and small fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXS88txQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhaPAQDHaAmu6wXtZjZE6GU4ZP61UNgDECmZ
 4wlGrNc1AAlqAQD/QC8339p37aDCp9n27VY1wmJwF3nca+jAHfQLqWkkYgw=
 =n/tz
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The main changes in this release include:

   - Add user space specific memory reading for kprobes

   - Allow kprobes to be executed earlier in boot

  The rest are mostly just various clean ups and small fixes"

* tag 'trace-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
  tracing: Make trace_get_fields() global
  tracing: Let filter_assign_type() detect FILTER_PTR_STRING
  tracing: Pass type into tracing_generic_entry_update()
  ftrace/selftest: Test if set_event/ftrace_pid exists before writing
  ftrace/selftests: Return the skip code when tracing directory not configured in kernel
  tracing/kprobe: Check registered state using kprobe
  tracing/probe: Add trace_event_call accesses APIs
  tracing/probe: Add probe event name and group name accesses APIs
  tracing/probe: Add trace flag access APIs for trace_probe
  tracing/probe: Add trace_event_file access APIs for trace_probe
  tracing/probe: Add trace_event_call register API for trace_probe
  tracing/probe: Add trace_probe init and free functions
  tracing/uprobe: Set print format when parsing command
  tracing/kprobe: Set print format right after parsed command
  kprobes: Fix to init kprobes in subsys_initcall
  tracepoint: Use struct_size() in kmalloc()
  ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS
  ftrace: Enable trampoline when rec count returns back to one
  tracing/kprobe: Do not run kprobe boot tests if kprobe_event is on cmdline
  tracing: Make a separate config for trace event self tests
  ...
2019-07-18 11:51:00 -07:00
Thomas Gleixner
54f698f31e Merge branch 'x86/debug' into core/urgent
Pick up the two pending objtool patches as the next round of objtool fixes
depend on them.
2019-07-18 20:50:48 +02:00
Linus Torvalds
d4df33b0e9 Merge branch 'for-linus-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb updates from Konrad Rzeszutek Wilk:
 "One compiler fix, and a bug-fix in swiotlb_nr_tbl() and
  swiotlb_max_segment() to check also for no_iotlb_memory"

* 'for-linus-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
  swiotlb: fix phys_addr_t overflow warning
  swiotlb: Return consistent SWIOTLB segments/nr_tbl
  swiotlb: Group identical cleanup in swiotlb_cleanup()
2019-07-18 11:48:05 -07:00
Peter Zijlstra
cac9b9a4b0 stacktrace: Force USER_DS for stack_trace_save_user()
When walking userspace stacks, USER_DS needs to be set, otherwise
access_ok() will not function as expected.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lkml.kernel.org/r/20190718085754.GM3402@hirez.programming.kicks-ass.net
2019-07-18 16:47:24 +02:00
Daniel Jordan
cf144f81a9 padata: use smp_mb in padata_reorder to avoid orphaned padata jobs
Testing padata with the tcrypt module on a 5.2 kernel...

    # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
    # modprobe tcrypt mode=211 sec=1

...produces this splat:

    INFO: task modprobe:10075 blocked for more than 120 seconds.
          Not tainted 5.2.0-base+ #16
    modprobe        D    0 10075  10064 0x80004080
    Call Trace:
     ? __schedule+0x4dd/0x610
     ? ring_buffer_unlock_commit+0x23/0x100
     schedule+0x6c/0x90
     schedule_timeout+0x3b/0x320
     ? trace_buffer_unlock_commit_regs+0x4f/0x1f0
     wait_for_common+0x160/0x1a0
     ? wake_up_q+0x80/0x80
     { crypto_wait_req }             # entries in braces added by hand
     { do_one_aead_op }
     { test_aead_jiffies }
     test_aead_speed.constprop.17+0x681/0xf30 [tcrypt]
     do_test+0x4053/0x6a2b [tcrypt]
     ? 0xffffffffa00f4000
     tcrypt_mod_init+0x50/0x1000 [tcrypt]
     ...

The second modprobe command never finishes because in padata_reorder,
CPU0's load of reorder_objects is executed before the unlocking store in
spin_unlock_bh(pd->lock), causing CPU0 to miss CPU1's increment:

CPU0                                 CPU1

padata_reorder                       padata_do_serial
  LOAD reorder_objects  // 0
                                       INC reorder_objects  // 1
                                       padata_reorder
                                         TRYLOCK pd->lock   // failed
  UNLOCK pd->lock

CPU0 deletes the timer before returning from padata_reorder and since no
other job is submitted to padata, modprobe waits indefinitely.

Add a pair of full barriers to guarantee proper ordering:

CPU0                                 CPU1

padata_reorder                       padata_do_serial
  UNLOCK pd->lock
  smp_mb()
  LOAD reorder_objects
                                       INC reorder_objects
                                       smp_mb__after_atomic()
                                       padata_reorder
                                         TRYLOCK pd->lock

smp_mb__after_atomic is needed so the read part of the trylock operation
comes after the INC, as Andrea points out.   Thanks also to Andrea for
help with writing a litmus test.

Fixes: 16295bec63 ("padata: Generic parallelization/serialization interface")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: <stable@vger.kernel.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-18 13:39:54 +08:00
Linus Torvalds
57a8ec387e Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "VM:
   - z3fold fixes and enhancements by Henry Burns and Vitaly Wool

   - more accurate reclaimed slab caches calculations by Yafang Shao

   - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
     Christoph Hellwig

   - !CONFIG_MMU fixes by Christoph Hellwig

   - new novmcoredd parameter to omit device dumps from vmcore, by
     Kairui Song

   - new test_meminit module for testing heap and pagealloc
     initialization, by Alexander Potapenko

   - ioremap improvements for huge mappings, by Anshuman Khandual

   - generalize kprobe page fault handling, by Anshuman Khandual

   - device-dax hotplug fixes and improvements, by Pavel Tatashin

   - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V

   - add pte_devmap() support for arm64, by Robin Murphy

   - unify locked_vm accounting with a helper, by Daniel Jordan

   - several misc fixes

  core/lib:
   - new typeof_member() macro including some users, by Alexey Dobriyan

   - make BIT() and GENMASK() available in asm, by Masahiro Yamada

   - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
     code generation, by Alexey Dobriyan

   - rbtree code size optimizations, by Michel Lespinasse

   - convert struct pid count to refcount_t, by Joel Fernandes

  get_maintainer.pl:
   - add --no-moderated switch to skip moderated ML's, by Joe Perches

  misc:
   - ptrace PTRACE_GET_SYSCALL_INFO interface

   - coda updates

   - gdb scripts, various"

[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
  fs/select.c: use struct_size() in kmalloc()
  mm: add account_locked_vm utility function
  arm64: mm: implement pte_devmap support
  mm: introduce ARCH_HAS_PTE_DEVMAP
  mm: clean up is_device_*_page() definitions
  mm/mmap: move common defines to mman-common.h
  mm: move MAP_SYNC to asm-generic/mman-common.h
  device-dax: "Hotremove" persistent memory that is used like normal RAM
  mm/hotplug: make remove_memory() interface usable
  device-dax: fix memory and resource leak if hotplug fails
  include/linux/lz4.h: fix spelling and copy-paste errors in documentation
  ipc/mqueue.c: only perform resource calculation if user valid
  include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
  scripts/gdb: add helpers to find and list devices
  scripts/gdb: add lx-genpd-summary command
  drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
  kernel/pid.c: convert struct pid count to refcount_t
  drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
  select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
  select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
  ...
2019-07-17 08:58:04 -07:00
Christoph Hellwig
a5008b59cd dma-direct: only limit the mapping size if swiotlb could be used
Don't just check for a swiotlb buffer, but also if buffering might
be required for this particular device.

Fixes: 133d624b1c ("dma: Introduce dma_max_mapping_size()")
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2019-07-17 08:25:45 +02:00
Joel Fernandes (Google)
f57e515a1b kernel/pid.c: convert struct pid count to refcount_t
struct pid's count is an atomic_t field used as a refcount.  Use
refcount_t for it which is basically atomic_t but does additional
checking to prevent use-after-free bugs.

For memory ordering, the only change is with the following:

 -	if ((atomic_read(&pid->count) == 1) ||
 -	     atomic_dec_and_test(&pid->count)) {
 +	if (refcount_dec_and_test(&pid->count)) {
 		kmem_cache_free(ns->pid_cachep, pid);

Here the change is from: Fully ordered --> RELEASE + ACQUIRE (as per
refcount-vs-atomic.rst) This ACQUIRE should take care of making sure the
free happens after the refcount_dec_and_test().

The above hunk also removes atomic_read() since it is not needed for the
code to work and it is unclear how beneficial it is.  The removal lets
refcount_dec_and_test() check for cases where get_pid() happened before
the object was freed.

Link: http://lkml.kernel.org/r/20190701183826.191936-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Oleg Nesterov
b772434be0 signal: simplify set_user_sigmask/restore_user_sigmask
task->saved_sigmask and ->restore_sigmask are only used in the ret-from-
syscall paths.  This means that set_user_sigmask() can save ->blocked in
->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked
was modified.

This way the callers do not need 2 sigset_t's passed to set/restore and
restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns
into the trivial helper which just calls restore_saved_sigmask().

Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Eric Wong <e@80x24.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Laight <David.Laight@aculab.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Elvira Khabirova
201766a20e ptrace: add PTRACE_GET_SYSCALL_INFO request
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.

There are two reasons for a special syscall-related ptrace request.

Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls.  Some examples include:

 * The notorious int-0x80-from-64-bit-task issue. See [1] for details.
   In short, if a 64-bit task performs a syscall through int 0x80, its
   tracer has no reliable means to find out that the syscall was, in
   fact, a compat syscall, and misidentifies it.

 * Syscall-enter-stop and syscall-exit-stop look the same for the
   tracer. Common practice is to keep track of the sequence of
   ptrace-stops in order not to mix the two syscall-stops up. But it is
   not as simple as it looks; for example, strace had a (just recently
   fixed) long-standing bug where attaching strace to a tracee that is
   performing the execve system call led to the tracer identifying the
   following syscall-exit-stop as syscall-enter-stop, which messed up
   all the state tracking.

 * Since the introduction of commit 84d77d3f06 ("ptrace: Don't allow
   accessing an undumpable mm"), both PTRACE_PEEKDATA and
   process_vm_readv become unavailable when the process dumpable flag is
   cleared. On such architectures as ia64 this results in all syscall
   arguments being unavailable for the tracer.

Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee.  For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.

ptrace(2) man page:

long ptrace(enum __ptrace_request request, pid_t pid,
            void *addr, void *data);
...
PTRACE_GET_SYSCALL_INFO
       Retrieve information about the syscall that caused the stop.
       The information is placed into the buffer pointed by "data"
       argument, which should be a pointer to a buffer of type
       "struct ptrace_syscall_info".
       The "addr" argument contains the size of the buffer pointed to
       by "data" argument (i.e., sizeof(struct ptrace_syscall_info)).
       The return value contains the number of bytes available
       to be written by the kernel.
       If the size of data to be written by the kernel exceeds the size
       specified by "addr" argument, the output is truncated.

[ldv@altlinux.org: selftests/seccomp/seccomp_bpf: update for PTRACE_GET_SYSCALL_INFO]
  Link: http://lkml.kernel.org/r/20190708182904.GA12332@altlinux.org
Link: http://lkml.kernel.org/r/20190510152842.GF28558@altlinux.org
Signed-off-by: Elvira Khabirova <lineprinter@altlinux.org>
Co-developed-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Eugene Syromyatnikov <esyr@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Greentime Hu <greentime@andestech.com>
Cc: Helge Deller <deller@gmx.de>	[parisc]
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: kbuild test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:24 -07:00
Weitao Hou
65f50f2553 kernel: fix typos and some coding style in comments
fix lenght to length

Link: http://lkml.kernel.org/r/20190521050937.4370-1-houweitaoo@gmail.com
Signed-off-by: Weitao Hou <houweitaoo@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16 19:23:21 -07:00
Tom Lendacky
9087c37584 dma-direct: Force unencrypted DMA under SME for certain DMA masks
If a device doesn't support DMA to a physical address that includes the
encryption bit (currently bit 47, so 48-bit DMA), then the DMA must
occur to unencrypted memory. SWIOTLB is used to satisfy that requirement
if an IOMMU is not active (enabled or configured in passthrough mode).

However, commit fafadcd165 ("swiotlb: don't dip into swiotlb pool for
coherent allocations") modified the coherent allocation support in
SWIOTLB to use the DMA direct coherent allocation support. When an IOMMU
is not active, this resulted in dma_alloc_coherent() failing for devices
that didn't support DMA addresses that included the encryption bit.

Addressing this requires changes to the force_dma_unencrypted() function
in kernel/dma/direct.c. Since the function is now non-trivial and
SME/SEV specific, update the DMA direct support to add an arch override
for the force_dma_unencrypted() function. The arch override is selected
when CONFIG_AMD_MEM_ENCRYPT is set. The arch override function resides in
the arch/x86/mm/mem_encrypt.c file and forces unencrypted DMA when either
SEV is active or SME is active and the device does not support DMA to
physical addresses that include the encryption bit.

Fixes: fafadcd165 ("swiotlb: don't dip into swiotlb pool for coherent allocations")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
[hch: moved the force_dma_unencrypted declaration to dma-mapping.h,
      fold the s390 fix from Halil Pasic]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-16 22:15:46 +02:00
Linus Torvalds
c309b6f242 docs conversion for v5.3-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl0tpocACgkQCF8+vY7k
 4RWoxA//b/fmDXP3WPzrjjSmpyB9ml0/epKzPbT5S2j0lftqKBmet29k+PCjVrTx
 Nq2QauehY9ug5h8UMVUCmzPr95F0tSIGRoqk1vrn7z0K3q6k1SHrtvqbY1Bgb2Uk
 Qvh2YFU4fQLJg8WAbExCjxCdbdmBKQVGKTwCtM+tP5OMxwAFOmQrjGaUaKCKIIA2
 7Wzrx8CpSji+bJ3uK/d36c+4M9oDly5eaxBhoboL3BI0y+GqwiSASGwTO7BxrPOg
 0wq5IZHnqS8+bprT9xQdDOqf+UOY9U1cxE/+sqsHxblfUEx9gfLy/R+FLmJn+SS9
 Z3yLy4SqVHQMpWBjEAGodohikF60PAuTdymSC11jqFaKCUxWrIZg5xO+0blMrxPF
 7vYIexutCkaBMHBlNaNsHIqB7B/2FGGKoN7QW64hwvwJCGvF7OmJcV+R4bROGvh4
 nFuis9/Nm66Fq7I3aw37ThyZ0aWZdaQ0QJTH9ksxU/ZCz2hhMNYu/rXggrDvkS4U
 nr77ZT5Gd7nj4b110zf8+99uiGiinY6hTfzPAuTCLBhaxwrv4/xDHAhpwdEB5T4j
 8gOkxV8c0XWtL7sKqhGJvs/RRe2za0Y9XH6fyxsYfWcfuLjEvug8ouXMad9gxFWH
 DL3WnKJEMGLScei2wux4kGOwEbkR1bUf2cHJfh3GpCB/y8vgLOc=
 =smxY
 -----END PGP SIGNATURE-----

Merge tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull rst conversion of docs from Mauro Carvalho Chehab:
 "As agreed with Jon, I'm sending this big series directly to you, c/c
  him, as this series required a special care, in order to avoid
  conflicts with other trees"

* tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
  docs: kbuild: fix build with pdf and fix some minor issues
  docs: block: fix pdf output
  docs: arm: fix a breakage with pdf output
  docs: don't use nested tables
  docs: gpio: add sysfs interface to the admin-guide
  docs: locking: add it to the main index
  docs: add some directories to the main documentation index
  docs: add SPDX tags to new index files
  docs: add a memory-devices subdir to driver-api
  docs: phy: place documentation under driver-api
  docs: serial: move it to the driver-api
  docs: driver-api: add remaining converted dirs to it
  docs: driver-api: add xilinx driver API documentation
  docs: driver-api: add a series of orphaned documents
  docs: admin-guide: add a series of orphaned documents
  docs: cgroup-v1: add it to the admin-guide book
  docs: aoe: add it to the driver-api book
  docs: add some documentation dirs to the driver-api book
  docs: driver-model: move it to the driver-api book
  docs: lp855x-driver.rst: add it to the driver-api book
  ...
2019-07-16 12:21:41 -07:00
Cong Wang
0aeb1def44 tracing: Make trace_get_fields() global
trace_get_fields() is the only way to read tracepoint fields at
run time, as their fields are defined at compile-time with macros.
Make this function visible to all users and it will be used by
trace event injection code to calculate the size of a tracepoint
entry.
Link: http://lkml.kernel.org/r/20190525165802.25944-4-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:48 -04:00
Cong Wang
5967bd5c42 tracing: Let filter_assign_type() detect FILTER_PTR_STRING
filter_assign_type() could detect dynamic string and static
string, but not string pointers. Teach filter_assign_type()
to detect string pointers, and this will be needed by trace
event injection code.

BTW, trace event hist uses FILTER_PTR_STRING too.
Link: http://lkml.kernel.org/r/20190525165802.25944-3-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:48 -04:00
Cong Wang
46710f3a34 tracing: Pass type into tracing_generic_entry_update()
All callers of tracing_generic_entry_update() have to initialize
entry->type, so let's just simply move it inside.
Link: http://lkml.kernel.org/r/20190525165802.25944-2-xiyou.wangcong@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:48 -04:00
Masami Hiramatsu
715fa2fd4c tracing/kprobe: Check registered state using kprobe
Change registered check only by trace_kprobe and remove
TP_FLAG_REGISTERED from trace_probe, since this feature
is only used for trace_kprobe.

Link: http://lkml.kernel.org/r/155931588704.28323.4952266828256245833.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
e3dc9f898e tracing/probe: Add trace_event_call accesses APIs
Add trace_event_call access APIs for trace_probe.
Instead of accessing trace_probe.call directly, use those
accesses by trace_probe_event_call() method. This hides
the relationship of trace_event_call and trace_probe from
trace_kprobe and trace_uprobe.

Link: http://lkml.kernel.org/r/155931587711.28323.8335129014686133120.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
b55ce203a8 tracing/probe: Add probe event name and group name accesses APIs
Add trace_probe_name() and trace_probe_group_name() functions
for accessing probe name and group name of trace_probe.

Link: http://lkml.kernel.org/r/155931586717.28323.8738615064952254761.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
747774d6b0 tracing/probe: Add trace flag access APIs for trace_probe
Add trace_probe_test/set/clear_flag() functions for accessing
trace_probe.flag field.
This flags field should not be accessed directly.

Link: http://lkml.kernel.org/r/155931585683.28323.314290023236905988.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
b5f935ee13 tracing/probe: Add trace_event_file access APIs for trace_probe
Add trace_event_file access APIs for trace_probe data structure.
This simplifies enabling/disabling operations in uprobe and kprobe
events so that those don't touch deep inside the trace_probe.

This also removing a redundant synchronization when the
kprobe event is used from perf, since the perf itself uses
tracepoint_synchronize_unregister() after disabling (ftrace-
defined) event, thus we don't have to synchronize in that
path. Also we don't need to identify local trace_kprobe too
anymore.

Link: http://lkml.kernel.org/r/155931584587.28323.372301976283354629.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
46e5376d40 tracing/probe: Add trace_event_call register API for trace_probe
Since trace_event_call is a field of trace_probe, these
operations should be done in trace_probe.c. trace_kprobe
and trace_uprobe use new functions to register/unregister
trace_event_call.

Link: http://lkml.kernel.org/r/155931583643.28323.14828411185591538876.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
455b289973 tracing/probe: Add trace_probe init and free functions
Add common trace_probe init and cleanup function in
trace_probe.c, and use it from trace_kprobe.c and trace_uprobe.c

Link: http://lkml.kernel.org/r/155931582664.28323.5934870189034740822.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
b4d4b96be8 tracing/uprobe: Set print format when parsing command
Set event call's print format right after parsed command for
simplifying (un)register_uprobe_event().

Link: http://lkml.kernel.org/r/155931581659.28323.5404667166417404076.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
f730e0f2da tracing/kprobe: Set print format right after parsed command
Set event call's print format right after parsed command for
simplifying (un)register_kprobe_event().

Link: http://lkml.kernel.org/r/155931580625.28323.5158822928646225903.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:14:47 -04:00
Masami Hiramatsu
65fc965c70 kprobes: Fix to init kprobes in subsys_initcall
Since arm64 kernel initializes breakpoint trap vector in arch_initcall(),
initializing kprobe (and run smoke test) in postcore_initcall() causes
a kernel panic.

To fix this issue, move the kprobe initialization in subsys_initcall()
(which is called right afer the arch_initcall).

In-kernel kprobe users (ftrace and bpf) are using fs_initcall() which is
called after subsys_initcall(), so this shouldn't cause more problem.

Link: http://lkml.kernel.org/r/155956708268.12228.10363800793132214198.stgit@devnote2
Link: http://lkml.kernel.org/r/20190709153755.GB10123@lakrids.cambridge.arm.com

Reported-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: b5f8b32c93 ("kprobes: Initialize kprobes at postcore_initcall")
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-07-16 15:13:45 -04:00
Linus Torvalds
3c69914b4c for-linus-20190715
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSxgkQAKCRCRxhvAZXjc
 opJ3AP9TWIWhEC0dzbNmzh0STj/Vyl5KYEpdMbi7HNeAmIEAfQD6A3HI4bVJbN08
 jH44U7DLzHJyHefKlB8jHEKEVYJWqgo=
 =74Bg
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd and clone3 fixes from Christian Brauner:
 "This contains a bugfix for CLONE_PIDFD when used with the legacy clone
  syscall, two fixes to ensure that syscall numbering and clone3
  entrypoint implementations will stay consistent, and an update for the
  maintainers file:

   - The addition of clone3 broke CLONE_PIDFD for legacy clone on all
     architectures that use do_fork() directly instead of calling the
     clone syscall itself. (Fwiw, cleaning do_fork() up is on my todo.)

     The reason this happened was that during conversion of _do_fork()
     to use struct kernel_clone_args we missed that do_fork() is called
     directly by various architectures. This is fixed by making sure
     that the pidfd argument in struct kernel_clone_args is correctly
     initialized with the parent_tidptr argument passed down from
     do_fork(). Additionally, do_fork() missed a check to make
     CLONE_PIDFD and CLONE_PARENT_SETTID mutually exclusive just a
     clone() does. This is now fixed too.

   - When clone3() was introduced we skipped architectures that require
     special handling for fork-like syscalls. Their syscall tables did
     not contain any mention of clone3().

     To make sure that Arnd's work to make syscall numbers on all
     architectures identical (minus alpha) was not for naught we are
     placing a comment in all syscall tables that do not yet implement
     clone3(). The comment makes it clear that 435 is reserved for
     clone3 and should not be used.

   - Also, this contains a patch to make the clone3() syscall definition
     in asm-generic/unist.h conditional on __ARCH_WANT_SYS_CLONE3. This
     lets us catch new architectures that implicitly make use of clone3
     without setting __ARCH_WANT_SYS_CLONE3 which is a good indicator
     that they did not check whether it needs special treatment or not.

   - Finally, this contains a patch to add me as maintainer for pidfd
     stuff so people can start blaming me (more)"

* tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  MAINTAINERS: add new entry for pidfd api
  unistd: protect clone3 via __ARCH_WANT_SYS_CLONE3
  arch: mark syscall number 435 reserved for clone3
  clone: fix CLONE_PIDFD support
2019-07-16 11:30:07 -07:00
Linus Torvalds
fb4da215ed pci-v5.3-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl0siFoUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vzi9A//S4jRyyZrgUr88Az0GbgMhE4b3yqc
 uL7om/Sf+443gG6C+aKkZSM/IE9hrbyIKuYq7GGxDkzZ/HkucZo2yIuAHkPgG4ik
 QQYJ8fJsmMq1bUht87c1ZZwGP0++Deq/Ns2+VNy/WBYqKLulnV0DvEEaJgPs9C5D
 ppwccGdo6UghiujBTpE4ddUBjFjjURWqT6wSnMRDQ4EGwfUhG0MWwwHKI4hbBuaL
 N6refuggdYyUUX5FeUOHa6VF6uTnSSAQ75k+40n4nljdayqoumHLskst77o9q5ZI
 oXjdpwgmuEqYhfp03HEA4Xo/bBxiRj76NuTiEMKvPokxjpanwbLrdV0GhF0OIlM0
 rp1NOI1w+vppFrU+rc2gtq+7hYXFmvdhjS29hFLeD91PP36N5d29jW5NVFpm7GCm
 n4TMGAOsu8RB+bNua6ZbZVcDk2EnPgQeIcM0ZPoBtPK19Fg/rScdEU4u/aFE1Y0Q
 C+Ks7D1qCvFpHzl/xAg0oo9v/jFsWef3qnQWOzot964Zz4W4NSVvB9Ox6Vbfj6C4
 v331LJmlPxG8fxBNA3q28FrTxcG1NW6sgo3WY9VoSp/vc0aqaPKhm7sbraTt5IrI
 TwqA/WhnAHv90MQCGFcofANyYTkjPkKk2QBFK6b0suoAmVdwVWWELi1WaZ+HdvgQ
 JP7YpmC2cXcQBPk=
 =ZGxL
 -----END PGP SIGNATURE-----

Merge tag 'pci-v5.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "Enumeration changes:

   - Evaluate PCI Boot Configuration _DSM to learn if firmware wants us
     to preserve its resource assignments (Benjamin Herrenschmidt)

   - Simplify resource distribution (Nicholas Johnson)

   - Decode 32 GT/s link speed (Gustavo Pimentel)

  Virtualization:

   - Fix incorrect caching of VF config space size (Alex Williamson)

   - Fix VF driver probing sysfs knobs (Alex Williamson)

  Peer-to-peer DMA:

   - Fix dma_virt_ops check (Logan Gunthorpe)

  Altera host bridge driver:

   - Allow building as module (Ley Foon Tan)

  Armada 8K host bridge driver:

   - add PHYs support (Miquel Raynal)

  DesignWare host bridge driver:

   - Export APIs to support removable loadable module (Vidya Sagar)

   - Enable Relaxed Ordering erratum workaround only on Tegra20 &
     Tegra30 (Vidya Sagar)

  Hyper-V host bridge driver:

   - Fix use-after-free in eject (Dexuan Cui)

  Mobiveil host bridge driver:

   - Clean up and fix many issues, including non-identify mapped
     windows, 64-bit windows, multi-MSI, class code, INTx clearing (Hou
     Zhiqiang)

  Qualcomm host bridge driver:

   - Use clk bulk API for 2.4.0 controllers (Bjorn Andersson)

   - Add QCS404 support (Bjorn Andersson)

   - Assert PERST for at least 100ms (Niklas Cassel)

  R-Car host bridge driver:

   - Add r8a774a1 DT support (Biju Das)

  Tegra host bridge driver:

   - Add support for Gen2, opportunistic UpdateFC and ACK (PCIe protocol
     details) AER, GPIO-based PERST# (Manikanta Maddireddy)

   - Fix many issues, including power-on failure cases, interrupt
     masking in suspend, UPHY settings, AFI dynamic clock gating,
     pending DLL transactions (Manikanta Maddireddy)

  Xilinx host bridge driver:

   - Fix NWL Multi-MSI programming (Bharat Kumar Gogada)

  Endpoint support:

   - Fix 64bit BAR support (Alan Mikhak)

   - Fix pcitest build issues (Alan Mikhak, Andy Shevchenko)

  Bug fixes:

   - Fix NVIDIA GPU multi-function power dependencies (Abhishek Sahu)

   - Fix NVIDIA GPU HDA enablement issue (Lukas Wunner)

   - Ignore lockdep for sysfs "remove" (Marek Vasut)

  Misc:

   - Convert docs to reST (Changbin Du, Mauro Carvalho Chehab)"

* tag 'pci-v5.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (107 commits)
  PCI: Enable NVIDIA HDA controllers
  tools: PCI: Fix installation when `make tools/pci_install`
  PCI: dwc: pci-dra7xx: Fix compilation when !CONFIG_GPIOLIB
  PCI: Fix typos and whitespace errors
  PCI: mobiveil: Fix INTx interrupt clearing in mobiveil_pcie_isr()
  PCI: mobiveil: Fix infinite-loop in the INTx handling function
  PCI: mobiveil: Move PCIe PIO enablement out of inbound window routine
  PCI: mobiveil: Add upper 32-bit PCI base address setup in inbound window
  PCI: mobiveil: Add upper 32-bit CPU base address setup in outbound window
  PCI: mobiveil: Mask out hardcoded bits in inbound/outbound windows setup
  PCI: mobiveil: Clear the control fields before updating it
  PCI: mobiveil: Add configured inbound windows counter
  PCI: mobiveil: Fix the valid check for inbound and outbound windows
  PCI: mobiveil: Clean-up program_{ib/ob}_windows()
  PCI: mobiveil: Remove an unnecessary return value check
  PCI: mobiveil: Fix error return values
  PCI: mobiveil: Refactor the MEM/IO outbound window initialization
  PCI: mobiveil: Make some register updates more readable
  PCI: mobiveil: Reformat the code for readability
  dt-bindings: PCI: mobiveil: Change gpio_slave and apb_csr to optional
  ...
2019-07-15 20:44:49 -07:00
Andrii Nakryiko
1acc5d5c58 bpf: fix BTF verifier size resolution logic
BTF verifier has a size resolution bug which in some circumstances leads to
invalid size resolution for, e.g., TYPEDEF modifier.  This happens if we have
[1] PTR -> [2] TYPEDEF -> [3] ARRAY, in which case due to being in pointer
context ARRAY size won't be resolved (because for pointer it doesn't matter, so
it's a sink in pointer context), but it will be permanently remembered as zero
for TYPEDEF and TYPEDEF will be marked as RESOLVED. Eventually ARRAY size will
be resolved correctly, but TYPEDEF resolved_size won't be updated anymore.
This, subsequently, will lead to erroneous map creation failure, if that
TYPEDEF is specified as either key or value, as key_size/value_size won't
correspond to resolved size of TYPEDEF (kernel will believe it's zero).

Note, that if BTF was ordered as [1] ARRAY <- [2] TYPEDEF <- [3] PTR, this
won't be a problem, as by the time we get to TYPEDEF, ARRAY's size is already
calculated and stored.

This bug manifests itself in rejecting BTF-defined maps that use array
typedef as a value type:

typedef int array_t[16];

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __type(value, array_t); /* i.e., array_t *value; */
} test_map SEC(".maps");

The fix consists on not relying on modifier's resolved_size and instead using
modifier's resolved_id (type ID for "concrete" type to which modifier
eventually resolves) and doing size determination for that resolved type. This
allow to preserve existing "early DFS termination" logic for PTR or
STRUCT_OR_ARRAY contexts, but still do correct size determination for modifier
types.

Fixes: eb3f595dab ("bpf: btf: Validate type reference")
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-15 23:02:17 +02:00
Mauro Carvalho Chehab
da82c92f11 docs: cgroup-v1: add it to the admin-guide book
Those files belong to the admin guide, so add them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 11:03:02 -03:00
Mauro Carvalho Chehab
5704324702 docs: admin-guide: move sysctl directory to it
The stuff under sysctl describes /sys interface from userspace
point of view. So, add it to the admin-guide and remove the
:orphan: from its index file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 11:03:01 -03:00
Mauro Carvalho Chehab
53b9537509 docs: sysctl: convert to ReST
Rename the /proc/sys/ documentation files to ReST, using the
README file as a template for an index.rst, adding the other
files there via TOC markup.

Despite being written on different times with different
styles, try to make them somewhat coherent with a similar
look and feel, ensuring that they'll look nice as both
raw text file and as via the html output produced by the
Sphinx build system.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-15 09:20:26 -03:00
Mauro Carvalho Chehab
387b14684f docs: locking: convert docs to ReST and rename to *.rst
Convert the locking documents to ReST and add them to the
kernel development book where it belongs.

Most of the stuff here is just to make Sphinx to properly
parse the text file, as they're already in good shape,
not requiring massive changes in order to be parsed.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Federico Vaga <federico.vaga@vaga.pv.it>
2019-07-15 08:53:27 -03:00
Linus Torvalds
fec88ab0af HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
 
 - Improve clarity, locking and APIs related to the 'hmm mirror' feature
   merged last cycle. In linux-next we now see AMDGPU and nouveau to be
   using this API.
 
 - Remove old or transitional hmm APIs. These are hold overs from the past
   with no users, or APIs that existed only to manage cross tree conflicts.
   There are still a few more of these cleanups that didn't make the merge
   window cut off.
 
 - Improve some core mm APIs:
   * export alloc_pages_vma() for driver use
   * refactor into devm_request_free_mem_region() to manage
     DEVICE_PRIVATE resource reservations
   * refactor duplicative driver code into the core dev_pagemap
     struct
 
 - Remove hmm wrappers of improved core mm APIs, instead have drivers use
   the simplified API directly
 
 - Remove DEVICE_PUBLIC
 
 - Simplify the kconfig flow for the hmm users and core code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
 mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
 xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
 CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
 Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
 4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
 YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
 VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
 bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
 hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
 FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
 lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
 =wKvp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.

   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
2019-07-14 19:42:11 -07:00
Linus Torvalds
1d03985933 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A number of PMU driver corner case fixes, a race fix, an event
  grouping fix, plus a bunch of tooling fixes/updates"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  perf/x86/intel: Fix spurious NMI on fixed counter
  perf/core: Fix exclusive events' grouping
  perf/x86/amd/uncore: Set the thread mask for F17h L3 PMCs
  perf/x86/amd/uncore: Do not set 'ThreadMask' and 'SliceMask' for non-L3 PMCs
  perf/core: Fix race between close() and fork()
  perf intel-pt: Fix potential NULL pointer dereference found by the smatch tool
  perf intel-bts: Fix potential NULL pointer dereference found by the smatch tool
  perf script: Assume native_arch for pipe mode
  perf scripts python: export-to-sqlite.py: Fix DROP VIEW power_events_view
  perf scripts python: export-to-postgresql.py: Fix DROP VIEW power_events_view
  perf hists browser: Fix potential NULL pointer dereference found by the smatch tool
  perf cs-etm: Fix potential NULL pointer dereference found by the smatch tool
  perf parse-events: Remove unused variable: error
  perf parse-events: Remove unused variable 'i'
  perf metricgroup: Add missing list_del_init() when flushing egroups list
  perf tools: Use list_del_init() more thorougly
  perf tools: Use zfree() where applicable
  tools lib: Adopt zalloc()/zfree() from tools/perf
  perf tools: Move get_current_dir_name() cond prototype out of util.h
  perf namespaces: Move the conditional setns() prototype to namespaces.h
  ...
2019-07-14 11:40:33 -07:00
Linus Torvalds
0c85ce1354 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
 "A single fix for a locking statistics bug"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/lockdep: Fix lock used or unused stats error
2019-07-14 11:39:21 -07:00
Dmitry V. Levin
028b6e8a89
clone: fix CLONE_PIDFD support
The introduction of clone3 syscall accidentally broke CLONE_PIDFD
support in traditional clone syscall on compat x86 and those
architectures that use do_fork to implement clone syscall.

This bug was found by strace test suite.

Link: https://strace.io/logs/strace/2019-07-12
Fixes: 7f192e3cd3 ("fork: add clone3")
Bisected-and-tested-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Link: https://lore.kernel.org/r/20190714162047.GB10389@altlinux.org
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-14 20:36:12 +02:00
Linus Torvalds
50ec18819c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a sched statistics related bug that would trigger a kernel warning
  on certain configs"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix preempt warning in ttwu
2019-07-14 11:34:05 -07:00
Yuyang Du
68d41d8c94 locking/lockdep: Fix lock used or unused stats error
The stats variable nr_unused_locks is incremented every time a new lock
class is register and decremented when the lock is first used in
__lock_acquire(). And after all, it is shown and checked in lockdep_stats.

However, under configurations that either CONFIG_TRACE_IRQFLAGS or
CONFIG_PROVE_LOCKING is not defined:

The commit:

  0918065151 ("locking/lockdep: Consolidate lock usage bit initialization")

missed marking the LOCK_USED flag at IRQ usage initialization because
as mark_usage() is not called. And the commit:

  886532aee3 ("locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")

further made mark_lock() not defined such that the LOCK_USED cannot be
marked at all when the lock is first acquired.

As a result, we fix this by not showing and checking the stats under such
configurations for lockdep_stats.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: arnd@arndb.de
Cc: frederic@kernel.org
Link: https://lkml.kernel.org/r/20190709101522.9117-1-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:24:53 +02:00
Peter Zijlstra
e3d85487fb sched/core: Fix preempt warning in ttwu
John reported a DEBUG_PREEMPT warning caused by commit:

  aacedf26fb ("sched/core: Optimize try_to_wake_up() for local wakeups")

I overlooked that ttwu_stat() requires preemption disabled.

Reported-by: John Stultz <john.stultz@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: aacedf26fb ("sched/core: Optimize try_to_wake_up() for local wakeups")
Link: https://lkml.kernel.org/r/20190710105736.GK3402@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:23:27 +02:00
Alexander Shishkin
8a58ddae23 perf/core: Fix exclusive events' grouping
So far, we tried to disallow grouping exclusive events for the fear of
complications they would cause with moving between contexts. Specifically,
moving a software group to a hardware context would violate the exclusivity
rules if both groups contain matching exclusive events.

This attempt was, however, unsuccessful: the check that we have in the
perf_event_open() syscall is both wrong (looks at wrong PMU) and
insufficient (group leader may still be exclusive), as can be illustrated
by running:

  $ perf record -e '{intel_pt//,cycles}' uname
  $ perf record -e '{cycles,intel_pt//}' uname

ultimately successfully.

Furthermore, we are completely free to trigger the exclusivity violation
by:

   perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'

even though the helpful perf record will not allow that, the ABI will.

The warning later in the perf_event_open() path will also not trigger, because
it's also wrong.

Fix all this by validating the original group before moving, getting rid
of broken safeguards and placing a useful one to perf_install_in_context().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mathieu.poirier@linaro.org
Cc: will.deacon@arm.com
Fixes: bed5b25ad9 ("perf: Add a pmu capability for "exclusive" events")
Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:21:28 +02:00
Peter Zijlstra
1cf8dfe8a6 perf/core: Fix race between close() and fork()
Syzcaller reported the following Use-after-Free bug:

	close()						clone()

							  copy_process()
							    perf_event_init_task()
							      perf_event_init_context()
							        mutex_lock(parent_ctx->mutex)
								inherit_task_group()
								  inherit_group()
								    inherit_event()
								      mutex_lock(event->child_mutex)
								      // expose event on child list
								      list_add_tail()
								      mutex_unlock(event->child_mutex)
							        mutex_unlock(parent_ctx->mutex)

							    ...
							    goto bad_fork_*

							  bad_fork_cleanup_perf:
							    perf_event_free_task()

	  perf_release()
	    perf_event_release_kernel()
	      list_for_each_entry()
		mutex_lock(ctx->mutex)
		mutex_lock(event->child_mutex)
		// event is from the failing inherit
		// on the other CPU
		perf_remove_from_context()
		list_move()
		mutex_unlock(event->child_mutex)
		mutex_unlock(ctx->mutex)

							      mutex_lock(ctx->mutex)
							      list_for_each_entry_safe()
							        // event already stolen
							      mutex_unlock(ctx->mutex)

							    delayed_free_task()
							      free_task()

	     list_for_each_entry_safe()
	       list_del()
	       free_event()
	         _free_event()
		   // and so event->hw.target
		   // is the already freed failed clone()
		   if (event->hw.target)
		     put_task_struct(event->hw.target)
		       // WHOOPSIE, already quite dead

Which puts the lie to the the comment on perf_event_free_task():
'unexposed, unused context' not so much.

Which is a 'fun' confluence of fail; copy_process() doing an
unconditional free_task() and not respecting refcounts, and perf having
creative locking. In particular:

  82d94856fa ("perf/core: Fix lock inversion between perf,trace,cpuhp")

seems to have overlooked this 'fun' parade.

Solve it by using the fact that detached events still have a reference
count on their (previous) context. With this perf_event_free_task()
can detect when events have escaped and wait for their destruction.

Debugged-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reported-by: syzbot+a24c397a29ad22d86c98@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 82d94856fa ("perf/core: Fix lock inversion between perf,trace,cpuhp")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-13 11:21:25 +02:00
Linus Torvalds
39ceda5ce1 Kbuild updates for v5.3
- remove headers_{install,check}_all targets
 
 - remove unreasonable 'depends on !UML' from CONFIG_SAMPLES
 
 - re-implement 'make headers_install' more cleanly
 
 - add new header-test-y syntax to compile-test headers
 
 - compile-test exported headers to ensure they are compilable in
   user-space
 
 - compile-test headers under include/ to ensure they are self-contained
 
 - remove -Waggregate-return, -Wno-uninitialized, -Wno-unused-value flags
 
 - add -Werror=unknown-warning-option for Clang
 
 - add 128-bit built-in types support to genksyms
 
 - fix missed rebuild of modules.builtin
 
 - propagate 'No space left on device' error in fixdep to Make
 
 - allow Clang to use its integrated assembler
 
 - improve some coccinelle scripts
 
 - add a new flag KBUILD_ABS_SRCTREE to request Kbuild to use absolute
   path for $(srctree).
 
 - do not ignore errors when compression utility is missing
 
 - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJSBAABCgA8FiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl0oxNkeHHlhbWFkYS5t
 YXNhaGlyb0Bzb2Npb25leHQuY29tAAoJED2LAQed4NsGnhcP/AuM8s+3SYFiLitJ
 ISbznLFP2Xatq0SPXp5+moez/AMTK6Mm1biPcdo20d+TjVEh4+9F2nq12Ii9U8/D
 tds9A6G8+Bb28r9GMIVQPdFohijW6ijtDziS31iQnIWyPsP/yx6PKfLAD9F4ca1x
 7/4btmu+BOMjtN0NrMWSNz5MM47xUzoWIALL40SV4PzGVXLCQZ2PBNPeSRIk22Jt
 ynDNPuNsmDWcFfwAE+sLSDrhCHZlwM8rg8rf6jmYdc4LcN4cj0oho5+K1TRyC9mn
 fO3PT25juFejthxQulxEfyGggnyLM6BNTgPDGcCHSP4nD7mlXA9GcpZICtJOgGGu
 SlDadMZ0GRMK5zcZ0MF0GQboeyViwsbXgrRcYuXt6cUFWX4P/1SeAQ5Mf4u1EKqf
 hEbwFXV/g81ht0lFS8gyWkvdpoNPtxGHNPusLjp65C4rc0/48/s+7EE/u8JTPl1g
 dQTeIOds6XUOkJgqhEfuq+8gfngbjKc9bYhs+ACbkCzBltQdnb6m5aLgk0ODxe8I
 WbGn0+cQcS9VVwre7E5DnFSVWVOHAG5taiUwj0KDcHB0Jxw9Gvorq9WU1ppHHYH2
 XQIFBx7XHdn28d+plS8R23vAPgDgrGdvE5RYK5tNQLhTJ6BbjlZ1n/Tmxzu62scK
 deG3aCOB13Om7OTzTUh9+C3TC9ZQ
 =E2Rz
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - remove headers_{install,check}_all targets

 - remove unreasonable 'depends on !UML' from CONFIG_SAMPLES

 - re-implement 'make headers_install' more cleanly

 - add new header-test-y syntax to compile-test headers

 - compile-test exported headers to ensure they are compilable in
   user-space

 - compile-test headers under include/ to ensure they are self-contained

 - remove -Waggregate-return, -Wno-uninitialized, -Wno-unused-value
   flags

 - add -Werror=unknown-warning-option for Clang

 - add 128-bit built-in types support to genksyms

 - fix missed rebuild of modules.builtin

 - propagate 'No space left on device' error in fixdep to Make

 - allow Clang to use its integrated assembler

 - improve some coccinelle scripts

 - add a new flag KBUILD_ABS_SRCTREE to request Kbuild to use absolute
   path for $(srctree).

 - do not ignore errors when compression utility is missing

 - misc cleanups

* tag 'kbuild-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (49 commits)
  kbuild: use -- separater intead of $(filter-out ...) for cc-cross-prefix
  kbuild: Inform user to pass ARCH= for make mrproper
  kbuild: fix compression errors getting ignored
  kbuild: add a flag to force absolute path for srctree
  kbuild: replace KBUILD_SRCTREE with boolean building_out_of_srctree
  kbuild: remove src and obj from the top Makefile
  scripts/tags.sh: remove unused environment variables from comments
  scripts/tags.sh: drop SUBARCH support for ARM
  kbuild: compile-test kernel headers to ensure they are self-contained
  kheaders: include only headers into kheaders_data.tar.xz
  kheaders: remove meaningless -R option of 'ls'
  kbuild: support header-test-pattern-y
  kbuild: do not create wrappers for header-test-y
  kbuild: compile-test exported headers to ensure they are self-contained
  init/Kconfig: add CONFIG_CC_CAN_LINK
  kallsyms: exclude kasan local symbols on s390
  kbuild: add more hints about SUBDIRS replacement
  coccinelle: api/stream_open: treat all wait_.*() calls as blocking
  coccinelle: put_device: Add a cast to an expression for an assignment
  coccinelle: put_device: Adjust a message construction
  ...
2019-07-12 16:03:16 -07:00
Linus Torvalds
9e3a25dc99 dma-mapping updates for Linux 5.3
- move the USB special case that bounced DMA through a device
    bar into the USB code instead of handling it in the common
    DMA code (Laurentiu Tudor and Fredrik Noring)
  - don't dip into the global CMA pool for single page allocations
    (Nicolin Chen)
  - fix a crash when allocating memory for the atomic pool failed
    during boot (Florian Fainelli)
  - move support for MIPS-style uncached segments to the common
    code and use that for MIPS and nios2 (me)
  - make support for DMA_ATTR_NON_CONSISTENT and
    DMA_ATTR_NO_KERNEL_MAPPING generic (me)
  - convert nds32 to the generic remapping allocator (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl0nPqgLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNj2hAAxIv2O3wv6V5xhzWwOVo8e/xW1ZLlGAF0/z92u0do
 32Tm8jkdAGjZDnyxam7qisMSIjCNykpauQzVVxyUNBRSsn1V5t7KSaH3/OXCOVcr
 x2VWBirxGO2BbRseaCBjIcA/2qna+VIDGFcNXCtf6rM00YUK6qaJzkMwBKQAeYcM
 uJMJkaf8qaW4hygLJP8axXiGFdIJyFNLAlJ+ok6kYsJHHJNceOp0bo3CDa2mJBK9
 IhraK2zVkyE5EQkQM5cE/Kw1ppPelUKUkHwjgM4wpz2b18WbLu11nKP0hmUcvKRQ
 heY8xWiKxN0QTgS03ou7EVylyrSAE4dIKgzuA4VO32QCGsWypcAg4iU6s5TX6p9g
 tZEW2ckE6wbmRdQPyKoDpZg299/eQjRHc4MAA1yinT8tFMokw2tk8Fq1FWyltwL1
 8EiP5oNs2qUNvNgqUresl6/f6YOacFi1Q6IhgBVj6d6lyhMhlsHfW4w1XA1siv/I
 6l4qJbLohYab6hY7i+mBOd8iG/KrAlr4P6admnv2jDchswbb5t2j+ABE9xv++PFi
 u1HFqMlxqdWQaXGca2UeCUxUjkwO9N+kHpP+VRz+6D2b64dtCWSu8CN23sYXm2tO
 ubWIlrQQZPhhMkoFg7XqKSTacd+ut+SXN9Nxsyv548ETV0l1xbiLRHIbhyoIESD5
 RAI=
 =01Fr
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - move the USB special case that bounced DMA through a device bar into
   the USB code instead of handling it in the common DMA code (Laurentiu
   Tudor and Fredrik Noring)

 - don't dip into the global CMA pool for single page allocations
   (Nicolin Chen)

 - fix a crash when allocating memory for the atomic pool failed during
   boot (Florian Fainelli)

 - move support for MIPS-style uncached segments to the common code and
   use that for MIPS and nios2 (me)

 - make support for DMA_ATTR_NON_CONSISTENT and
   DMA_ATTR_NO_KERNEL_MAPPING generic (me)

 - convert nds32 to the generic remapping allocator (me)

* tag 'dma-mapping-5.3' of git://git.infradead.org/users/hch/dma-mapping: (29 commits)
  dma-mapping: mark dma_alloc_need_uncached as __always_inline
  MIPS: only select ARCH_HAS_UNCACHED_SEGMENT for non-coherent platforms
  usb: host: Fix excessive alignment restriction for local memory allocations
  lib/genalloc.c: Add algorithm, align and zeroed family of DMA allocators
  nios2: use the generic uncached segment support in dma-direct
  nds32: use the generic remapping allocator for coherent DMA allocations
  arc: use the generic remapping allocator for coherent DMA allocations
  dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code
  dma-direct: handle DMA_ATTR_NON_CONSISTENT in common code
  dma-mapping: add a dma_alloc_need_uncached helper
  openrisc: remove the partial DMA_ATTR_NON_CONSISTENT support
  arc: remove the partial DMA_ATTR_NON_CONSISTENT support
  arm-nommu: remove the partial DMA_ATTR_NON_CONSISTENT support
  ARM: dma-mapping: allow larger DMA mask than supported
  dma-mapping: truncate dma masks to what dma_addr_t can hold
  iommu/dma: Apply dma_{alloc,free}_contiguous functions
  dma-remap: Avoid de-referencing NULL atomic_pool
  MIPS: use the generic uncached segment support in dma-direct
  dma-direct: provide generic support for uncached kernel segments
  au1100fb: fix DMA API abuse
  ...
2019-07-12 15:13:55 -07:00
Linus Torvalds
f632a8170a Driver Core and debugfs changes for 5.3-rc1
Here is the "big" driver core and debugfs changes for 5.3-rc1
 
 It's a lot of different patches, all across the tree due to some api
 changes and lots of debugfs cleanups.  Because of this, there is going
 to be some merge issues with your tree at the moment, I'll follow up
 with the expected resolutions to make it easier for you.
 
 Other than the debugfs cleanups, in this set of changes we have:
 	- bus iteration function cleanups (will cause build warnings
 	  with s390 and coresight drivers in your tree)
 	- scripts/get_abi.pl tool to display and parse Documentation/ABI
 	  entries in a simple way
 	- cleanups to Documenatation/ABI/ entries to make them parse
 	  easier due to typos and other minor things
 	- default_attrs use for some ktype users
 	- driver model documentation file conversions to .rst
 	- compressed firmware file loading
 	- deferred probe fixes
 
 All of these have been in linux-next for a while, with a bunch of merge
 issues that Stephen has been patient with me for.  Other than the merge
 issues, functionality is working properly in linux-next :)
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXSgpnQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykcwgCfS30OR4JmwZydWGJ7zK/cHqk+KjsAnjOxjC1K
 LpRyb3zX29oChFaZkc5a
 =XrEZ
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core and debugfs updates from Greg KH:
 "Here is the "big" driver core and debugfs changes for 5.3-rc1

  It's a lot of different patches, all across the tree due to some api
  changes and lots of debugfs cleanups.

  Other than the debugfs cleanups, in this set of changes we have:

   - bus iteration function cleanups

   - scripts/get_abi.pl tool to display and parse Documentation/ABI
     entries in a simple way

   - cleanups to Documenatation/ABI/ entries to make them parse easier
     due to typos and other minor things

   - default_attrs use for some ktype users

   - driver model documentation file conversions to .rst

   - compressed firmware file loading

   - deferred probe fixes

  All of these have been in linux-next for a while, with a bunch of
  merge issues that Stephen has been patient with me for"

* tag 'driver-core-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (102 commits)
  debugfs: make error message a bit more verbose
  orangefs: fix build warning from debugfs cleanup patch
  ubifs: fix build warning after debugfs cleanup patch
  driver: core: Allow subsystems to continue deferring probe
  drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT
  arch_topology: Remove error messages on out-of-memory conditions
  lib: notifier-error-inject: no need to check return value of debugfs_create functions
  swiotlb: no need to check return value of debugfs_create functions
  ceph: no need to check return value of debugfs_create functions
  sunrpc: no need to check return value of debugfs_create functions
  ubifs: no need to check return value of debugfs_create functions
  orangefs: no need to check return value of debugfs_create functions
  nfsd: no need to check return value of debugfs_create functions
  lib: 842: no need to check return value of debugfs_create functions
  debugfs: provide pr_fmt() macro
  debugfs: log errors when something goes wrong
  drivers: s390/cio: Fix compilation warning about const qualifiers
  drivers: Add generic helper to match by of_node
  driver_find_device: Unify the match function with class_find_device()
  bus_find_device: Unify the match callback with class_find_device
  ...
2019-07-12 12:24:03 -07:00
Linus Torvalds
ef8f3d48af Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "Am experimenting with splitting MM up into identifiable subsystems
  perhaps with a view to gitifying it in complex ways. Also with more
  verbose "incoming" emails.

  Most of MM is here and a few other trees.

  Subsystems affected by this patch series:
   - hotfixes
   - iommu
   - scripts
   - arch/sh
   - ocfs2
   - mm:slab-generic
   - mm:slub
   - mm:kmemleak
   - mm:kasan
   - mm:cleanups
   - mm:debug
   - mm:pagecache
   - mm:swap
   - mm:memcg
   - mm:gup
   - mm:pagemap
   - mm:infrastructure
   - mm:vmalloc
   - mm:initialization
   - mm:pagealloc
   - mm:vmscan
   - mm:tools
   - mm:proc
   - mm:ras
   - mm:oom-kill

  hotfixes:
      mm: vmscan: scan anonymous pages on file refaults
      mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
      mm/memcontrol: fix wrong statistics in memory.stat
      mm/z3fold.c: lock z3fold page before  __SetPageMovable()
      nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header
      MAINTAINERS: nilfs2: update email address

  iommu:
      include/linux/dmar.h: replace single-char identifiers in macros

  scripts:
      scripts/decode_stacktrace: match basepath using shell prefix operator, not regex
      scripts/decode_stacktrace: look for modules with .ko.debug extension
      scripts/spelling.txt: drop "sepc" from the misspelling list
      scripts/spelling.txt: add spelling fix for prohibited
      scripts/decode_stacktrace: Accept dash/underscore in modules
      scripts/spelling.txt: add more spellings to spelling.txt

  arch/sh:
      arch/sh/configs/sdk7786_defconfig: remove CONFIG_LOGFS
      sh: config: remove left-over BACKLIGHT_LCD_SUPPORT
      sh: prevent warnings when using iounmap

  ocfs2:
      fs: ocfs: fix spelling mistake "hearbeating" -> "heartbeat"
      ocfs2/dlm: use struct_size() helper
      ocfs2: add last unlock times in locking_state
      ocfs2: add locking filter debugfs file
      ocfs2: add first lock wait time in locking_state
      ocfs: no need to check return value of debugfs_create functions
      fs/ocfs2/dlmglue.c: unneeded variable: "status"
      ocfs2: use kmemdup rather than duplicating its implementation

  mm:slab-generic:
    Patch series "mm/slab: Improved sanity checking":
      mm/slab: validate cache membership under freelist hardening
      mm/slab: sanity-check page type when looking up cache
      lkdtm/heap: add tests for freelist hardening

  mm:slub:
      mm/slub.c: avoid double string traverse in kmem_cache_flags()
      slub: don't panic for memcg kmem cache creation failure

  mm:kmemleak:
      mm/kmemleak.c: fix check for softirq context
      mm/kmemleak.c: change error at _write when kmemleak is disabled
      docs: kmemleak: add more documentation details

  mm:kasan:
      mm/kasan: print frame description for stack bugs
      Patch series "Bitops instrumentation for KASAN", v5:
        lib/test_kasan: add bitops tests
        x86: use static_cpu_has in uaccess region to avoid instrumentation
        asm-generic, x86: add bitops instrumentation for KASAN
      Patch series "mm/kasan: Add object validation in ksize()", v3:
        mm/kasan: introduce __kasan_check_{read,write}
        mm/kasan: change kasan_check_{read,write} to return boolean
        lib/test_kasan: Add test for double-kzfree detection
        mm/slab: refactor common ksize KASAN logic into slab_common.c
        mm/kasan: add object validation in ksize()

  mm:cleanups:
      include/linux/pfn_t.h: remove pfn_t_to_virt()
      Patch series "remove ARCH_SELECT_MEMORY_MODEL where it has no effect":
        arm: remove ARCH_SELECT_MEMORY_MODEL
        s390: remove ARCH_SELECT_MEMORY_MODEL
        sparc: remove ARCH_SELECT_MEMORY_MODEL
      mm/gup.c: make follow_page_mask() static
      mm/memory.c: trivial clean up in insert_page()
      mm: make !CONFIG_HUGE_PAGE wrappers into static inlines
      include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_info
      mm: remove the account_page_dirtied export
      mm/page_isolation.c: change the prototype of undo_isolate_page_range()
      include/linux/vmpressure.h: use spinlock_t instead of struct spinlock
      mm: remove the exporting of totalram_pages
      include/linux/pagemap.h: document trylock_page() return value

  mm:debug:
      mm/failslab.c: by default, do not fail allocations with direct reclaim only
      Patch series "debug_pagealloc improvements":
        mm, debug_pagelloc: use static keys to enable debugging
        mm, page_alloc: more extensive free page checking with debug_pagealloc
        mm, debug_pagealloc: use a page type instead of page_ext flag

  mm:pagecache:
      Patch series "fix filler_t callback type mismatches", v2:
        mm/filemap.c: fix an overly long line in read_cache_page
        mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page
        jffs2: pass the correct prototype to read_cache_page
        9p: pass the correct prototype to read_cache_page
      mm/filemap.c: correct the comment about VM_FAULT_RETRY

  mm:swap:
      mm, swap: fix race between swapoff and some swap operations
      mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()
      mm, swap: use rbtree for swap_extent
      mm/mincore.c: fix race between swapoff and mincore

  mm:memcg:
      memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL
      memcg, fsnotify: no oom-kill for remote memcg charging
      mm, memcg: introduce memory.events.local
      mm: memcontrol: dump memory.stat during cgroup OOM
      Patch series "mm: reparent slab memory on cgroup removal", v7:
        mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
        mm: memcg/slab: rename slab delayed deactivation functions and fields
        mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
        mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()
        mm: memcg/slab: unify SLAB and SLUB page accounting
        mm: memcg/slab: don't check the dying flag on kmem_cache creation
        mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock
        mm: memcg/slab: rework non-root kmem_cache lifecycle management
        mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages
        mm: memcg/slab: reparent memcg kmem_caches on cgroup removal
      mm, memcg: add a memcg_slabinfo debugfs file

  mm:gup:
      Patch series "switch the remaining architectures to use generic GUP", v4:
        mm: use untagged_addr() for get_user_pages_fast addresses
        mm: simplify gup_fast_permitted
        mm: lift the x86_32 PAE version of gup_get_pte to common code
        MIPS: use the generic get_user_pages_fast code
        sh: add the missing pud_page definition
        sh: use the generic get_user_pages_fast code
        sparc64: add the missing pgd_page definition
        sparc64: define untagged_addr()
        sparc64: use the generic get_user_pages_fast code
        mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
        mm: reorder code blocks in gup.c
        mm: consolidate the get_user_pages* implementations
        mm: validate get_user_pages_fast flags
        mm: move the powerpc hugepd code to mm/gup.c
        mm: switch gup_hugepte to use try_get_compound_head
        mm: mark the page referenced in gup_hugepte
      mm/gup: speed up check_and_migrate_cma_pages() on huge page
      mm/gup.c: remove some BUG_ONs from get_gate_page()
      mm/gup.c: mark undo_dev_pagemap as __maybe_unused

  mm:pagemap:
      asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]
      alpha: switch to generic version of pte allocation
      arm: switch to generic version of pte allocation
      arm64: switch to generic version of pte allocation
      csky: switch to generic version of pte allocation
      m68k: sun3: switch to generic version of pte allocation
      mips: switch to generic version of pte allocation
      nds32: switch to generic version of pte allocation
      nios2: switch to generic version of pte allocation
      parisc: switch to generic version of pte allocation
      riscv: switch to generic version of pte allocation
      um: switch to generic version of pte allocation
      unicore32: switch to generic version of pte allocation
      mm/pgtable: drop pgtable_t variable from pte_fn_t functions
      mm/memory.c: fail when offset == num in first check of __vm_map_pages()

  mm:infrastructure:
      mm/mmu_notifier: use hlist_add_head_rcu()

  mm:vmalloc:
      Patch series "Some cleanups for the KVA/vmalloc", v5:
        mm/vmalloc.c: remove "node" argument
        mm/vmalloc.c: preload a CPU with one object for split purpose
        mm/vmalloc.c: get rid of one single unlink_va() when merge
        mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
      mm/vmalloc.c: spelling> s/informaion/information/

  mm:initialization:
      mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
      mm/large system hash: clear hashdist when only one node with memory is booted

  mm:pagealloc:
      arm64: move jump_label_init() before parse_early_param()
      Patch series "add init_on_alloc/init_on_free boot options", v10:
        mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
        mm: init: report memory auto-initialization features at boot time

  mm:vmscan:
      mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
      mm: vmscan: correct some vmscan counters for THP swapout

  mm:tools:
      tools/vm/slabinfo: order command line options
      tools/vm/slabinfo: add partial slab listing to -X
      tools/vm/slabinfo: add option to sort by partial slabs
      tools/vm/slabinfo: add sorting info to help menu

  mm:proc:
      proc: use down_read_killable mmap_sem for /proc/pid/maps
      proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
      proc: use down_read_killable mmap_sem for /proc/pid/pagemap
      proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
      proc: use down_read_killable mmap_sem for /proc/pid/map_files
      mm: use down_read_killable for locking mmap_sem in access_remote_vm
      mm: smaps: split PSS into components
      mm: vmalloc: show number of vmalloc pages in /proc/meminfo

  mm:ras:
      mm/memory-failure.c: clarify error message

  mm:oom-kill:
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
      mm, oom: refactor dump_tasks for memcg OOMs
      mm, oom: remove redundant task_in_mem_cgroup() check
      oom: decouple mems_allowed from oom_unkillable_task
      mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()"

* akpm: (147 commits)
  mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()
  oom: decouple mems_allowed from oom_unkillable_task
  mm, oom: remove redundant task_in_mem_cgroup() check
  mm, oom: refactor dump_tasks for memcg OOMs
  mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
  mm/memory-failure.c: clarify error message
  mm: vmalloc: show number of vmalloc pages in /proc/meminfo
  mm: smaps: split PSS into components
  mm: use down_read_killable for locking mmap_sem in access_remote_vm
  proc: use down_read_killable mmap_sem for /proc/pid/map_files
  proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
  proc: use down_read_killable mmap_sem for /proc/pid/pagemap
  proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
  proc: use down_read_killable mmap_sem for /proc/pid/maps
  tools/vm/slabinfo: add sorting info to help menu
  tools/vm/slabinfo: add option to sort by partial slabs
  tools/vm/slabinfo: add partial slab listing to -X
  tools/vm/slabinfo: order command line options
  mm: vmscan: correct some vmscan counters for THP swapout
  mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
  ...
2019-07-12 11:40:28 -07:00
Aneesh Kumar K.V
9bd3bb6703 mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
Architectures like powerpc use different address range to map ioremap
and vmalloc range.  The memunmap() check used by the nvdimm layer was
wrongly using is_vmalloc_addr() to check for ioremap range which fails
for ppc64.  This result in ppc64 not freeing the ioremap mapping.  The
side effect of this is an unbind failure during module unload with
papr_scm nvdimm driver

Link: http://lkml.kernel.org/r/20190701134038.14165-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Fixes: b5beae5e22 ("powerpc/pseries: Add driver for PAPR SCM regions")
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:40 -07:00
Gustavo A. R. Silva
ed4ed4043a bpf: verifier: avoid fall-through warnings
In preparation to enabling -Wimplicit-fallthrough, this patch silences
the following warning:

kernel/bpf/verifier.c: In function ‘check_return_code’:
kernel/bpf/verifier.c:6106:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
   if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG ||
      ^
kernel/bpf/verifier.c:6109:2: note: here
  case BPF_PROG_TYPE_CGROUP_SKB:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

Notice that is much clearer to explicitly add breaks in each case
statement (that actually contains some code), rather than letting
the code to fall through.

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-12 15:13:20 +02:00
Andrii Nakryiko
b3b50f05dc bpf: fix precision bit propagation for BPF_ST instructions
When backtracking instructions to propagate precision bit for registers
and stack slots, one class of instructions (BPF_ST) weren't handled
causing extra stack slots to be propagated into parent state. Parent
state might not have that much stack allocated, though, which causes
warning on invalid stack slot usage.

This patch adds handling of BPF_ST instructions:

BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32

Reported-by: syzbot+4da3ff23081bafe74fc2@syzkaller.appspotmail.com
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Cc: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-12 14:48:52 +02:00
Vincent Whitchurch
c9dccacfcc printk: Do not lose last line in kmsg buffer dump
kmsg_dump_get_buffer() is supposed to select all the youngest log
messages which fit into the provided buffer.  It determines the correct
start index by using msg_print_text() with a NULL buffer to calculate
the size of each entry.  However, when performing the actual writes,
msg_print_text() only writes the entry to the buffer if the written len
is lesser than the size of the buffer.  So if the lengths of the
selected youngest log messages happen to precisely fill up the provided
buffer, the last log message is not included.

We don't want to modify msg_print_text() to fill up the buffer and start
returning a length which is equal to the size of the buffer, since
callers of its other users, such as kmsg_dump_get_line(), depend upon
the current behaviour.

Instead, fix kmsg_dump_get_buffer() to compensate for this.

For example, with the following two final prints:

[    6.427502] AAAAAAAAAAAAA
[    6.427769] BBBBBBBB12345

A dump of a 64-byte buffer filled by kmsg_dump_get_buffer(), before this
patch:

 00000000: 3c 30 3e 5b 20 20 20 20 36 2e 35 32 32 31 39 37  <0>[    6.522197
 00000010: 5d 20 41 41 41 41 41 41 41 41 41 41 41 41 41 0a  ] AAAAAAAAAAAAA.
 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

After this patch:

 00000000: 3c 30 3e 5b 20 20 20 20 36 2e 34 35 36 36 37 38  <0>[    6.456678
 00000010: 5d 20 42 42 42 42 42 42 42 42 31 32 33 34 35 0a  ] BBBBBBBB12345.
 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

Link: http://lkml.kernel.org/r/20190711142937.4083-1-vincent.whitchurch@axis.com
Fixes: e2ae715d66 ("kmsg - kmsg_dump() use iterator to receive log buffer content")
To: rostedt@goodmis.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org> # v3.5+
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-07-12 14:10:58 +02:00
Linus Torvalds
db0457338e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Jiri Kosina:

 - stacktrace handling improvements from Miroslav benes

 - debug output improvements from Petr Mladek

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Remove duplicate warning about missing reliable stacktrace support
  Revert "livepatch: Remove reliable stacktrace check in klp_try_switch_task()"
  stacktrace: Remove weak version of save_stack_trace_tsk_reliable()
  livepatch: Use static buffer for debugging messages under rq lock
  livepatch: Remove stale kobj_added entries from kernel-doc descriptions
2019-07-11 15:30:05 -07:00
Linus Torvalds
d7fe42a64a Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two small fixes from the timer departement:

   - Prevent the compiler from converting the nanoseconds adjustment
     loop in the VDSO update function to a division (__udivdi3) by using
     the __iter_div_u64_rem() inline function which exists to prevent
     exactly that problem.

   - Fix the wrong argument order of the GENMASK macro in the NPCM timer
     driver"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping/vsyscall: Use __iter_div_u64_rem()
  clocksource/drivers/npcm: Fix misuse of GENMASK macro
2019-07-11 13:52:23 -07:00
Linus Torvalds
02150fab6a Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stacktrace fix from Thomas Gleixner:
 "Fix yet another instance of kernel thread check which ignores that
  kernel threads can call use_mm()"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stacktrace: Use PF_KTHREAD to check for kernel threads
2019-07-11 13:50:44 -07:00
Linus Torvalds
237f83dfbe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Some highlights from this development cycle:

   1) Big refactoring of ipv6 route and neigh handling to support
      nexthop objects configurable as units from userspace. From David
      Ahern.

   2) Convert explored_states in BPF verifier into a hash table,
      significantly decreased state held for programs with bpf2bpf
      calls, from Alexei Starovoitov.

   3) Implement bpf_send_signal() helper, from Yonghong Song.

   4) Various classifier enhancements to mvpp2 driver, from Maxime
      Chevallier.

   5) Add aRFS support to hns3 driver, from Jian Shen.

   6) Fix use after free in inet frags by allocating fqdirs dynamically
      and reworking how rhashtable dismantle occurs, from Eric Dumazet.

   7) Add act_ctinfo packet classifier action, from Kevin
      Darbyshire-Bryant.

   8) Add TFO key backup infrastructure, from Jason Baron.

   9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

  10) Add devlink notifications for flash update status to mlxsw driver,
      from Jiri Pirko.

  11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

  12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

  13) Various enhancements to ipv6 flow label handling, from Eric
      Dumazet and Willem de Bruijn.

  14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
      der Merwe, and others.

  15) Various improvements to axienet driver including converting it to
      phylink, from Robert Hancock.

  16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

  17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
      Radulescu.

  18) Add devlink health reporting to mlx5, from Moshe Shemesh.

  19) Convert stmmac over to phylink, from Jose Abreu.

  20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
      Shalom Toledo.

  21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

  22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

  23) Track spill/fill of constants in BPF verifier, from Alexei
      Starovoitov.

  24) Support bounded loops in BPF, from Alexei Starovoitov.

  25) Various page_pool API fixes and improvements, from Jesper Dangaard
      Brouer.

  26) Just like ipv4, support ref-countless ipv6 route handling. From
      Wei Wang.

  27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

  28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

  29) Add flower GRE encap/decap support to nfp driver, from Pieter
      Jansen van Vuuren.

  30) Protect against stack overflow when using act_mirred, from John
      Hurley.

  31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

  32) Use page_pool API in netsec driver, Ilias Apalodimas.

  33) Add Google gve network driver, from Catherine Sullivan.

  34) More indirect call avoidance, from Paolo Abeni.

  35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

  36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

  37) Add MPLS manipulation actions to TC, from John Hurley.

  38) Add sending a packet to connection tracking from TC actions, and
      then allow flower classifier matching on conntrack state. From
      Paul Blakey.

  39) Netfilter hw offload support, from Pablo Neira Ayuso"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
  net/mlx5e: Return in default case statement in tx_post_resync_params
  mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
  net: dsa: add support for BRIDGE_MROUTER attribute
  pkt_sched: Include const.h
  net: netsec: remove static declaration for netsec_set_tx_de()
  net: netsec: remove superfluous if statement
  netfilter: nf_tables: add hardware offload support
  net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
  net: flow_offload: add flow_block_cb_is_busy() and use it
  net: sched: remove tcf block API
  drivers: net: use flow block API
  net: sched: use flow block API
  net: flow_offload: add flow_block_cb_{priv, incref, decref}()
  net: flow_offload: add list handling functions
  net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
  net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
  net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
  net: flow_offload: add flow_block_cb_setup_simple()
  net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
  net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
  ...
2019-07-11 10:55:49 -07:00
Linus Torvalds
8f6ccf6159 clone3-v5.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSMhhgAKCRCRxhvAZXjc
 or7kAP9VzDcQaK/WoDd2ezh2C7Wh5hNy9z/qJVCa6Tb+N+g1UgEAxbhFUg55uGOA
 JNf7fGar5JF5hBMIXR+NqOi1/sb4swg=
 =ELWo
 -----END PGP SIGNATURE-----

Merge tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull clone3 system call from Christian Brauner:
 "This adds the clone3 syscall which is an extensible successor to clone
  after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
  window for clone(). It cleanly supports all of the flags from clone()
  and thus all legacy workloads.

  There are few user visible differences between clone3 and clone.
  First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
  this flag. Second, the CSIGNAL flag is deprecated and will cause
  EINVAL to be reported. It is superseeded by a dedicated "exit_signal"
  argument in struct clone_args thus freeing up even more flags. And
  third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
  clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
  argument.

  The clone3 uapi is designed to be easy to handle on 32- and 64 bit:

    /* uapi */
    struct clone_args {
            __aligned_u64 flags;
            __aligned_u64 pidfd;
            __aligned_u64 child_tid;
            __aligned_u64 parent_tid;
            __aligned_u64 exit_signal;
            __aligned_u64 stack;
            __aligned_u64 stack_size;
            __aligned_u64 tls;
    };

  and a separate kernel struct is used that uses proper kernel typing:

    /* kernel internal */
    struct kernel_clone_args {
            u64 flags;
            int __user *pidfd;
            int __user *child_tid;
            int __user *parent_tid;
            int exit_signal;
            unsigned long stack;
            unsigned long stack_size;
            unsigned long tls;
    };

  The system call comes with a size argument which enables the kernel to
  detect what version of clone_args userspace is passing in. clone3
  validates that any additional bytes a given kernel does not know about
  are set to zero and that the size never exceeds a page.

  A nice feature is that this patchset allowed us to cleanup and
  simplify various core kernel codepaths in kernel/fork.c by making the
  internal _do_fork() function take struct kernel_clone_args even for
  legacy clone().

  This patch also unblocks the time namespace patchset which wants to
  introduce a new CLONE_TIMENS flag.

  Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
  xtensa. These were the architectures that did not require special
  massaging.

  Other architectures treat fork-like system calls individually and
  after some back and forth neither Arnd nor I felt confident that we
  dared to add clone3 unconditionally to all architectures. We agreed to
  leave this up to individual architecture maintainers. This is why
  there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
  which any architecture can set once it has implemented support for
  clone3. The patch also adds a cond_syscall(clone3) for architectures
  such as nios2 or h8300 that generate their syscall table by simply
  including asm-generic/unistd.h. The hope is to get rid of
  __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"

* tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: handle arches who do not yet define clone3
  arch: wire-up clone3() syscall
  fork: add clone3
2019-07-11 10:09:44 -07:00
Linus Torvalds
5450e8a316 pidfd-updates-v5.3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXSMhUgAKCRCRxhvAZXjc
 okkiAQC3Hlg/O2JoIb4PqgEvBkpHSdVxyuWagn0ksjACW9ANKQEAl5OadMhvOq16
 UHGhKlpE/M8HflknIffoEGlIAWHrdwU=
 =7kP5
 -----END PGP SIGNATURE-----

Merge tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd updates from Christian Brauner:
 "This adds two main features.

   - First, it adds polling support for pidfds. This allows process
     managers to know when a (non-parent) process dies in a race-free
     way.

     The notification mechanism used follows the same logic that is
     currently used when the parent of a task is notified of a child's
     death. With this patchset it is possible to put pidfds in an
     {e}poll loop and get reliable notifications for process (i.e.
     thread-group) exit.

   - The second feature compliments the first one by making it possible
     to retrieve pollable pidfds for processes that were not created
     using CLONE_PIDFD.

     A lot of processes get created with traditional PID-based calls
     such as fork() or clone() (without CLONE_PIDFD). For these
     processes a caller can currently not create a pollable pidfd. This
     is a problem for Android's low memory killer (LMK) and service
     managers such as systemd.

  Both patchsets are accompanied by selftests.

  It's perhaps worth noting that the work done so far and the work done
  in this branch for pidfd_open() and polling support do already see
  some adoption:

   - Android is in the process of backporting this work to all their LTS
     kernels [1]

   - Service managers make use of pidfd_send_signal but will need to
     wait until we enable waiting on pidfds for full adoption.

   - And projects I maintain make use of both pidfd_send_signal and
     CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

[1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

[2] aab6e3eb73/src/lxc/start.c (L1753)

* tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add pidfd_open() tests
  arch: wire-up pidfd_open()
  pid: add pidfd_open()
  pidfd: add polling selftests
  pidfd: add polling support
2019-07-10 22:17:21 -07:00
Arnd Bergmann
0df1c9868c timekeeping/vsyscall: Use __iter_div_u64_rem()
On 32-bit x86 when building with clang-9, the 'division' loop gets turned
back into an inefficient division that causes a link error:

kernel/time/vsyscall.o: In function `update_vsyscall':
vsyscall.c:(.text+0xe3): undefined reference to `__udivdi3'

Use the existing __iter_div_u64_rem() function which is used to address the
same issue in other places.

Fixes: 44f57d788e ("timekeeping: Provide a generic update_vsyscall() implementation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://lkml.kernel.org/r/20190710130206.1670830-1-arnd@arndb.de
2019-07-10 20:37:49 +02:00
Linus Torvalds
e9a83bd232 It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro.  These create more
    than the usual number of simple but annoying merge conflicts with other
    trees, unfortunately.  He has a lot more of these waiting on the wings
    that, I think, will go to you directly later on.
 
  - A new document on how to use merges and rebases in kernel repos, and one
    on Spectre vulnerabilities.
 
  - Various improvements to the build system, including automatic markup of
    function() references because some people, for reasons I will never
    understand, were of the opinion that :c:func:``function()`` is
    unattractive and not fun to type.
 
  - We now recommend using sphinx 1.7, but still support back to 1.4.
 
  - Lots of smaller improvements, warning fixes, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
 gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
 raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
 3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
 DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
 dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
 =D0eO
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting on
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
2019-07-09 12:34:26 -07:00
Linus Torvalds
608745f124 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main changes in this cycle on the kernel side were:

   - CPU PMU and uncore driver updates to Intel Snow Ridge, IceLake,
     KabyLake, AmberLake and WhiskeyLake CPUs.

   - Rework the MSR probing infrastructure to make it more robust, make
     it work better on virtualized systems and to better expose it on
     sysfs.

   - Rework PMU attributes group support based on the feedback from
     Greg. The core sysfs patch that adds sysfs_update_groups() was
     acked by Greg.

  There's a lot of perf tooling changes as well, all around the place:

   - vendor updates to Intel, cs-etm (ARM), ARM64, s390,

   - various enhancements to Intel PT tooling support:
      - Improve CBR (Core to Bus Ratio) packets support.
      - Export power and ptwrite events to sqlite and postgresql.
      - Add support for decoding PEBS via PT packets.
      - Add support for samples to contain IPC ratio, collecting cycles
        information from CYC packets, showing the IPC info periodically
      - Allow using time ranges

   - lots of updates to perf pmu, perf stat, perf trace, eBPF support,
     perf record, perf diff, etc. - please see the shortlog and Git log
     for details"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (252 commits)
  tools arch x86: Sync asm/cpufeatures.h with the with the kernel
  tools build: Check if gettid() is available before providing helper
  perf jvmti: Address gcc string overflow warning for strncpy()
  perf python: Remove -fstack-protector-strong if clang doesn't have it
  perf annotate TUI browser: Do not use member from variable within its own initialization
  perf tests: Fix record+probe_libc_inet_pton.sh for powerpc64
  perf evsel: Do not rely on errno values for precise_ip fallback
  perf thread: Allow references to thread objects after machine__exit()
  perf header: Assign proper ff->ph in perf_event__synthesize_features()
  tools arch kvm: Sync kvm headers with the kernel sources
  perf script: Allow specifying the files to process guest samples
  perf tools metric: Don't include duration_time in group
  perf list: Avoid extra : for --raw metrics
  perf vendor events intel: Metric fixes for SKX/CLX
  perf tools: Fix typos / broken sentences
  perf jevents: Add support for Hisi hip08 L3C PMU aliasing
  perf jevents: Add support for Hisi hip08 HHA PMU aliasing
  perf jevents: Add support for Hisi hip08 DDRC PMU aliasing
  perf pmu: Support more complex PMU event aliasing
  perf diff: Documentation -c cycles option
  ...
2019-07-09 11:15:52 -07:00
Linus Torvalds
3b99107f0e for-5.3/block-20190708
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0jrIMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptlFD/9CNsBX+Aap2lO6wKNr6QISwNAK76GMzEay
 s4LSY2kGkXvzv8i89mCuY+8UVNI8WH2/22WnU+8CBAJOjWyFQMsIwH/mrq0oZWRD
 J6STJE8rTr6Fc2MvJUWryp/xdBh3+eDIsAdIZVHVAkIzqYPBnpIAwEIeIw8t0xsm
 v9ngpQ3WD6ep8tOj9pnG1DGKFg1CmukZCC/Y4CQV1vZtmm2I935zUwNV/TB+Egfx
 G8JSC0cSV02LMK88HCnA6MnC/XSUC0qgfXbnmP+TpKlgjVX+P/fuB3oIYcZEu2Rk
 3YBpIkhsQytKYbF42KRLsmBH72u6oB9G+tNZTgB1STUDrZqdtD9xwX1rjDlY0ZzP
 EUDnk48jl/cxbs+VZrHoE2TcNonLiymV7Kb92juHXdIYmKFQStprGcQUbMaTkMfB
 6BYrYLifWx0leu1JJ1i7qhNmug94BYCSCxcRmH0p6kPazPcY9LXNmDWMfMuBPZT7
 z79VLZnHF2wNXJyT1cBluwRYYJRT4osWZ3XUaBWFKDgf1qyvXJfrN/4zmgkEIyW7
 ivXC+KLlGkhntDlWo2pLKbbyOIKY1HmU6aROaI11k5Zyh0ixKB7tHKavK39l+NOo
 YB41+4l6VEpQEyxyRk8tO0sbHpKaKB+evVIK3tTwbY+Q0qTExErxjfWUtOgRWhjx
 iXJssPRo4w==
 =VSYT
 -----END PGP SIGNATURE-----

Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "This is the main block updates for 5.3. Nothing earth shattering or
  major in here, just fixes, additions, and improvements all over the
  map. This contains:

   - Series of documentation fixes (Bart)

   - Optimization of the blk-mq ctx get/put (Bart)

   - null_blk removal race condition fix (Bob)

   - req/bio_op() cleanups (Chaitanya)

   - Series cleaning up the segment accounting, and request/bio mapping
     (Christoph)

   - Series cleaning up the page getting/putting for bios (Christoph)

   - block cgroup cleanups and moving it to where it is used (Christoph)

   - block cgroup fixes (Tejun)

   - Series of fixes and improvements to bcache, most notably a write
     deadlock fix (Coly)

   - blk-iolatency STS_AGAIN and accounting fixes (Dennis)

   - Series of improvements and fixes to BFQ (Douglas, Paolo)

   - debugfs_create() return value check removal for drbd (Greg)

   - Use struct_size(), where appropriate (Gustavo)

   - Two lighnvm fixes (Heiner, Geert)

   - MD fixes, including a read balance and corruption fix (Guoqing,
     Marcos, Xiao, Yufen)

   - block opal shadow mbr additions (Jonas, Revanth)

   - sbitmap compare-and-exhange improvemnts (Pavel)

   - Fix for potential bio->bi_size overflow (Ming)

   - NVMe pull requests:
       - improved PCIe suspent support (Keith Busch)
       - error injection support for the admin queue (Akinobu Mita)
       - Fibre Channel discovery improvements (James Smart)
       - tracing improvements including nvmetc tracing support (Minwoo Im)
       - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
         Kulkarni)"

   - Various little fixes and improvements to drivers and core"

* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
  blk-iolatency: fix STS_AGAIN handling
  block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
  blk-mq: simplify blk_mq_make_request()
  blk-mq: remove blk_mq_put_ctx()
  sbitmap: Replace cmpxchg with xchg
  block: fix .bi_size overflow
  block: sed-opal: check size of shadow mbr
  block: sed-opal: ioctl for writing to shadow mbr
  block: sed-opal: add ioctl for done-mark of shadow mbr
  block: never take page references for ITER_BVEC
  direct-io: use bio_release_pages in dio_bio_complete
  block_dev: use bio_release_pages in bio_unmap_user
  block_dev: use bio_release_pages in blkdev_bio_end_io
  iomap: use bio_release_pages in iomap_dio_bio_end_io
  block: use bio_release_pages in bio_map_user_iov
  block: use bio_release_pages in bio_unmap_user
  block: optionally mark pages dirty in bio_release_pages
  block: move the BIO_NO_PAGE_REF check into bio_release_pages
  block: skd_main.c: Remove call to memset after dma_alloc_coherent
  block: mtip32xx: Remove call to memset after dma_alloc_coherent
  ...
2019-07-09 10:45:06 -07:00
Linus Torvalds
cf2d213e49 Power management updates for 5.3-rc1
- Improve the handling of shared ACPI power resources in the PCI
    bus type layer (Mika Westerberg).
 
  - Make the PCI layer take link delays required by the PCIe spec
    into account as appropriate and avoid polling devices in D3cold
    for PME (Mika Westerberg).
 
  - Fix some corner case issues in ACPI device power management and
    in the PCI bus type layer, optimiza and clean up the handling of
    runtime-suspended PCI devices during system-wide transitions to
    sleep states (Rafael Wysocki).
 
  - Rework hibernation handling in the ACPI core and the PCI bus type
    to resume runtime-suspended devices before hibernation (which
    allows some functional problems to be avoided) and fix some ACPI
    power management issues related to hiberation (Rafael Wysocki).
 
  - Extend the operating performance points (OPP) framework to support
    a wider range of devices (Rajendra Nayak, Stehpen Boyd).
 
  - Fix issues related to genpd_virt_devs and issues with platforms
    using the set_opp() callback in the OPP framework (Viresh Kumar,
    Dmitry Osipenko).
 
  - Add new cpufreq driver for Raspberry Pi (Nicolas Saenz Julienne).
 
  - Add new cpufreq driver for imx8m and imx7d chips (Leonard Crestez).
 
  - Fix and clean up the pcc-cpufreq, brcmstb-avs-cpufreq, s5pv210,
    and armada-37xx cpufreq drivers (David Arcari, Florian Fainelli,
    Paweł Chmiel, YueHaibing).
 
  - Clean up and fix the cpufreq core (Viresh Kumar, Daniel Lezcano).
 
  - Fix minor issue in the ACPI system sleep support code and export
    one function from it (Lenny Szubowicz, Dexuan Cui).
 
  - Clean up assorted pieces of PM code and documentation (Kefeng Wang,
    Andy Shevchenko, Bart Van Assche, Greg Kroah-Hartman, Fuqian Huang,
    Geert Uytterhoeven, Mathieu Malaterre, Rafael Wysocki).
 
  - Update the pm-graph utility to v5.4 (Todd Brandt).
 
  - Fix and clean up the cpupower utility (Abhishek Goel, Nick Black).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl0jK18SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxgEAP/RbPe71Y9ufKu64L3EtgV6mS9iuLEhux
 /Ad9laLNeM1b0oceT3QxGk7xQCacnZcBlcaqXVWI4NRsn4RBZp1cYZngpgJ9DP6E
 ONr8hzyzDOMVReba3XJIF8H+WoTKjywMYtFutjdx6dRe2ZJLutqnuZ0JbH1YSSK7
 IxOt0mJVALf2M4Zz7F17d+n3yGE/4xAPBVbj/rBRcTEsGYlR/Hoxs7iF6EBau7fy
 R5drUH6XSrWk8adc+z7l3BTGqMMYj9deRSfAWB3wpM4YK7Fv7msX/amBoGINkdn6
 xP/ZcrHvhKKzE89MS8OUGP4rGVwq+7tu6mktnYL/tpKgutJqqx5LVvrLsGDSWr+W
 /aJExN8Eb4Jh98C6vog3XUJoqBxkVGbU8qoCBU3jlFsaznFEWjW9IKhBHs5CIaqz
 MXte6AsJ8lvFzxILjvx0m2206wNpRJRXYLX3a/BSBxa4OgOESjIpBTmbPfOwbxwj
 8z9hIVlDTTDtnF6BEyDQr1fjPi3Mxl7ibGnoqRrJm36VKBy9VZNNwG/0Y2oSvm6k
 Es8CiTWA3A/46dCZxGr18/9Vbfxn1Yvg9QZ1lCE5Fqij+0F2yRApbHZgUro+1rji
 6J8OWC5r5JccdKuGHh4RH/asMFhD0cAR/gUsRzS4dz/wz2jVIN1FstdD1aKN5GBy
 d0lchx/AKR5H
 =aBN3
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These update PCI and ACPI power management (improved handling of ACPI
  power resources and PCIe link delays, fixes related to corner cases,
  hibernation handling rework), fix and extend the operating performance
  points (OPP) framework, add new cpufreq drivers for Raspberry Pi and
  imx8m chips, update some other cpufreq drivers, clean up assorted
  pieces of PM code and documentation and update tools.

  Specifics:

   - Improve the handling of shared ACPI power resources in the PCI bus
     type layer (Mika Westerberg).

   - Make the PCI layer take link delays required by the PCIe spec into
     account as appropriate and avoid polling devices in D3cold for PME
     (Mika Westerberg).

   - Fix some corner case issues in ACPI device power management and in
     the PCI bus type layer, optimiza and clean up the handling of
     runtime-suspended PCI devices during system-wide transitions to
     sleep states (Rafael Wysocki).

   - Rework hibernation handling in the ACPI core and the PCI bus type
     to resume runtime-suspended devices before hibernation (which
     allows some functional problems to be avoided) and fix some ACPI
     power management issues related to hiberation (Rafael Wysocki).

   - Extend the operating performance points (OPP) framework to support
     a wider range of devices (Rajendra Nayak, Stehpen Boyd).

   - Fix issues related to genpd_virt_devs and issues with platforms
     using the set_opp() callback in the OPP framework (Viresh Kumar,
     Dmitry Osipenko).

   - Add new cpufreq driver for Raspberry Pi (Nicolas Saenz Julienne).

   - Add new cpufreq driver for imx8m and imx7d chips (Leonard Crestez).

   - Fix and clean up the pcc-cpufreq, brcmstb-avs-cpufreq, s5pv210, and
     armada-37xx cpufreq drivers (David Arcari, Florian Fainelli, Paweł
     Chmiel, YueHaibing).

   - Clean up and fix the cpufreq core (Viresh Kumar, Daniel Lezcano).

   - Fix minor issue in the ACPI system sleep support code and export
     one function from it (Lenny Szubowicz, Dexuan Cui).

   - Clean up assorted pieces of PM code and documentation (Kefeng Wang,
     Andy Shevchenko, Bart Van Assche, Greg Kroah-Hartman, Fuqian Huang,
     Geert Uytterhoeven, Mathieu Malaterre, Rafael Wysocki).

   - Update the pm-graph utility to v5.4 (Todd Brandt).

   - Fix and clean up the cpupower utility (Abhishek Goel, Nick Black)"

* tag 'pm-5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (57 commits)
  ACPI: PM: Make acpi_sleep_state_supported() non-static
  PM: sleep: Drop dev_pm_skip_next_resume_phases()
  ACPI: PM: Unexport acpi_device_get_power()
  Documentation: ABI: power: Add missing newline at end of file
  ACPI: PM: Drop unused function and function header
  ACPI: PM: Introduce "poweroff" callbacks for ACPI PM domain and LPSS
  ACPI: PM: Simplify and fix PM domain hibernation callbacks
  PCI: PM: Simplify bus-level hibernation callbacks
  PM: ACPI/PCI: Resume all devices during hibernation
  cpufreq: Avoid calling cpufreq_verify_current_freq() from handle_update()
  cpufreq: Consolidate cpufreq_update_current_freq() and __cpufreq_get()
  kernel: power: swap: use kzalloc() instead of kmalloc() followed by memset()
  cpufreq: Don't skip frequency validation for has_target() drivers
  PCI: PM/ACPI: Refresh all stale power state data in pci_pm_complete()
  PCI / ACPI: Add _PR0 dependent devices
  ACPI / PM: Introduce concept of a _PR0 dependent device
  PCI / ACPI: Use cached ACPI device state to get PCI device power state
  ACPI: PM: Allow transitions to D0 to occur in special cases
  ACPI: PM: Avoid evaluating _PS3 on transitions from D3hot to D3cold
  cpufreq: Use has_target() instead of !setpolicy
  ...
2019-07-09 10:05:22 -07:00
Josh Poimboeuf
e55a73251d bpf: Fix ORC unwinding in non-JIT BPF code
Objtool previously ignored ___bpf_prog_run() because it didn't understand
the jump table.  This resulted in the ORC unwinder not being able to unwind
through non-JIT BPF code.

Now that objtool knows how to read jump tables, remove the whitelist and
annotate the jump table so objtool can recognize it.

Also add an additional "const" to the jump table definition to clarify that
the text pointers are constant.  Otherwise GCC sets the section writable
flag and the assembler spits out warnings.

Fixes: d15d356887 ("perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER")
Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kairui Song <kasong@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lkml.kernel.org/r/881939122b88f32be4c374d248c09d7527a87e35.1561685471.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-09 13:55:57 +02:00
Linus Torvalds
5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Linus Torvalds
92c1d65221 Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Documentation updates and the addition of cgroup_parse_float() which
  will be used by new controllers including blk-iocost"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: convert docs to ReST and rename to *.rst
  cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
  cgroup: add cgroup_parse_float()
2019-07-08 21:35:12 -07:00
Linus Torvalds
df2a40f549 Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Just a couple cleanup patches.  No functional changes."

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Remove GPF argument from alloc_workqueue_attrs()
  workqueue: Make alloc/apply/free_workqueue_attrs() static
2019-07-08 21:10:01 -07:00
Linus Torvalds
8b68150883 Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
 "Bug fixes, code clean up, and new features:

   - IMA policy rules can be defined in terms of LSM labels, making the
     IMA policy dependent on LSM policy label changes, in particular LSM
     label deletions. The new environment, in which IMA-appraisal is
     being used, frequently updates the LSM policy and permits LSM label
     deletions.

   - Prevent an mmap'ed shared file opened for write from also being
     mmap'ed execute. In the long term, making this and other similar
     changes at the VFS layer would be preferable.

   - The IMA per policy rule template format support is needed for a
     couple of new/proposed features (eg. kexec boot command line
     measurement, appended signatures, and VFS provided file hashes).

   - Other than the "boot-aggregate" record in the IMA measuremeent
     list, all other measurements are of file data. Measuring and
     storing the kexec boot command line in the IMA measurement list is
     the first buffer based measurement included in the measurement
     list"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: Introduce struct evm_xattr
  ima: Update MAX_TEMPLATE_NAME_LEN to fit largest reasonable definition
  KEXEC: Call ima_kexec_cmdline to measure the boot command line args
  IMA: Define a new template field buf
  IMA: Define a new hook to measure the kexec boot command line arguments
  IMA: support for per policy rule template formats
  integrity: Fix __integrity_init_keyring() section mismatch
  ima: Use designated initializers for struct ima_event_data
  ima: use the lsm policy update notifier
  LSM: switch to blocking policy update notifiers
  x86/ima: fix the Kconfig dependency for IMA_ARCH_POLICY
  ima: Make arch_policy_entry static
  ima: prevent a file already mmap'ed write to be mmap'ed execute
  x86/ima: check EFI SetupMode too
2019-07-08 20:28:59 -07:00
David S. Miller
af144a9834 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two cases of overlapping changes, nothing fancy.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:48:57 -07:00
Linus Torvalds
c84ca912b0 Keyrings namespacing
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRU89Pu3V2unywtrAQIdBBAAmMBsrfv+LUN4Vru/D6KdUO4zdYGcNK6m
 S56bcNfP6oIDEj6HrNNnzKkWIZpdZ61Odv1zle96+v4WZ/6rnLCTpcsdaFNTzaoO
 YT2jk7jplss0ImrMv1DSoykGqO3f0ThMIpGCxHKZADGSu0HMbjSEh+zLPV4BaMtT
 BVuF7P3eZtDRLdDtMtYcgvf5UlbdoBEY8w1FUjReQx8hKGxVopGmCo5vAeiY8W9S
 ybFSZhPS5ka33ynVrLJH2dqDo5A8pDhY8I4bdlcxmNtRhnPCYZnuvTqeAzyUKKdI
 YN9zJeDu1yHs9mi8dp45NPJiKy6xLzWmUwqH8AvR8MWEkrwzqbzNZCEHZ41j74hO
 YZWI0JXi72cboszFvOwqJERvITKxrQQyVQLPRQE2vVbG0bIZPl8i7oslFVhitsl+
 evWqHb4lXY91rI9cC6JIXR1OiUjp68zXPv7DAnxv08O+PGcioU1IeOvPivx8QSx4
 5aUeCkYIIAti/GISzv7xvcYh8mfO76kBjZSB35fX+R9DkeQpxsHmmpWe+UCykzWn
 EwhHQn86+VeBFP6RAXp8CgNCLbrwkEhjzXQl/70s1eYbwvK81VcpDAQ6+cjpf4Hb
 QUmrUJ9iE0wCNl7oqvJZoJvWVGlArvPmzpkTJk3N070X2R0T7x1WCsMlPDMJGhQ2
 fVHvA3QdgWs=
 =Push
 -----END PGP SIGNATURE-----

Merge tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring namespacing from David Howells:
 "These patches help make keys and keyrings more namespace aware.

  Firstly some miscellaneous patches to make the process easier:

   - Simplify key index_key handling so that the word-sized chunks
     assoc_array requires don't have to be shifted about, making it
     easier to add more bits into the key.

   - Cache the hash value in the key so that we don't have to calculate
     on every key we examine during a search (it involves a bunch of
     multiplications).

   - Allow keying_search() to search non-recursively.

  Then the main patches:

   - Make it so that keyring names are per-user_namespace from the point
     of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
     accessible cross-user_namespace.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

   - Move the user and user-session keyrings to the user_namespace
     rather than the user_struct. This prevents them propagating
     directly across user_namespaces boundaries (ie. the KEY_SPEC_*
     flags will only pick from the current user_namespace).

   - Make it possible to include the target namespace in which the key
     shall operate in the index_key. This will allow the possibility of
     multiple keys with the same description, but different target
     domains to be held in the same keyring.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

   - Make it so that keys are implicitly invalidated by removal of a
     domain tag, causing them to be garbage collected.

   - Institute a network namespace domain tag that allows keys to be
     differentiated by the network namespace in which they operate. New
     keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
     the network domain in force when they are created.

   - Make it so that the desired network namespace can be handed down
     into the request_key() mechanism. This allows AFS, NFS, etc. to
     request keys specific to the network namespace of the superblock.

     This also means that the keys in the DNS record cache are
     thenceforth namespaced, provided network filesystems pass the
     appropriate network namespace down into dns_query().

     For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
     cache keyrings, such as idmapper keyrings, also need to set the
     domain tag - for which they need access to the network namespace of
     the superblock"

* tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Pass the network namespace into request_key mechanism
  keys: Network namespace domain tag
  keys: Garbage collect keys for which the domain has been removed
  keys: Include target namespace in match criteria
  keys: Move the user and user-session keyrings to the user_namespace
  keys: Namespace keyring names
  keys: Add a 'recurse' flag for keyring searches
  keys: Cache the hash value to avoid lots of recalculation
  keys: Simplify key description management
2019-07-08 19:36:47 -07:00
Linus Torvalds
c236b6dd48 request_key improvements
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRPObfu3V2unywtrAQJLKA//WENO5pZDHe49T+4GCY0ZmnGHKBUnU7g9
 DUjxSNS8a/nwCyEdApZk9uHp2xsOedP6pjQ4VRWMQfrIPx0Yh9o3J+BQxvyP7PDf
 jEH+5CYC8dZnJJjjteWCcPEGrUoNb1YKfDRBU745YY+rLdHWvhHc27B6SYBg5BGT
 OwW3qyHvp0WMp7TehMALdnkqGph5gR5QMr45tOrH6DkGAhN8mAIKD699d3MqZG73
 +S5KlQOlDlEVrxbD/BgzlzEJQUBQyq8hd61taBFT7LXBNlLJJOnMhd7UJY5IJE7J
 Vi9NpcLj4Emwv4wvZ2xneV0rMbsCbxRMKZLDRuqQ6Tm17xjpjro4n1ujneTAqmmy
 d+XlrVQ2ZMciMNmGleezOoBib9QbY5NWdilc2ls5ydFGiBVL73bIOYtEQNai8lWd
 LBBIIrxOmLO7bnipgqVKRnqeMdMkpWaLISoRfSeJbRt4lGxmka9bDBrSgONnxzJK
 JG+sB8ahSVZaBbhERW8DKnBz61Yf8ka7ijVvjH3zCXu0rbLTy+LLUz5kbzbBP9Fc
 LiUapLV/v420gD2ZRCgPQwtQui4TpBkSGJKS1Ippyn7LGBNCZLM4Y8vOoo4nqr7z
 RhpEKbKeOdVjORaYjO8Zttj8gN9rT6WnPcyCTHdNEnyjotU1ykyVBkzexj+VYvjM
 C3eIdjG7Jk0=
 =c2FO
 -----END PGP SIGNATURE-----

Merge tag 'keys-request-20190626' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull request_key improvements from David Howells:
 "These are all request_key()-related, including a fix and some improvements:

   - Fix the lack of a Link permission check on a key found by
     request_key(), thereby enabling request_key() to link keys that
     don't grant this permission to the target keyring (which must still
     grant Write permission).

     Note that the key must be in the caller's keyrings already to be
     found.

   - Invalidate used request_key authentication keys rather than
     revoking them, so that they get cleaned up immediately rather than
     hanging around till the expiry time is passed.

   - Move the RCU locks outwards from the keyring search functions so
     that a request_key_rcu() can be provided. This can be called in RCU
     mode, so it can't sleep and can't upcall - but it can be called
     from LOOKUP_RCU pathwalk mode.

   - Cache the latest positive result of request_key*() temporarily in
     task_struct so that filesystems that make a lot of request_key()
     calls during pathwalk can take advantage of it to avoid having to
     redo the searching. This requires CONFIG_KEYS_REQUEST_CACHE=y.

     It is assumed that the key just found is likely to be used multiple
     times in each step in an RCU pathwalk, and is likely to be reused
     for the next step too.

     Note that the cleanup of the cache is done on TIF_NOTIFY_RESUME,
     just before userspace resumes, and on exit"

* tag 'keys-request-20190626' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Kill off request_key_async{,_with_auxdata}
  keys: Cache result of request_key*() temporarily in task_struct
  keys: Provide request_key_rcu()
  keys: Move the RCU locks outwards from the keyring search functions
  keys: Invalidate used request_key authentication keys
  keys: Fix request_key() lack of Link perm check on found key
2019-07-08 19:19:37 -07:00
Linus Torvalds
d44a62742d Keyrings miscellany
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXQo23fu3V2unywtrAQJghA/+Oi2W9tSfz67zMupYiqa71x5Zg5XlUVIz
 RJxSIwYhE4bhGwodTmqgRlT6f64Gbgt0K8YapGUIbtV/T6d1w02oEmt0V9vad9Zi
 wTH79hH5QKNvewUDhrWODsWhtOBWu1sGt9OozI+c65lsvTpHY4Ox7zIl4DtfBdNK
 nLUxl82h7EHF9H4TtIKxfKlLkIkmt7NRbK3z1eUP+IG/7MBzoyXgXo/gvoHUCOMR
 lhGxttZfxYdZuR9JoR2FBckvKulgafbwjoUc69EDfr8a8IZZrpaUuSTvSPbCfzj1
 j0yXfoowiWvsI1lFFBHeE0BfteJRQ9O2Pkwh1Z9M6v4zjwNNprDOw9a3VroeSgS/
 OWJyHNjeNLDMMZDm1YYCYs0B416q+lZtdAoE/nhR/lGZlBfKTyAa6Cfo4r0RBpYb
 zAxk6K4HcLBL0dkxkTXkxUJPnoDts5bMEL3YuZeVWd7Ef5s5GHW34JI+CFrMR29s
 fC9W+ZEZ74fVo2goPz2ekeiSyp28TkWusXxUCk07g0BsXQzB7v5XXUGtU9hAJ6pe
 aMBfLwAvQkkGi56CPnGWn6WlZ+AgxbRqnlYWpWf0q+PLiuyo4OeRZzhn6AdNQcCR
 2QsTBILOvZbhjEki84ZfsuLLq2k79C2xluEd9JlSAvx5/D93xjMB2qVzR1M6DbdA
 +u1nS8Z6WHA=
 =Oy7N
 -----END PGP SIGNATURE-----

Merge tag 'keys-misc-20190619' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull misc keyring updates from David Howells:
 "These are some miscellaneous keyrings fixes and improvements:

   - Fix a bunch of warnings from sparse, including missing RCU bits and
     kdoc-function argument mismatches

   - Implement a keyctl to allow a key to be moved from one keyring to
     another, with the option of prohibiting key replacement in the
     destination keyring.

   - Grant Link permission to possessors of request_key_auth tokens so
     that upcall servicing daemons can more easily arrange things such
     that only the necessary auth key is passed to the actual service
     program, and not all the auth keys a daemon might possesss.

   - Improvement in lookup_user_key().

   - Implement a keyctl to allow keyrings subsystem capabilities to be
     queried.

  The keyutils next branch has commits to make available, document and
  test the move-key and capabilities code:

        https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log

  They're currently on the 'next' branch"

* tag 'keys-misc-20190619' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Add capability-checking keyctl function
  keys: Reuse keyring_index_key::desc_len in lookup_user_key()
  keys: Grant Link permission to possessers of request_key auth keys
  keys: Add a keyctl to move a key between keyrings
  keys: Hoist locking out of __key_link_begin()
  keys: Break bits out of key_unlink()
  keys: Change keyring_serialise_link_sem to a mutex
  keys: sparse: Fix kdoc mismatches
  keys: sparse: Fix incorrect RCU accesses
  keys: sparse: Fix key_fs[ug]id_changed()
2019-07-08 19:02:11 -07:00
Linus Torvalds
61fc5771f5 audit/stable-5.3 PR 20190702
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl0bgNYUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXONcRAAqpeGVh3/eU5bmGeiOWZJ5TREx0Qf
 4M8Z3CElxtbPF4nz1nARUbH424zF91AOa0B4JVO8BFCgxWN5M3dDOLjqLLfJkfbE
 mQMmiPoua1qXTMRi/9S+3kNFYO4IL/sFFiiqY6XVcW6xIUzp3rLwEjcHC/deszP7
 /e8IqLUFAqj853W0k7qyLMRFEQVBzrABgtiSX+X06sCB8OmAVxhpevSRR1lmmfEu
 sjwuAvxexVlmojwI6HkoANyRzqJRX6y7sMGSbr10I/T9YJTk4VPfeFwSS3qBsf15
 z9gTbvFrRcXKoA9U8iG45K0lUinka9OuGxJD/AxuJv+ncyJjWqX+aokvzeo7Wmv6
 sbAyD+ikl9kxvE+sZ3l9yZEVHjFIbjmZY/gzG+ZZD2EEwKBuaQBN5mmSjrUkySJk
 sbF+oBABLptitJIa/cZJ5QHeAPR1NBqSXKhnhG26IR8iwQqpZhefa8yXpF/x3Tn8
 FckvY+YpIakOAMQ/ezVvFaaEELieiRZqqI/ShrochJzwRXHnnbCTPRtNb9NyjOeU
 DZCBASPhrYfBJz3n0fZR2HCnpMZwCSGBgmVn3jmh3YyxKnILdQ4DxKgJCv730jwh
 9T1+1g2/MW554Gted7KLlkE+aj+BzORx6XJ9H8SKmYB85NF5KnnJMiVktjfl4Jr4
 A8meV9KGwAcyBOU=
 =8HBN
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "This pull request is a bit early, but with some vacation time coming
  up I wanted to send this out now just in case the remote Internet Gods
  decide not to smile on me once the merge window opens. The patchset
  for v5.3 is pretty minor this time, the highlights include:

   - When the audit daemon is sent a signal, ensure we deliver
     information about the sender even when syscall auditing is not
     enabled/supported.

   - Add the ability to filter audit records based on network address
     family.

   - Tighten the audit field filtering restrictions on string based
     fields.

   - Cleanup the audit field filtering verification code.

   - Remove a few BUG() calls from the audit code"

* tag 'audit-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: remove the BUG() calls in the audit rule comparison functions
  audit: enforce op for string fields
  audit: add saddr_fam filter field
  audit: re-structure audit field valid checks
  audit: deliver signal_info regarless of syscall
2019-07-08 18:55:42 -07:00
Masahiro Yamada
7199ff7d74 kheaders: include only headers into kheaders_data.tar.xz
Currently, kheaders_data.tar.xz contains some build scripts as well as
headers. None of them is needed in the header archive.

For ARCH=x86, this commit excludes the following from the archive:

  arch/x86/include/asm/Kbuild
  arch/x86/include/uapi/asm/Kbuild
  include/asm-generic/Kbuild
  include/config/auto.conf
  include/config/kernel.release
  include/config/tristate.conf
  include/uapi/asm-generic/Kbuild
  include/uapi/Kbuild
  kernel/gen_kheaders.sh

This change is actually motivated for the planned header compile-testing
because it will generate more build artifacts, which should not be
included in the archive.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-07-09 10:10:52 +09:00
Masahiro Yamada
b60b7c2ea9 kheaders: remove meaningless -R option of 'ls'
The -R option of 'ls' is supposed to be used for directories.

   -R, --recursive
          list subdirectories recursively

Since 'find ... -type f' only matches to regular files, we do not
expect directories passed to the 'ls' command here.

Giving -R is harmless at least, but unneeded.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-07-09 10:10:52 +09:00
Linus Torvalds
dad1c12ed8 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Remove the unused per rq load array and all its infrastructure, by
   Dietmar Eggemann.

 - Add utilization clamping support by Patrick Bellasi. This is a
   refinement of the energy aware scheduling framework with support for
   boosting of interactive and capping of background workloads: to make
   sure critical GUI threads get maximum frequency ASAP, and to make
   sure background processing doesn't unnecessarily move to cpufreq
   governor to higher frequencies and less energy efficient CPU modes.

 - Add the bare minimum of tracepoints required for LISA EAS regression
   testing, by Qais Yousef - which allows automated testing of various
   power management features, including energy aware scheduling.

 - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
   kernel used to modify the scheduler's CPU affinity logic such as
   migrate_disable() - introduce the task->cpus_ptr value instead of
   taking the address of &task->cpus_allowed directly - by Sebastian
   Andrzej Siewior.

 - Misc optimizations, fixes, cleanups and small enhancements - see the
   Git log for details.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched/uclamp: Add uclamp support to energy_compute()
  sched/uclamp: Add uclamp_util_with()
  sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
  sched/uclamp: Set default clamps for RT tasks
  sched/uclamp: Reset uclamp values on RESET_ON_FORK
  sched/uclamp: Extend sched_setattr() to support utilization clamping
  sched/core: Allow sched_setattr() to use the current policy
  sched/uclamp: Add system default clamps
  sched/uclamp: Enforce last task's UCLAMP_MAX
  sched/uclamp: Add bucket local max tracking
  sched/uclamp: Add CPU's clamp buckets refcounting
  sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
  sched/debug: Export the newly added tracepoints
  sched/debug: Add sched_overutilized tracepoint
  sched/debug: Add new tracepoint to track PELT at se level
  sched/debug: Add new tracepoints to track PELT at rq level
  sched/debug: Add a new sched_trace_*() helper functions
  sched/autogroup: Make autogroup_path() always available
  sched/wait: Deduplicate code with do-while
  sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
  ...
2019-07-08 16:39:53 -07:00
Linus Torvalds
e192832869 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle are:

   - rwsem scalability improvements, phase #2, by Waiman Long, which are
     rather impressive:

       "On a 2-socket 40-core 80-thread Skylake system with 40 reader
        and writer locking threads, the min/mean/max locking operations
        done in a 5-second testing window before the patchset were:

         40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
         40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

        After the patchset, they became:

         40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
         40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

     There's a lot of changes to the locking implementation that makes
     it similar to qrwlock, including owner handoff for more fair
     locking.

     Another microbenchmark shows how across the spectrum the
     improvements are:

       "With a locking microbenchmark running on 5.1 based kernel, the
        total locking rates (in kops/s) on a 2-socket Skylake system
        with equal numbers of readers and writers (mixed) before and
        after this patchset were:

        # of Threads   Before Patch      After Patch
        ------------   ------------      -----------
             2            2,618             4,193
             4            1,202             3,726
             8              802             3,622
            16              729             3,359
            32              319             2,826
            64              102             2,744"

     The changes are extensive and the patch-set has been through
     several iterations addressing various locking workloads. There
     might be more regressions, but unless they are pathological I
     believe we want to use this new implementation as the baseline
     going forward.

   - jump-label optimizations by Daniel Bristot de Oliveira: the primary
     motivation was to remove IPI disturbance of isolated RT-workload
     CPUs, which resulted in the implementation of batched jump-label
     updates. Beyond the improvement of the real-time characteristics
     kernel, in one test this patchset improved static key update
     overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
     as well.

   - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
     ~10 years of atomic64_t existence the various types used by the
     APIs only had to be self-consistent within each architecture -
     which means they became wildly inconsistent across architectures.
     Mark puts and end to this by reworking all the atomic64
     implementations to use 's64' as the base type for atomic64_t, and
     to ensure that this type is consistently used for parameters and
     return values in the API, avoiding further problems in this area.

   - A large set of small improvements to lockdep by Yuyang Du: type
     cleanups, output cleanups, function return type and othr cleanups
     all around the place.

   - A set of percpu ops cleanups and fixes by Peter Zijlstra.

   - Misc other changes - please see the Git log for more details"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  locking/lockdep: increase size of counters for lockdep statistics
  locking/atomics: Use sed(1) instead of non-standard head(1) option
  locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
  x86/jump_label: Make tp_vec_nr static
  x86/percpu: Optimize raw_cpu_xchg()
  x86/percpu, sched/fair: Avoid local_clock()
  x86/percpu, x86/irq: Relax {set,get}_irq_regs()
  x86/percpu: Relax smp_processor_id()
  x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
  locking/rwsem: Guard against making count negative
  locking/rwsem: Adaptive disabling of reader optimistic spinning
  locking/rwsem: Enable time-based spinning on reader-owned rwsem
  locking/rwsem: Make rwsem->owner an atomic_long_t
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Clarify usage of owner's nonspinaable bit
  locking/rwsem: Wake up almost all readers in wait queue
  locking/rwsem: More optimal RT task handling of null owner
  locking/rwsem: Always release wait_lock before waking up tasks
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Make rwsem_spin_on_owner() return owner state
  ...
2019-07-08 16:12:03 -07:00
Linus Torvalds
46f1ec23a4 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The changes in this cycle are:

   - RCU flavor consolidation cleanups and optmizations

   - Documentation updates

   - Miscellaneous fixes

   - SRCU updates

   - RCU-sync flavor consolidation

   - Torture-test updates

   - Linux-kernel memory-consistency-model updates, most notably the
     addition of plain C-language accesses"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  tools/memory-model: Improve data-race detection
  tools/memory-model: Change definition of rcu-fence
  tools/memory-model: Expand definition of barrier
  tools/memory-model: Do not use "herd" to refer to "herd7"
  tools/memory-model: Fix comment in MP+poonceonces.litmus
  Documentation: atomic_t.txt: Explain ordering provided by smp_mb__{before,after}_atomic()
  rcu: Don't return a value from rcu_assign_pointer()
  rcu: Force inlining of rcu_read_lock()
  rcu: Fix irritating whitespace error in rcu_assign_pointer()
  rcu: Upgrade sync_exp_work_done() to smp_mb()
  rcutorture: Upper case solves the case of the vanishing NULL pointer
  torture: Suppress propagating trace_printk() warning
  rcutorture: Dump trace buffer for callback pipe drain failures
  torture: Add --trust-make to suppress "make clean"
  torture: Make --cpus override idleness calculations
  torture: Run kernel build in source directory
  torture: Add function graph-tracing cheat sheet
  torture: Capture qemu output
  rcutorture: Tweak kvm options
  rcutorture: Add trivial RCU implementation
  ...
2019-07-08 15:45:14 -07:00
Linus Torvalds
0902d5011c Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x96 apic updates from Thomas Gleixner:
 "Updates for the x86 APIC interrupt handling and APIC timer:

   - Fix a long standing issue with spurious interrupts which was caused
     by the big vector management rework a few years ago. Robert Hodaszi
     provided finally enough debug data and an excellent initial failure
     analysis which allowed to understand the underlying issues.

     This contains a change to the core interrupt management code which
     is required to handle this correctly for the APIC/IO_APIC. The core
     changes are NOOPs for most architectures except ARM64. ARM64 is not
     impacted by the change as confirmed by Marc Zyngier.

   - Newer systems allow to disable the PIT clock for power saving
     causing panic in the timer interrupt delivery check of the IO/APIC
     when the HPET timer is not enabled either. While the clock could be
     turned on this would cause an endless whack a mole game to chase
     the proper register in each affected chipset.

     These systems provide the relevant frequencies for TSC, CPU and the
     local APIC timer via CPUID and/or MSRs, which allows to avoid the
     PIT/HPET based calibration. As the calibration code is the only
     usage of the legacy timers on modern systems and is skipped anyway
     when the frequencies are known already, there is no point in
     setting up the PIT and actually checking for the interrupt delivery
     via IO/APIC.

     To achieve this on a wide variety of platforms, the CPUID/MSR based
     frequency readout has been made more robust, which also allowed to
     remove quite some workarounds which turned out to be not longer
     required. Thanks to Daniel Drake for analysis, patches and
     verification"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/irq: Seperate unused system vectors from spurious entry again
  x86/irq: Handle spurious interrupt after shutdown gracefully
  x86/ioapic: Implement irq_get_irqchip_state() callback
  genirq: Add optional hardware synchronization for shutdown
  genirq: Fix misleading synchronize_irq() documentation
  genirq: Delay deactivation in free_irq()
  x86/timer: Skip PIT initialization on modern chipsets
  x86/apic: Use non-atomic operations when possible
  x86/apic: Make apic_bsp_setup() static
  x86/tsc: Set LAPIC timer period to crystal clock frequency
  x86/apic: Rename 'lapic_timer_frequency' to 'lapic_timer_period'
  x86/tsc: Use CPUID.0x16 to calculate missing crystal frequency
2019-07-08 11:22:57 -07:00
Linus Torvalds
927ba67a63 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The timer and timekeeping departement delivers:

  Core:

   - The consolidation of the VDSO code into a generic library including
     the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
     route through the relevant maintainer trees and should end up in
     5.4.

     This gets rid of the unnecessary different copies of the same code
     and brings all architectures on the same level of VDSO
     functionality.

   - Make the NTP user space interface more robust by restricting the
     TAI offset to prevent undefined behaviour. Includes a selftest.

   - Validate user input in the compat settimeofday() syscall to catch
     invalid values which would be turned into valid values by a
     multiplication overflow

   - Consolidate the time accessors

   - Small fixes, improvements and cleanups all over the place

  Drivers:

   - Support for the NXP system counter, TI davinci timer

   - Move the Microsoft HyperV clocksource/events code into the
     drivers/clocksource directory so it can be shared between x86 and
     ARM64.

   - Overhaul of the Tegra driver

   - Delay timer support for IXP4xx

   - Small fixes, improvements and cleanups as usual"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  time: Validate user input in compat_settimeofday()
  timer: Document TIMER_PINNED
  clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
  clocksource/drivers: Make Hyper-V clocksource ISA agnostic
  MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
  hrtimer: Use a bullet for the returns bullet list
  arm64: vdso: Fix compilation with clang older than 8
  arm64: compat: Fix __arch_get_hw_counter() implementation
  arm64: Fix __arch_get_hw_counter() implementation
  lib/vdso: Make delta calculation work correctly
  MAINTAINERS: Add entry for the generic VDSO library
  arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
  arm64: vdso: Remove unnecessary asm-offsets.c definitions
  vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
  clocksource/drivers/davinci: Add support for clocksource
  clocksource/drivers/davinci: Add support for clockevents
  clocksource/drivers/tegra: Set up maximum-ticks limit properly
  clocksource/drivers/tegra: Cycles can't be 0
  clocksource/drivers/tegra: Restore base address before cleanup
  clocksource/drivers/tegra: Add verbose definition for 1MHz constant
  ...
2019-07-08 11:06:29 -07:00
Linus Torvalds
2a1ccd3142 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq departement provides the usual mixed bag:

  Core:

   - Further improvements to the irq timings code which aims to predict
     the next interrupt for power state selection to achieve better
     latency/power balance

   - Add interrupt statistics to the core NMI handlers

   - The usual small fixes and cleanups

  Drivers:

   - Support for Renesas RZ/A1, Annapurna Labs FIC, Meson-G12A SoC and
     Amazon Gravition AMR/GIC interrupt controllers.

   - Rework of the Renesas INTC controller driver

   - ACPI support for Socionext SoCs

   - Enhancements to the CSKY interrupt controller

   - The usual small fixes and cleanups"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  irq/irqdomain: Fix comment typo
  genirq: Update irq stats from NMI handlers
  irqchip/gic-pm: Remove PM_CLK dependency
  irqchip/al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver
  dt-bindings: interrupt-controller: Add Amazon's Annapurna Labs FIC
  softirq: Use __this_cpu_write() in takeover_tasklets()
  irqchip/mbigen: Stop printing kernel addresses
  irqchip/gic: Add dependency for ARM_GIC_MAX_NR
  genirq/affinity: Remove unused argument from [__]irq_build_affinity_masks()
  genirq/timings: Add selftest for next event computation
  genirq/timings: Add selftest for irqs circular buffer
  genirq/timings: Add selftest for circular array
  genirq/timings: Encapsulate storing function
  genirq/timings: Encapsulate timings push
  genirq/timings: Optimize the period detection speed
  genirq/timings: Fix timings buffer inspection
  genirq/timings: Fix next event index function
  irqchip/qcom: Use struct_size() in devm_kzalloc()
  irqchip/irq-csky-mpintc: Remove unnecessary loop in interrupt handler
  dt-bindings: interrupt-controller: Update csky mpintc
  ...
2019-07-08 11:01:13 -07:00
Linus Torvalds
e0e86b111b Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP/hotplug updates from Thomas Gleixner:
 "A small set of updates for SMP and CPU hotplug:

   - Abort disabling secondary CPUs in the freezer when a wakeup is
     pending instead of evaluating it only after all CPUs have been
     offlined.

   - Remove the shared annotation for the strict per CPU cfd_data in the
     smp function call core code.

   - Remove the return values of smp_call_function() and on_each_cpu()
     as they are unconditionally 0. Fixup the few callers which actually
     bothered to check the return value"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Remove smp_call_function() and on_each_cpu() return values
  smp: Do not mark call_function_data as shared
  cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending
  cpu/hotplug: Fix notify_cpu_starting() reference in bringup_wait_for_ap()
2019-07-08 10:39:56 -07:00
Linus Torvalds
1758feddb0 s390 updates for the 5.3 merge window
- Improve stop_machine wait logic: replace cpu_relax_yield call in generic
    stop_machine function with a weak stop_machine_yield function. This is
    overridden on s390, which yields the current cpu to the neighbouring cpu
    after a couple of retries, instead of blindly giving up the cpu to the
    hipervisor. This significantly improves stop_machine performance on s390 in
    overcommitted scenarios.
    This includes common code changes which have been Acked by Peter Zijlstra
    and Thomas Gleixner.
 
  - Improve jump label transformation speed: transform jump labels without
    using stop_machine.
 
  - Refactoring of the vfio-ccw cp handling, simplifying the code and
    avoiding unneeded allocating/copying.
 
  - Various vfio-ccw fixes (ccw translation, state machine).
 
  - Add support for vfio-ap queue interrupt control in the guest.
    This includes s390 kvm changes which have been Acked by Christian
    Borntraeger.
 
  - Add protected virtualization support for virtio-ccw.
 
  - Enforce both CONFIG_SMP and CONFIG_HOTPLUG_CPU, which allows to remove some
    code which most likely isn't working at all, besides that s390 didn't even
    compile for !CONFIG_SMP.
 
  - Support for special flagged EP11 CPRBs for zcrypt.
 
  - Handle PCI devices with no support for new MIO instructions.
 
  - Avoid KASAN false positives in reworked stack unwinder.
 
  - Couple of fixes for the QDIO layer.
 
  - Convert s390 specific documentation to ReST format.
 
  - Let s390 crypto modules return -ENODEV instead of -EOPNOTSUPP if hardware is
    missing. This way our modules behave like most other modules and which is
    also what systemd's systemd-modules-load.service expects.
 
  - Replace defconfig with performance_defconfig, so there is one config file
    less to maintain.
 
  - Remove the SCLP call home device driver, which was never useful.
 
  - Cleanups all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAl0iEpcACgkQjYWKoQLX
 FBgtZwf8DOJ6COUG91jKP0RSDlc2YvIMBxopQ38ql1lIsTj5t6DvJ2z3X5uct1wy
 6mMiF01VuyD4V4UXbTJQrihzNx7D4dUh47s2sS+diGHxJyXacVxlmjS5k+6pLIUO
 AyLvtCcoqDPPiThqnSTZFRm/TcfO/25fCG/IdjrFGj1MD09wHpUCh16tmRPTGFlC
 BWZeilDT77fVXnh7Ggn3JB0mQay5PAw2ODOxELHTUBaLmYF8RJPPVKBPmXGl9P1W
 84ESm2p+iALGGWDiTOUad9eu8wyQci/V/R+hFgs0Bz/HRcjznNH5EVvfQNCD4VNF
 g/PET10nIQYZv2BNdi0cwRjR9jCFbw==
 =jp0i
 -----END PGP SIGNATURE-----

Merge tag 's390-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Improve stop_machine wait logic: replace cpu_relax_yield call in
   generic stop_machine function with a weak stop_machine_yield
   function. This is overridden on s390, which yields the current cpu to
   the neighbouring cpu after a couple of retries, instead of blindly
   giving up the cpu to the hipervisor. This significantly improves
   stop_machine performance on s390 in overcommitted scenarios.

   This includes common code changes which have been Acked by Peter
   Zijlstra and Thomas Gleixner.

 - Improve jump label transformation speed: transform jump labels
   without using stop_machine.

 - Refactoring of the vfio-ccw cp handling, simplifying the code and
   avoiding unneeded allocating/copying.

 - Various vfio-ccw fixes (ccw translation, state machine).

 - Add support for vfio-ap queue interrupt control in the guest. This
   includes s390 kvm changes which have been Acked by Christian
   Borntraeger.

 - Add protected virtualization support for virtio-ccw.

 - Enforce both CONFIG_SMP and CONFIG_HOTPLUG_CPU, which allows to
   remove some code which most likely isn't working at all, besides that
   s390 didn't even compile for !CONFIG_SMP.

 - Support for special flagged EP11 CPRBs for zcrypt.

 - Handle PCI devices with no support for new MIO instructions.

 - Avoid KASAN false positives in reworked stack unwinder.

 - Couple of fixes for the QDIO layer.

 - Convert s390 specific documentation to ReST format.

 - Let s390 crypto modules return -ENODEV instead of -EOPNOTSUPP if
   hardware is missing. This way our modules behave like most other
   modules and which is also what systemd's systemd-modules-load.service
   expects.

 - Replace defconfig with performance_defconfig, so there is one config
   file less to maintain.

 - Remove the SCLP call home device driver, which was never useful.

 - Cleanups all over the place.

* tag 's390-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (83 commits)
  docs: s390: s390dbf: typos and formatting, update crash command
  docs: s390: unify and update s390dbf kdocs at debug.c
  docs: s390: restore important non-kdoc parts of s390dbf.rst
  vfio-ccw: Fix the conversion of Format-0 CCWs to Format-1
  s390/pci: correctly handle MIO opt-out
  s390/pci: deal with devices that have no support for MIO instructions
  s390: ap: kvm: Enable PQAP/AQIC facility for the guest
  s390: ap: implement PAPQ AQIC interception in kernel
  vfio: ap: register IOMMU VFIO notifier
  s390: ap: kvm: add PQAP interception for AQIC
  s390/unwind: cleanup unused READ_ONCE_TASK_STACK
  s390/kasan: avoid false positives during stack unwind
  s390/qdio: don't touch the dsci in tiqdio_add_input_queues()
  s390/qdio: (re-)initialize tiqdio list entries
  s390/dasd: Fix a precision vs width bug in dasd_feature_list()
  s390/cio: introduce driver_override on the css bus
  vfio-ccw: make convert_ccw0_to_ccw1 static
  vfio-ccw: Remove copy_ccw_from_iova()
  vfio-ccw: Factor out the ccw0-to-ccw1 transition
  vfio-ccw: Copy CCW data outside length calculation
  ...
2019-07-08 10:06:12 -07:00
Linus Torvalds
dfd437a257 arm64 updates for 5.3:
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
 
 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly
 
 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)
 
 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG
   and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)
 
 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)
 
 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic
 
 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms
 
 - perf: DDR performance monitor support for iMX8QXP
 
 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers
 
 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent
 
 - arm64 do_page_fault() and hugetlb cleanups
 
 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
 
 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags'
   introduced in 5.1)
 
 - CONFIG_RANDOMIZE_BASE now enabled in defconfig
 
 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area
 
 - Make ZONE_DMA32 configurable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl0eHqcACgkQa9axLQDI
 XvFyNA/+L+bnkz8m3ncydlqqfXomQn4eJJVQ8Uksb0knJz+1+3CUxxbO4ry4jXZN
 fMkbggYrDPRKpDbsUl0lsRipj7jW9bqan+N37c3SWqCkgb6HqDaHViwxdx6Ec/Uk
 gHudozDSPh/8c7hxGcSyt/CFyuW6b+8eYIQU5rtIgz8aVY2BypBvS/7YtYCbIkx0
 w4CFleRTK1zXD5mJQhrc6jyDx659sVkrAvdhf6YIymOY8nBTv40vwdNo3beJMYp8
 Po/+0Ixu+VkHUNtmYYZQgP/AGH96xiTcRnUqd172JdtRPpCLqnLqwFokXeVIlUKT
 KZFMDPzK+756Ayn4z4huEePPAOGlHbJje8JVNnFyreKhVVcCotW7YPY/oJR10bnc
 eo7yD+DxABTn+93G2yP436bNVa8qO1UqjOBfInWBtnNFJfANIkZweij/MQ6MjaTA
 o7KtviHnZFClefMPoiI7HDzwL8XSmsBDbeQ04s2Wxku1Y2xUHLx4iLmadwLQ1ZPb
 lZMTZP3N/T1554MoURVA1afCjAwiqU3bt1xDUGjbBVjLfSPBAn/25IacsG9Li9AF
 7Rp1M9VhrfLftjFFkB2HwpbhRASOxaOSx+EI3kzEfCtM2O9I1WHgP3rvCdc3l0HU
 tbK0/IggQicNgz7GSZ8xDlWPwwSadXYGLys+xlMZEYd3pDIOiFc=
 =0TDT
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}

 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly

 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)

 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
   XAFLAG and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)

 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)

 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic

 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms

 - perf: DDR performance monitor support for iMX8QXP

 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers

 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent

 - arm64 do_page_fault() and hugetlb cleanups

 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)

 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
   'arm_boot_flags' introduced in 5.1)

 - CONFIG_RANDOMIZE_BASE now enabled in defconfig

 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area

 - Make ZONE_DMA32 configurable

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
  perf: arm_spe: Enable ACPI/Platform automatic module loading
  arm_pmu: acpi: spe: Add initial MADT/SPE probing
  ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
  ACPI/PPTT: Modify node flag detection to find last IDENTICAL
  x86/entry: Simplify _TIF_SYSCALL_EMU handling
  arm64: rename dump_instr as dump_kernel_instr
  arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
  arm64: Implement panic_smp_self_stop()
  arm64: Improve parking of stopped CPUs
  arm64: Expose FRINT capabilities to userspace
  arm64: Expose ARMv8.5 CondM capability to userspace
  arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
  arm64: ARM64_MODULES_PLTS must depend on MODULES
  arm64: bpf: do not allocate executable memory
  arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
  arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
  arm64: module: create module allocations without exec permissions
  arm64: Allow user selection of ARM64_MODULE_PLTS
  acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
  arm64: Allow selecting Pseudo-NMI again
  ...
2019-07-08 09:54:55 -07:00
Ingo Molnar
552a031ba1 Linux 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0idTweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGZesIAJDKicw2Voyx8K8m
 3pXSK+71RuO/d3Y9M51mdfTMKRP4PHR9/4wVZ9wHPwC4dV6wxgsmIYCF69a1Wety
 LD1MpDCP1DK5wVfPNKVX2xmj7ua6iutPtSsJHzdzM2TlscgsrFKjmUccqJ5JLwL5
 c34nqwXWnzzRyI5Ga9cQSlwzAXq0vDHXyML3AnCosSsLX0lKFrHlK1zttdOPNkfj
 dXRN62g3q+9kVQozzhDXb8atZZ7IkBk8Q0lujpNXW83Ci1VjaVNv3SB8GZTXIlLj
 U15VdyuwfJDfpBgFBN6/unzVaAB6FFrEKy0jT1aeTyKarMKDKgOnJjn10aKjDNno
 /bXsKKc=
 =TVqV
 -----END PGP SIGNATURE-----

Merge tag 'v5.2' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-07-08 18:04:41 +02:00
YueHaibing
6705fea0c7 bpf: cgroup: Fix build error without CONFIG_NET
If CONFIG_NET is not set and CONFIG_CGROUP_BPF=y,
gcc building fails:

kernel/bpf/cgroup.o: In function `cg_sockopt_func_proto':
cgroup.c:(.text+0x237e): undefined reference to `bpf_sk_storage_get_proto'
cgroup.c:(.text+0x2394): undefined reference to `bpf_sk_storage_delete_proto'
kernel/bpf/cgroup.o: In function `__cgroup_bpf_run_filter_getsockopt':
(.text+0x2a1f): undefined reference to `lock_sock_nested'
(.text+0x2ca2): undefined reference to `release_sock'
kernel/bpf/cgroup.o: In function `__cgroup_bpf_run_filter_setsockopt':
(.text+0x3006): undefined reference to `lock_sock_nested'
(.text+0x32bb): undefined reference to `release_sock'

Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Stanislav Fomichev <sdf@fomichev.me>
Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-08 17:17:00 +02:00
Rafael J. Wysocki
3dbeb44854 Merge branch 'pm-sleep'
* pm-sleep:
  PM: sleep: Drop dev_pm_skip_next_resume_phases()
  ACPI: PM: Drop unused function and function header
  ACPI: PM: Introduce "poweroff" callbacks for ACPI PM domain and LPSS
  ACPI: PM: Simplify and fix PM domain hibernation callbacks
  PCI: PM: Simplify bus-level hibernation callbacks
  PM: ACPI/PCI: Resume all devices during hibernation
  kernel: power: swap: use kzalloc() instead of kmalloc() followed by memset()
  PM: sleep: Update struct wakeup_source documentation
  drivers: base: power: remove wakeup_sources_stats_dentry variable
  PM: suspend: Rename pm_suspend_via_s2idle()
  PM: sleep: Show how long dpm_suspend_start() and dpm_suspend_end() take
  PM: hibernate: powerpc: Expose pfn_is_nosave() prototype
2019-07-08 10:51:25 +02:00
zhengbin
9176ab1b84 time: Validate user input in compat_settimeofday()
The user value is validated after converting the timeval to a timespec, but
for a wide range of negative tv_usec values the multiplication overflow turns
them in positive numbers. So the 'validated later' is not catching the
invalid input.

Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1562460701-113301-1-git-send-email-zhengbin13@huawei.com
2019-07-07 12:05:40 +02:00
Zenghui Yu
3a1d24ca95 irq/irqdomain: Fix comment typo
Fix typo in the comment on top of __irq_domain_add().

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1562388072-23492-1-git-send-email-yuzenghui@huawei.com
2019-07-06 10:40:20 +02:00
Shijith Thotton
c09cb12935 genirq: Update irq stats from NMI handlers
The NMI handlers handle_percpu_devid_fasteoi_nmi() and handle_fasteoi_nmi()
do not update the interrupt counts. Due to that the NMI interrupt count
does not show up correctly in /proc/interrupts.

Add the statistics and treat the NMI handlers in the same way as per cpu
interrupts and prevent them from updating irq_desc::tot_count as this might
be corrupted due to concurrency.

[ tglx: Massaged changelog ]

Fixes: 2dcf1fbcad ("genirq: Provide NMI handlers")
Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1562313336-11888-1-git-send-email-sthotton@marvell.com
2019-07-06 10:40:19 +02:00
David S. Miller
c4cde5804d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b38932 ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
  static int mlx5e_open_cq(struct mlx5e_channel *c,
                           struct dim_cq_moder moder,
                           struct mlx5e_cq_param *param,
                           struct mlx5e_cq *cq)
  =======
  int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
                    struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
  >>>>>>> e5a3e259ef

Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...

  drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

  drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... needs the same rename from net_dim_cq_moder into dim_cq_moder.

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
          int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
          struct dim_cq_moder icocq_moder = {0, 0};
          struct net_device *netdev = priv->netdev;
          struct mlx5e_channel *c;
          unsigned int irq;
  =======
          struct net_dim_cq_moder icocq_moder = {0, 0};
  >>>>>>> e5a3e259ef

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
   setsockopt along with a new sockopt program type which allows more
   fine-grained pass/reject settings for containers. Also add a sock_ops
   callback that can be selectively enabled on a per-socket basis and is
   executed for every RTT to help tracking TCP statistics, both features
   from Stanislav.

3) Follow-up fix from loops in precision tracking which was not propagating
   precision marks and as a result verifier assumed that some branches were
   not taken and therefore wrongly removed as dead code, from Alexei.

4) Fix BPF cgroup release synchronization race which could lead to a
   double-free if a leaf's cgroup_bpf object is released and a new BPF
   program is attached to the one of ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
   in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
   bpf_redirect_map() as lookup is now performed right away in the helper
   itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
   Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04 12:48:21 -07:00
Jann Horn
6994eefb00 ptrace: Fix ->ptracer_cred handling for PTRACE_TRACEME
Fix two issues:

When called for PTRACE_TRACEME, ptrace_link() would obtain an RCU
reference to the parent's objective credentials, then give that pointer
to get_cred().  However, the object lifetime rules for things like
struct cred do not permit unconditionally turning an RCU reference into
a stable reference.

PTRACE_TRACEME records the parent's credentials as if the parent was
acting as the subject, but that's not the case.  If a malicious
unprivileged child uses PTRACE_TRACEME and the parent is privileged, and
at a later point, the parent process becomes attacker-controlled
(because it drops privileges and calls execve()), the attacker ends up
with control over two processes with a privileged ptrace relationship,
which can be abused to ptrace a suid binary and obtain root privileges.

Fix both of these by always recording the credentials of the process
that is requesting the creation of the ptrace relationship:
current_cred() can't change under us, and current is the proper subject
for access control.

This change is theoretically userspace-visible, but I am not aware of
any code that it will actually break.

Fixes: 64b875f7ac ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 02:00:41 +09:00
Linus Torvalds
550d1f5bda This includes three fixes:
- Fixes a deadlock from a previous fix to keep module loading
    and function tracing text modifications from stepping on each other.
    (this has a few patches to help document the issue in comments)
 
  - Fix a crash when the snapshot buffer gets out of sync with the
    main ring buffer.
 
  - Fix a memory leak when reading the memory logs
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXRzBCBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnDaAP9qTFBOFtgIGCT5wVP8xjQeESxh1b8R
 tbaT7/U2oPpeiwEAvp1mYo5UYcc8KauBqVaLSLJVN4pv07xiZF5Qgh9C1QE=
 =m2IT
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "This includes three fixes:

   - Fix a deadlock from a previous fix to keep module loading and
     function tracing text modifications from stepping on each other
     (this has a few patches to help document the issue in comments)

   - Fix a crash when the snapshot buffer gets out of sync with the main
     ring buffer

   - Fix a memory leak when reading the memory logs"

* tag 'trace-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace/x86: Anotate text_mutex split between ftrace_arch_code_modify_post_process() and ftrace_arch_code_modify_prepare()
  tracing/snapshot: Resize spare buffer if size changed
  tracing: Fix memory leak in tracing_err_log_open()
  ftrace/x86: Add a comment to why we take text_mutex in ftrace_arch_code_modify_prepare()
  ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()
2019-07-04 10:26:17 +09:00
David S. Miller
c3ead2df97 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-07-03

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH
   on BE architectures, from Jiong.

2) Fix several bugs in the x32 BPF JIT for handling shifts by 0,
   from Luke and Xi.

3) Fix NULL pointer deref in btf_type_is_resolve_source_only(),
   from Stanislav.

4) Properly handle the check that forwarding is enabled on the device
   in bpf_ipv6_fib_lookup() helper code, from Anton.

5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit
   alignment such as m68k, from Baruch.

6) Fix kernel hanging in unregister_netdevice loop while unregistering
   device bound to XDP socket, from Ilya.

7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan.

8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri.

9) Fix bpftool to use correct argument in cgroup errors, from Jakub.

10) Fix detaching dummy prog in XDP redirect sample code, from Prashant.

11) Add Jonathan to AF_XDP reviewers, from Björn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 12:09:00 -07:00
Greg Kroah-Hartman
1be51474f9 swiotlb: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612144314.GA16803@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-03 16:57:18 +02:00
Alexei Starovoitov
a3ce685dd0 bpf: fix precision tracking
When equivalent state is found the current state needs to propagate precision marks.
Otherwise the verifier will prune the search incorrectly.

There is a price for correctness:
                      before      before    broken    fixed
                      cnst spill  precise   precise
bpf_lb-DLB_L3.o       1923        8128      1863      1898
bpf_lb-DLB_L4.o       3077        6707      2468      2666
bpf_lb-DUNKNOWN.o     1062        1062      544       544
bpf_lxc-DDROP_ALL.o   166729      380712    22629     36823
bpf_lxc-DUNKNOWN.o    174607      440652    28805     45325
bpf_netdev.o          8407        31904     6801      7002
bpf_overlay.o         5420        23569     4754      4858
bpf_lxc_jit.o         39389       359445    50925     69631
Overall precision tracking is still very effective.

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Reported-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Tested-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 11:12:14 +02:00
Thomas Gleixner
3419240495 Merge branch 'timers/vdso' into timers/core
so the hyper-v clocksource update can be applied.
2019-07-03 10:50:21 +02:00
Thomas Gleixner
62e0468650 genirq: Add optional hardware synchronization for shutdown
free_irq() ensures that no hardware interrupt handler is executing on a
different CPU before actually releasing resources and deactivating the
interrupt completely in a domain hierarchy.

But that does not catch the case where the interrupt is on flight at the
hardware level but not yet serviced by the target CPU. That creates an
interesing race condition:

   CPU 0                  CPU 1               IRQ CHIP

                                              interrupt is raised
                                              sent to CPU1
			  Unable to handle
			  immediately
			  (interrupts off,
			   deep idle delay)
   mask()
   ...
   free()
     shutdown()
     synchronize_irq()
     release_resources()
                          do_IRQ()
                            -> resources are not available

That might be harmless and just trigger a spurious interrupt warning, but
some interrupt chips might get into a wedged state.

Utilize the existing irq_get_irqchip_state() callback for the
synchronization in free_irq().

synchronize_hardirq() is not using this mechanism as it might actually
deadlock unter certain conditions, e.g. when called with interrupts
disabled and the target CPU is the one on which the synchronization is
invoked. synchronize_irq() uses it because that function cannot be called
from non preemtible contexts as it might sleep.

No functional change intended and according to Marc the existing GIC
implementations where the driver supports the callback should be able
to cope with that core change. Famous last words.

Fixes: 464d12309e ("x86/vector: Switch IOAPIC to global reservation mode")
Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.279463375@linutronix.de
2019-07-03 10:12:29 +02:00
Thomas Gleixner
1d21f2af85 genirq: Fix misleading synchronize_irq() documentation
The function might sleep, so it cannot be called from interrupt
context. Not even with care.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.189241552@linutronix.de
2019-07-03 10:12:29 +02:00
Thomas Gleixner
4001d8e876 genirq: Delay deactivation in free_irq()
When interrupts are shutdown, they are immediately deactivated in the
irqdomain hierarchy. While this looks obviously correct there is a subtle
issue:

There might be an interrupt in flight when free_irq() is invoking the
shutdown. This is properly handled at the irq descriptor / primary handler
level, but the deactivation might completely disable resources which are
required to acknowledge the interrupt.

Split the shutdown code and deactivate the interrupt after synchronization
in free_irq(). Fixup all other usage sites where this is not an issue to
invoke the combined shutdown_and_deactivate() function instead.

This still might be an issue if the interrupt in flight servicing is
delayed on a remote CPU beyond the invocation of synchronize_irq(), but
that cannot be handled at that level and needs to be handled in the
synchronize_irq() context.

Fixes: f8264e3496 ("irqdomain: Introduce new interfaces to support hierarchy irqdomains")
Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190628111440.098196390@linutronix.de
2019-07-03 10:12:28 +02:00
Thomas Gleixner
7e8e6816c6 stacktrace: Use PF_KTHREAD to check for kernel threads
!current->mm is not a reliable indicator for kernel threads as they might
temporarily use a user mm. Check for PF_KTHREAD instead.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907021750100.1802@nanos.tec.linutronix.de
2019-07-03 09:04:06 +02:00
Jason Gunthorpe
cc5dfd59e3 Merge branch 'hmm-devmem-cleanup.4' into rdma.git hmm
Christoph Hellwig says:

====================
Below is a series that cleans up the dev_pagemap interface so that it is
more easily usable, which removes the need to wrap it in hmm and thus
allowing to kill a lot of code

Changes since v3:
 - pull in "mm/swap: Fix release_pages() when releasing devmap pages" and
   rebase the other patches on top of that
 - fold the hmm_devmem_add_resource into the DEVICE_PUBLIC memory removal
   patch
 - remove _vm_normal_page as it isn't needed without DEVICE_PUBLIC memory
 - pick up various ACKs

Changes since v2:
 - fix nvdimm kunit build
 - add a new memory type for device dax
 - fix a few issues in intermediate patches that didn't show up in the end
   result
 - incorporate feedback from Michal Hocko, including killing of
   the DEVICE_PUBLIC memory type entirely

Changes since v1:
 - rebase
 - also switch p2pdma to the internal refcount
 - add type checking for pgmap->type
 - rename the migrate method to migrate_to_ram
 - cleanup the altmap_valid flag
 - various tidbits from the reviews
====================

Conflicts resolved by:
 - Keeping Ira's version of the code in swap.c
 - Using the delete for the section in hmm.rst
 - Using the delete for the devmap code in hmm.c and .h

* branch 'hmm-devmem-cleanup.4': (24 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 15:10:45 -03:00
Jason Gunthorpe
9ec3f4cb35 Linux 5.2-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0YK7ceHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWfcH/36ep8GZHY9H1ARV
 RJJoGoMnwENoq2o4eKhH3iZgUIGPq2uonazequhePwnIsrOdFGT7AeMHWSW7W0o4
 wNlNFdUrTe0bvU00m+YtDwNIqgNCnFEoUbqn9H+VhAAWpSydKvhh2mlebTFO50KN
 hb9+jh59Q8tbxrQdCuNF6yJATdf4hcj1V/ZZMGgF34kx+dFY4wOooSfu/eaIxXIl
 fBDKN9K4Mmw8HWJvebV+ocOMZ7Zqknt1lbjx69OxpJmgxhb2Ks7heqSZanLTBPBB
 oZxOlEdNPSyOjBQUlsDC2S8VJ7g5gINZk1JcFjByzE7cIPOQ2UXE72R++wwANngm
 SR054NQ=
 =WlA8
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc7' into rdma.git hmm

Required for dependencies in the next patches.
2019-07-02 14:34:43 -03:00
Christoph Hellwig
24917f6b10 memremap: provide an optional internal refcount in struct dev_pagemap
Provide an internal refcounting logic if no ->ref field is provided
in the pagemap passed into devm_memremap_pages so that callers don't
have to reinvent it poorly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
514caf23a7 memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
Add a flags field to struct dev_pagemap to replace the altmap_valid
boolean to be a little more extensible.  Also add a pgmap_altmap() helper
to find the optional altmap and clean up the code using the altmap using
it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
80a72d0af0 memremap: remove the data field in struct dev_pagemap
struct dev_pagemap is always embedded into a containing structure, so
there is no need to an additional private data field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
897e6365cd memremap: add a migrate_to_ram method to struct dev_pagemap_ops
This replaces the hacky ->fault callback, which is currently directly
called from common code through a hmm specific data structure as an
exercise in layering violations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
f6a55e1a3f memremap: lift the devmap_enable manipulation into devm_memremap_pages
Just check if there is a ->page_free operation set and take care of the
static key enable, as well as the put using device managed resources.
Also check that a ->page_free is provided for the pgmaps types that
require it, and check for a valid type as well while we are at it.

Note that this also fixes the fact that hmm never called
dev_pagemap_put_ops and thus would leave the slow path enabled forever,
even after a device driver unload or disable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
d8668bb045 memremap: pass a struct dev_pagemap to ->kill and ->cleanup
Passing the actual typed structure leads to more understandable code
vs just passing the ref member.

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
1e240e8d4a memremap: move dev_pagemap callbacks into a separate structure
The dev_pagemap is a growing too many callbacks.  Move them into a
separate ops structure so that they are not duplicated for multiple
instances, and an attacker can't easily overwrite them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
3ed2dcdf54 memremap: validate the pagemap type passed to devm_memremap_pages
Most pgmap types are only supported when certain config options are
enabled.  Check for a type that is valid for the current configuration
before setting up the pagemap.  For this the usage of the 0 type for
device dax gets replaced with an explicit MEMORY_DEVICE_DEVDAX type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
0092908d16 mm: factor out a devm_request_free_mem_region helper
Keep the physical address allocation that hmm_add_device does with the
rest of the resource code, and allow future reuse of it without the hmm
wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christian Brauner
28dd29c06d
fork: return proper negative error code
Make sure to return a proper negative error code from copy_process()
when anon_inode_getfile() fails with CLONE_PIDFD.
Otherwise _do_fork() will not detect an error and get_task_pid() will
operator on a nonsensical pointer:

R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372
Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c
e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74
08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00
RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203
RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8
R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc
R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d
FS:  00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  _do_fork+0x1b9/0x5f0 kernel/fork.c:2360
  __do_sys_clone kernel/fork.c:2454 [inline]
  __se_sys_clone kernel/fork.c:2448 [inline]
  __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448
  do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com
Reported-and-tested-by: syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com
Fixes: 6fd2fe494b ("copy_process(): don't use ksys_close() on cleanups")
Cc: Jann Horn <jannh@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-07-01 16:43:30 +02:00
Jens Axboe
5be1f9d82f Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into for-5.3/block

Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.

* tag 'v5.2-rc6': (482 commits)
  Linux 5.2-rc6
  Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
  Bluetooth: Fix regression with minimum encryption key size alignment
  tcp: refine memory limit test in tcp_fragment()
  x86/vdso: Prevent segfaults due to hoisted vclock reads
  SUNRPC: Fix a credential refcount leak
  Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
  net :sunrpc :clnt :Fix xps refcount imbalance on the error path
  NFS4: Only set creation opendata if O_CREAT
  ARM: 8867/1: vdso: pass --be8 to linker if necessary
  KVM: nVMX: reorganize initial steps of vmx_set_nested_state
  KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
  habanalabs: use u64_to_user_ptr() for reading user pointers
  nfsd: replace Jeff by Chuck as nfsd co-maintainer
  inet: clear num_timeout reqsk_alloc()
  PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
  net: mvpp2: debugfs: Add pmap to fs dump
  ipv6: Default fib6_type to RTN_UNICAST when not set
  net: hns3: Fix inconsistent indenting
  net/af_iucv: always register net_device notifier
  ...
2019-07-01 08:16:08 -06:00
Prakhar Srivastava
6a31fcd4cf KEXEC: Call ima_kexec_cmdline to measure the boot command line args
During soft reboot(kexec_file_load) boot command line
arguments are not measured.

Call ima hook ima_kexec_cmdline to measure the boot command line
arguments into IMA measurement list.

- call ima_kexec_cmdline from kexec_file_load.
- move the call ima_add_kexec_buffer after the cmdline
args have been measured.

Signed-off-by: Prakhar Srivastava <prsriva02@gmail.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2019-06-30 17:54:39 -04:00
Linus Torvalds
7c15f41e87 Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP fixes from Thomas Gleixner:
 "Two small changes for the cpu hotplug code:

   - Prevent out of bounds access which actually might crash the machine
     caused by a missing bounds check in the fail injection code

   - Warn about unsupported migitation mode command line arguments to
     make people aware that they typoed the paramater. Not necessarily a
     fix but quite some people tripped over that"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Fix out-of-bounds read when setting fail state
  cpu/speculation: Warn on unsupported mitigations= parameter
2019-06-30 11:19:17 +08:00
Linus Torvalds
57103eb7c6 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Various fixes, most of them related to bugs perf fuzzing found in the
  x86 code"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/regs: Use PERF_REG_EXTENDED_MASK
  perf/x86: Remove pmu->pebs_no_xmm_regs
  perf/x86: Clean up PEBS_XMM_REGS
  perf/x86/regs: Check reserved bits
  perf/x86: Disable extended registers for non-supported PMUs
  perf/ioctl: Add check for the sample_period value
  perf/core: Fix perf_sample_regs_user() mm check
2019-06-29 19:39:17 +08:00
Linus Torvalds
2407e48606 Power management fix for 5.2-rc7
Avoid skipping bus-level PCI power management during system
 resume for PCIe ports left in D0 during the preceding suspend
 transition on platforms where the power states of those ports
 can change out of the PCI layer's control.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl0XJzcSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxylwQAI8owd3eQV6UNDybkT5MiP0lWb9nbl83
 2ouxla+FtAzRFJC0yW4RK86cW4i/Yl8767KV2yqX/69ftmz4XhZBJ63ijKAEoG6o
 tHFyY7twy7Sr0MvPRD9rtjUkmdOx9z0OFKHgLhSzC/V4PvgGZTt+eYBm1Bp3icZp
 ZY9CFx/bSt9tURY//VqXhvBWT6pEpn1B1D7hsiAp041EwhtTONNs7xAa7ucIP+aG
 Ufyb0waVYmiFCX+Lrt/gHzEO2YIpTHIUw3DaMcbR8plHc1gpYtbuZ2ZMScgt2TgL
 f0s7GeMOXtF3sODOd/1mhg127ShWbqUkf8EHDyU3JAWa9aesLr3BoFGtKyAT1rbg
 O9nyJGBGj5ByUNefua0S8+q0kWI2XHdLAQ8CHBlBQx5W1x1Yg2EeV2Kosxjuhfdp
 5K9wFIiPG0F/rtGoAA61dMH9tt87NnY8PgeCyHLFUCoJbhySWr18kwrwrdkimqa5
 9FR8OTa8CHGQ/0bPvw+w8S9FdxiEM6yw4wuMLIy3c+a22+lgIiPvkgqzdsWYULdX
 CrI62jvz5SvoTwK/UEp9PrCnnHbp4crbSp73Vgo1o1bi5eeaaSobRECq+IbN0T3P
 X1H/xn+18mUqmCg4WtDX++14Fe1rMHoe/5CqqE/mp8aCqE9q/3fbAs9INnWJcyrP
 a2O0Wk0jLE76
 =eGQi
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Avoid skipping bus-level PCI power management during system resume for
  PCIe ports left in D0 during the preceding suspend transition on
  platforms where the power states of those ports can change out of the
  PCI layer's control"

* tag 'pm-5.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PCI: PM: Avoid skipping bus-level PM on platforms without ACPI
2019-06-29 19:29:45 +08:00
Andrea Arcangeli
1bf4580e00 fork,memcg: alloc_thread_stack_node needs to set tsk->stack
Commit 5eed6f1dff ("fork,memcg: fix crash in free_thread_stack on
memcg charge fail") corrected two instances, but there was a third
instance of this bug.

Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll
execute free_thread_stack() on a dangling pointer.

Enterprise kernels are compiled with VMAP_STACK=y so this isn't
critical, but custom VMAP_STACK=n builds should have some performance
advantage, with the drawback of risking to fail fork because compaction
didn't succeed.  So as long as VMAP_STACK=n is a supported option it's
worth fixing it upstream.

Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com
Fixes: 9b6f7e163c ("mm: rework memcg kernel stack accounting")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Oleg Nesterov
97abc889ee signal: remove the wrong signal_pending() check in restore_user_sigmask()
This is the minimal fix for stable, I'll send cleanups later.

Commit 854a6ed568 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporary unblocked
by set_user_sigmask() can be delivered even if the caller returns
success or timeout.

Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.

Eric said:

: For clarity.  I don't think this is required by posix, or fundamentally to
: remove the races in select.  It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented.  The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?

Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed568 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org>	[5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Toke Høiland-Jørgensen
0cdbb4b09a devmap: Allow map lookups from eBPF
We don't currently allow lookups into a devmap from eBPF, because the map
lookup returns a pointer directly to the dev->ifindex, which shouldn't be
modifiable from eBPF.

However, being able to do lookups in devmaps is useful to know (e.g.)
whether forwarding to a specific interface is enabled. Currently, programs
work around this by keeping a shadow map of another type which indicates
whether a map index is valid.

Since we now have a flag to make maps read-only from the eBPF side, we can
simply lift the lookup restriction if we make sure this flag is always set.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29 01:31:09 +02:00
Toke Høiland-Jørgensen
d5df2830ca devmap/cpumap: Use flush list instead of bitmap
The socket map uses a linked list instead of a bitmap to keep track of
which entries to flush. Do the same for devmap and cpumap, as this means we
don't have to care about the map index when enqueueing things into the
map (and so we can cache the map lookup).

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29 01:31:08 +02:00
Toke Høiland-Jørgensen
c8af5cd75e xskmap: Move non-standard list manipulation to helper
Add a helper in list.h for the non-standard way of clearing a list that is
used in xskmap. This makes it easier to reuse it in the other map types,
and also makes sure this usage is not forgotten in any list refactorings in
the future.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-29 01:31:08 +02:00
Eiichi Tsukata
46cc0b4442 tracing/snapshot: Resize spare buffer if size changed
Current snapshot implementation swaps two ring_buffers even though their
sizes are different from each other, that can cause an inconsistency
between the contents of buffer_size_kb file and the current buffer size.

For example:

  # cat buffer_size_kb
  7 (expanded: 1408)
  # echo 1 > events/enable
  # grep bytes per_cpu/cpu0/stats
  bytes: 1441020
  # echo 1 > snapshot             // current:1408, spare:1408
  # echo 123 > buffer_size_kb     // current:123,  spare:1408
  # echo 1 > snapshot             // current:1408, spare:123
  # grep bytes per_cpu/cpu0/stats
  bytes: 1443700
  # cat buffer_size_kb
  123                             // != current:1408

And also, a similar per-cpu case hits the following WARNING:

Reproducer:

  # echo 1 > per_cpu/cpu0/snapshot
  # echo 123 > buffer_size_kb
  # echo 1 > per_cpu/cpu0/snapshot

WARNING:

  WARNING: CPU: 0 PID: 1946 at kernel/trace/trace.c:1607 update_max_tr_single.part.0+0x2b8/0x380
  Modules linked in:
  CPU: 0 PID: 1946 Comm: bash Not tainted 5.2.0-rc6 #20
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  RIP: 0010:update_max_tr_single.part.0+0x2b8/0x380
  Code: ff e8 dc da f9 ff 0f 0b e9 88 fe ff ff e8 d0 da f9 ff 44 89 ee bf f5 ff ff ff e8 33 dc f9 ff 41 83 fd f5 74 96 e8 b8 da f9 ff <0f> 0b eb 8d e8 af da f9 ff 0f 0b e9 bf fd ff ff e8 a3 da f9 ff 48
  RSP: 0018:ffff888063e4fca0 EFLAGS: 00010093
  RAX: ffff888066214380 RBX: ffffffff99850fe0 RCX: ffffffff964298a8
  RDX: 0000000000000000 RSI: 00000000fffffff5 RDI: 0000000000000005
  RBP: 1ffff1100c7c9f96 R08: ffff888066214380 R09: ffffed100c7c9f9b
  R10: ffffed100c7c9f9a R11: 0000000000000003 R12: 0000000000000000
  R13: 00000000ffffffea R14: ffff888066214380 R15: ffffffff99851060
  FS:  00007f9f8173c700(0000) GS:ffff88806d000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000714dc0 CR3: 0000000066fa6000 CR4: 00000000000006f0
  Call Trace:
   ? trace_array_printk_buf+0x140/0x140
   ? __mutex_lock_slowpath+0x10/0x10
   tracing_snapshot_write+0x4c8/0x7f0
   ? trace_printk_init_buffers+0x60/0x60
   ? selinux_file_permission+0x3b/0x540
   ? tracer_preempt_off+0x38/0x506
   ? trace_printk_init_buffers+0x60/0x60
   __vfs_write+0x81/0x100
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   ? __ia32_sys_read+0xb0/0xb0
   ? do_syscall_64+0x1f/0x390
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

This patch adds resize_buffer_duplicate_size() to check if there is a
difference between current/spare buffer sizes and resize a spare buffer
if necessary.

Link: http://lkml.kernel.org/r/20190625012910.13109-1-devel@etsukata.com

Cc: stable@vger.kernel.org
Fixes: ad909e21bb ("tracing: Add internal tracing_snapshot() functions")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:58:52 -04:00
Takeshi Misawa
d122ed6288 tracing: Fix memory leak in tracing_err_log_open()
When tracing_err_log_open() calls seq_open(), allocated memory is not freed.

kmemleak report:

unreferenced object 0xffff92c0781d1100 (size 128):
  comm "tail", pid 15116, jiffies 4295163855 (age 22.704s)
  hex dump (first 32 bytes):
    00 f0 08 e5 c0 92 ff ff 00 10 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000000d0687d5>] kmem_cache_alloc+0x11f/0x1e0
    [<000000003e3039a8>] seq_open+0x2f/0x90
    [<000000008dd36b7d>] tracing_err_log_open+0x67/0x140
    [<000000005a431ae2>] do_dentry_open+0x1df/0x3a0
    [<00000000a2910603>] vfs_open+0x2f/0x40
    [<0000000038b0a383>] path_openat+0x2e8/0x1690
    [<00000000fe025bda>] do_filp_open+0x9b/0x110
    [<00000000483a5091>] do_sys_open+0x1ba/0x260
    [<00000000c558b5fd>] __x64_sys_openat+0x20/0x30
    [<000000006881ec07>] do_syscall_64+0x5a/0x130
    [<00000000571c2e94>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by calling seq_release() in tracing_err_log_fops.release().

Link: http://lkml.kernel.org/r/20190628105640.GA1863@DESKTOP

Fixes: 8a062902be ("tracing: Add tracing error log")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:57:23 -04:00
Petr Mladek
d5b844a2cf ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code()
The commit 9f255b632b ("module: Fix livepatch/ftrace module text
permissions race") causes a possible deadlock between register_kprobe()
and ftrace_run_update_code() when ftrace is using stop_machine().

The existing dependency chain (in reverse order) is:

-> #1 (text_mutex){+.+.}:
       validate_chain.isra.21+0xb32/0xd70
       __lock_acquire+0x4b8/0x928
       lock_acquire+0x102/0x230
       __mutex_lock+0x88/0x908
       mutex_lock_nested+0x32/0x40
       register_kprobe+0x254/0x658
       init_kprobes+0x11a/0x168
       do_one_initcall+0x70/0x318
       kernel_init_freeable+0x456/0x508
       kernel_init+0x22/0x150
       ret_from_fork+0x30/0x34
       kernel_thread_starter+0x0/0xc

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x90c/0xde0
       validate_chain.isra.21+0xb32/0xd70
       __lock_acquire+0x4b8/0x928
       lock_acquire+0x102/0x230
       cpus_read_lock+0x62/0xd0
       stop_machine+0x2e/0x60
       arch_ftrace_update_code+0x2e/0x40
       ftrace_run_update_code+0x40/0xa0
       ftrace_startup+0xb2/0x168
       register_ftrace_function+0x64/0x88
       klp_patch_object+0x1a2/0x290
       klp_enable_patch+0x554/0x980
       do_one_initcall+0x70/0x318
       do_init_module+0x6e/0x250
       load_module+0x1782/0x1990
       __s390x_sys_finit_module+0xaa/0xf0
       system_call+0xd8/0x2d0

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(text_mutex);
                               lock(cpu_hotplug_lock.rw_sem);
                               lock(text_mutex);
  lock(cpu_hotplug_lock.rw_sem);

It is similar problem that has been solved by the commit 2d1e38f566
("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
To be on the safe side, text_mutex must become a low level lock taken
after cpu_hotplug_lock.rw_sem.

This can't be achieved easily with the current ftrace design.
For example, arm calls set_all_modules_text_rw() already in
ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
This functions is called:

  + outside stop_machine() from ftrace_run_update_code()
  + without stop_machine() from ftrace_module_enable()

Fortunately, the problematic fix is needed only on x86_64. It is
the only architecture that calls set_all_modules_text_rw()
in ftrace path and supports livepatching at the same time.

Therefore it is enough to move text_mutex handling from the generic
kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:

   ftrace_arch_code_modify_prepare()
   ftrace_arch_code_modify_post_process()

This patch basically reverts the ftrace part of the problematic
commit 9f255b632b ("module: Fix livepatch/ftrace module
text permissions race"). And provides x86_64 specific-fix.

Some refactoring of the ftrace code will be needed when livepatching
is implemented for arm or nds32. These architectures call
set_all_modules_text_rw() and use stop_machine() at the same time.

Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com

Fixes: 9f255b632b ("module: Fix livepatch/ftrace module text permissions race")
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
[
  As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
  ftrace_run_update_code() as it is a void function.
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-28 14:20:25 -04:00
Ingo Molnar
83086d654d Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull rcu/next + tools/memory-model changes from Paul E. McKenney:

 - RCU flavor consolidation cleanups and optmizations
 - Documentation updates
 - Miscellaneous fixes
 - SRCU updates
 - RCU-sync flavor consolidation
 - Torture-test updates
 - Linux-kernel memory-consistency-model updates, most notably the addition of plain C-language accesses

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-28 19:46:47 +02:00
Christian Brauner
32fcb426ec
pid: add pidfd_open()
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:

int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);

With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a problem for
Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfds for PID-based processes we enable them to adopt this api.

In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.

Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
2019-06-28 12:17:55 +02:00
Joel Fernandes (Google)
b53b0b9d9a
pidfd: add polling support
This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be option in this case because the task can
   be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-06-28 12:17:55 +02:00
Fuqian Huang
2f02a7ecd5 kernel: power: swap: use kzalloc() instead of kmalloc() followed by memset()
Use zeroing allocator instead of using allocator
followed with memset with 0

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-28 10:20:39 +02:00
David S. Miller
d96ff269a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.

In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.

The aquantia driver conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27 21:06:39 -07:00
Stanislav Fomichev
0d01da6afc bpf: implement getsockopt and setsockopt hooks
Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.

BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
passing them down to the kernel or bypass kernel completely.
BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
kernel returns.
Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.

The buffer memory is pre-allocated (because I don't think there is
a precedent for working with __user memory from bpf). This might be
slow to do for each {s,g}etsockopt call, that's why I've added
__cgroup_bpf_prog_array_is_empty that exits early if there is nothing
attached to a cgroup. Note, however, that there is a race between
__cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
program layout might have changed; this should not be a problem
because in general there is a race between multiple calls to
{s,g}etsocktop and user adding/removing bpf progs from a cgroup.

The return code of the BPF program is handled as follows:
* 0: EPERM
* 1: success, continue with next BPF program in the cgroup chain

v9:
* allow overwriting setsockopt arguments (Alexei Starovoitov):
  * use set_fs (same as kernel_setsockopt)
  * buffer is always kzalloc'd (no small on-stack buffer)

v8:
* use s32 for optlen (Andrii Nakryiko)

v7:
* return only 0 or 1 (Alexei Starovoitov)
* always run all progs (Alexei Starovoitov)
* use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
  (decided to use optval=-1 instead, optval=0 might be a valid input)
* call getsockopt hook after kernel handlers (Alexei Starovoitov)

v6:
* rework cgroup chaining; stop as soon as bpf program returns
  0 or 2; see patch with the documentation for the details
* drop Andrii's and Martin's Acked-by (not sure they are comfortable
  with the new state of things)

v5:
* skip copy_to_user() and put_user() when ret == 0 (Martin Lau)

v4:
* don't export bpf_sk_fullsock helper (Martin Lau)
* size != sizeof(__u64) for uapi pointers (Martin Lau)
* offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)

v3:
* typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
* reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
  Nakryiko)
* use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
* use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
* new CG_SOCKOPT_ACCESS macro to wrap repeated parts

v2:
* moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
* aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
* bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
* added [0,2] return code check to verifier (Martin Lau)
* dropped unused buf[64] from the stack (Martin Lau)
* use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
* dropped bpf_target_off from ctx rewrites (Martin Lau)
* use return code for kernel bypass (Martin Lau & Andrii Nakryiko)

Cc: Andrii Nakryiko <andriin@fb.com>
Cc: Martin Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-27 15:25:16 -07:00
Mauro Carvalho Chehab
516337048f hrtimer: Use a bullet for the returns bullet list
That gets rid of this warning:

   ./kernel/time/hrtimer.c:1119: WARNING: Block quote ends without a blank line; unexpected unindent.

and displays nicely both at the source code and at the produced
documentation.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Link: https://lkml.kernel.org/r/74ddad7dac331b4e5ce4a90e15c8a49e3a16d2ac.1561372382.git.mchehab+samsung@kernel.org
2019-06-27 23:30:04 +02:00
Thomas Gleixner
be69d00d97 workqueue: Remove GPF argument from alloc_workqueue_attrs()
All callers use GFP_KERNEL. No point in having that argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27 14:12:19 -07:00
Thomas Gleixner
2c9858ecbe workqueue: Make alloc/apply/free_workqueue_attrs() static
None of those functions have any users outside of workqueue.c. Confine
them.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-27 14:12:15 -07:00
Roman Gushchin
e5c891a349 bpf: fix cgroup bpf release synchronization
Since commit 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf
from cgroup itself"), cgroup_bpf release occurs asynchronously
(from a worker context), and before the release of the cgroup itself.

This introduced a previously non-existing race between the release
and update paths. E.g. if a leaf's cgroup_bpf is released and a new
bpf program is attached to the one of ancestor cgroups at the same
time. The race may result in double-free and other memory corruptions.

To fix the problem, let's protect the body of cgroup_bpf_release()
with cgroup_mutex, as it was effectively previously, when all this
code was called from the cgroup release path with cgroup mutex held.

Also let's skip cgroups, which have no chances to invoke a bpf
program, on the update path. If the cgroup bpf refcnt reached 0,
it means that the cgroup is offline (no attached processes), and
there are no associated sockets left. It means there is no point
in updating effective progs array! And it can lead to a leak,
if it happens after the release. So, let's skip such cgroups.

Big thanks for Tejun Heo for discovering and debugging of this problem!

Fixes: 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27 22:51:58 +02:00
Al Viro
6fd2fe494b
copy_process(): don't use ksys_close() on cleanups
anon_inode_getfd() should be used *ONLY* in situations when we are
guaranteed to be past the last failure point (including copying the
descriptor number to userland, at that).  And ksys_close() should
not be used for cleanups at all.

anon_inode_getfile() is there for all nontrivial cases like that.
Just use that...

Fixes: b3e5838252 ("clone: add CLONE_PIDFD")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-06-27 12:24:03 +02:00
Eiichi Tsukata
33d4a5a7a5 cpu/hotplug: Fix out-of-bounds read when setting fail state
Setting invalid value to /sys/devices/system/cpu/cpuX/hotplug/fail
can control `struct cpuhp_step *sp` address, results in the following
global-out-of-bounds read.

Reproducer:

  # echo -2 > /sys/devices/system/cpu/cpu0/hotplug/fail

KASAN report:

  BUG: KASAN: global-out-of-bounds in write_cpuhp_fail+0x2cd/0x2e0
  Read of size 8 at addr ffffffff89734438 by task bash/1941

  CPU: 0 PID: 1941 Comm: bash Not tainted 5.2.0-rc6+ #31
  Call Trace:
   write_cpuhp_fail+0x2cd/0x2e0
   dev_attr_store+0x58/0x80
   sysfs_kf_write+0x13d/0x1a0
   kernfs_fop_write+0x2bc/0x460
   vfs_write+0x1e1/0x560
   ksys_write+0x126/0x250
   do_syscall_64+0xc1/0x390
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7f05e4f4c970

  The buggy address belongs to the variable:
   cpu_hotplug_lock+0x98/0xa0

  Memory state around the buggy address:
   ffffffff89734300: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
   ffffffff89734380: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
  >ffffffff89734400: 00 00 00 00 fa fa fa fa 00 00 00 00 fa fa fa fa
                                          ^
   ffffffff89734480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffffffff89734500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Add a sanity check for the value written from user space.

Fixes: 1db49484f2 ("smp/hotplug: Hotplug state fail injection")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Link: https://lkml.kernel.org/r/20190627024732.31672-1-devel@etsukata.com
2019-06-27 09:34:04 +02:00
Al Viro
02e5ad9738 perf_event_get(): don't bother with fget_raw()
... since we immediately follow that with check that it *is* an
opened perf file, with O_PATH ones ending with with the same
-EBADF we'd get for descriptor that isn't opened at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-26 20:43:53 -04:00
Rafael J. Wysocki
471a739a47 PCI: PM: Avoid skipping bus-level PM on platforms without ACPI
There are platforms that do not call pm_set_suspend_via_firmware(),
so pm_suspend_via_firmware() returns 'false' on them, but the power
states of PCI devices (PCIe ports in particular) are changed as a
result of powering down core platform components during system-wide
suspend.  Thus the pm_suspend_via_firmware() checks in
pci_pm_suspend_noirq() and pci_pm_resume_noirq() introduced by
commit 3e26c5feed ("PCI: PM: Skip devices in D0 for suspend-to-
idle") are not sufficient to determine that devices left in D0
during suspend will remain in D0 during resume and so the bus-level
power management can be skipped for them.

For this reason, introduce a new global suspend flag,
PM_SUSPEND_FLAG_NO_PLATFORM, set it for suspend-to-idle only
and replace the pm_suspend_via_firmware() checks mentioned above
with checks against this flag.

Fixes: 3e26c5feed ("PCI: PM: Skip devices in D0 for suspend-to-idle")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-06-26 23:51:56 +02:00
David Howells
0f44e4d976 keys: Move the user and user-session keyrings to the user_namespace
Move the user and user-session keyrings to the user_namespace struct rather
than pinning them from the user_struct struct.  This prevents these
keyrings from propagating across user-namespaces boundaries with regard to
the KEY_SPEC_* flags, thereby making them more useful in a containerised
environment.

The issue is that a single user_struct may be represent UIDs in several
different namespaces.

The way the patch does this is by attaching a 'register keyring' in each
user_namespace and then sticking the user and user-session keyrings into
that.  It can then be searched to retrieve them.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jann Horn <jannh@google.com>
2019-06-26 21:02:32 +01:00
David Howells
b206f281d0 keys: Namespace keyring names
Keyring names are held in a single global list that any process can pick
from by means of keyctl_join_session_keyring (provided the keyring grants
Search permission).  This isn't very container friendly, however.

Make the following changes:

 (1) Make default session, process and thread keyring names begin with a
     '.' instead of '_'.

 (2) Keyrings whose names begin with a '.' aren't added to the list.  Such
     keyrings are system specials.

 (3) Replace the global list with per-user_namespace lists.  A keyring adds
     its name to the list for the user_namespace that it is currently in.

 (4) When a user_namespace is deleted, it just removes itself from the
     keyring name list.

The global keyring_name_lock is retained for accessing the name lists.
This allows (4) to work.

This can be tested by:

	# keyctl newring foo @s
	995906392
	# unshare -U
	$ keyctl show
	...
	 995906392 --alswrv  65534 65534   \_ keyring: foo
	...
	$ keyctl session foo
	Joined session keyring: 935622349

As can be seen, a new session keyring was created.

The capability bit KEYCTL_CAPS1_NS_KEYRING_NAME is set if the kernel is
employing this feature.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
2019-06-26 21:02:32 +01:00
Yang Yingliang
93651f80dc modules: fix compile error if don't have strict module rwx
If CONFIG_ARCH_HAS_STRICT_MODULE_RWX is not defined,
we need stub for module_enable_nx() and module_enable_x().

If CONFIG_ARCH_HAS_STRICT_MODULE_RWX is defined, but
CONFIG_STRICT_MODULE_RWX is disabled, we need stub for
module_enable_nx.

Move frob_text() outside of the CONFIG_STRICT_MODULE_RWX,
because it is needed anyway.

Fixes: 2eef1399a8 ("modules: fix BUG when load module with rodata=n")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-26 19:27:59 +02:00
Geert Uytterhoeven
1bf7272028 cpu/speculation: Warn on unsupported mitigations= parameter
Currently, if the user specifies an unsupported mitigation strategy on the
kernel command line, it will be ignored silently.  The code will fall back
to the default strategy, possibly leaving the system more vulnerable than
expected.

This may happen due to e.g. a simple typo, or, for a stable kernel release,
because not all mitigation strategies have been backported.

Inform the user by printing a message.

Fixes: 98af845294 ("cpu/speculation: Add 'mitigations=' cmdline option")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20190516070935.22546-1-geert@linux-m68k.org
2019-06-26 16:56:21 +02:00
Jiong Wang
75672dda27 bpf: fix BPF_ALU32 | BPF_ARSH on BE arches
Yauheni reported the following code do not work correctly on BE arches:

       ALU_ARSH_X:
               DST = (u64) (u32) ((*(s32 *) &DST) >> SRC);
               CONT;
       ALU_ARSH_K:
               DST = (u64) (u32) ((*(s32 *) &DST) >> IMM);
               CONT;

and are causing failure of test_verifier test 'arsh32 on imm 2' on BE
arches.

The code is taking address and interpreting memory directly, so is not
endianness neutral. We should instead perform standard C type casting on
the variable. A u64 to s32 conversion will drop the high 32-bit and reserve
the low 32-bit as signed integer, this is all we want.

Fixes: 2dc6b100f9 ("bpf: interpreter support BPF_ALU | BPF_ARSH")
Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 14:48:20 +02:00
Yonghong Song
9db1ff0a41 bpf: fix compiler warning with CONFIG_MODULES=n
With CONFIG_MODULES=n, the following compiler warning occurs:
  /data/users/yhs/work/net-next/kernel/trace/bpf_trace.c:605:13: warning:
      ‘do_bpf_send_signal’ defined but not used [-Wunused-function]
  static void do_bpf_send_signal(struct irq_work *entry)

The __init function send_signal_irq_work_init(), which calls
do_bpf_send_signal(), is defined under CONFIG_MODULES. Hence,
when CONFIG_MODULES=n, nobody calls static function do_bpf_send_signal(),
hence the warning.

The init function send_signal_irq_work_init() should work without
CONFIG_MODULES. Moving it out of CONFIG_MODULES
code section fixed the compiler warning, and also make bpf_send_signal()
helper work without CONFIG_MODULES.

Fixes: 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
Reported-By: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-26 14:44:07 +02:00
Christoph Hellwig
d98849aff8 dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code
DMA_ATTR_NO_KERNEL_MAPPING is generally implemented by allocating
normal cacheable pages or CMA memory, and then returning the page
pointer as the opaque handle.  Lift that code from the xtensa and
generic dma remapping implementations into the generic dma-direct
code so that we don't even call arch_dma_alloc for these allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-25 14:28:05 +02:00
Christoph Hellwig
c2f2124e0d dma-direct: handle DMA_ATTR_NON_CONSISTENT in common code
Only call into arch_dma_alloc if we require an uncached mapping,
and remove the parisc code manually doing normal cached
DMA_ATTR_NON_CONSISTENT allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
2019-06-25 14:27:58 +02:00
Toshiaki Makita
e7d4798960 xdp: Add tracepoint for bulk XDP_TX
This is introduced for admins to check what is happening on XDP_TX when
bulk XDP_TX is in use, which will be first introduced in veth in next
commit.

v3:
- Add act field to be in line with other XDP tracepoints.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-25 14:26:50 +02:00
Kobe Wu
9156e54576 locking/lockdep: increase size of counters for lockdep statistics
When system has been running for a long time, signed integer
counters are not enough for some lockdep statistics. Using
unsigned long counters can satisfy the requirement. Besides,
most of lockdep statistics are unsigned. It is better to use
unsigned int instead of int.

Remove unused variables.
- max_recursion_depth
- nr_cyclic_check_recursions
- nr_find_usage_forwards_recursions
- nr_find_usage_backwards_recursions

Signed-off-by: Kobe Wu <kobe-cp.wu@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Eason Lin <eason-yh.lin@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/1561365348-16050-1-git-send-email-kobe-cp.wu@mediatek.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-25 10:17:08 +02:00
Arnd Bergmann
886532aee3 locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
The last cleanup patch triggered another issue, as now another function
should be moved into the same section:

 kernel/locking/lockdep.c:3580:12: error: 'mark_lock' defined but not used [-Werror=unused-function]
  static int mark_lock(struct task_struct *curr, struct held_lock *this,

Move mark_lock() into the same #ifdef section as its only caller, and
remove the now-unused mark_lock_irq() stub helper.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Fixes: 0d2cc3b345 ("locking/lockdep: Move valid_state() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
Link: https://lkml.kernel.org/r/20190617124718.1232976-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-25 10:17:07 +02:00
Christoph Hellwig
4b85faed21 dma-mapping: add a dma_alloc_need_uncached helper
Check if we need to allocate uncached memory for a device given the
allocation flags.  Switch over the uncached segment check to this helper
to deal with architectures that do not support the dma_cache_sync
operation and thus should not returned cacheable memory for
DMA_ATTR_NON_CONSISTENT allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-25 08:14:43 +02:00
Christoph Hellwig
4a54d16f61 dma-mapping: truncate dma masks to what dma_addr_t can hold
The dma masks in struct device are always 64-bits wide.  But for builds
using a 32-bit dma_addr_t we need to ensure we don't store an
unsupportable value.  Before Linux 5.0 this was handled at least by
the ARM dma mapping code by never allowing to set a larger dma_mask,
but these days we allow the driver to just set the largest supported
value and never fall back to a smaller one.  Ensure this always works
by truncating the value.

Fixes: 9eb9e96e97 ("Documentation/DMA-API-HOWTO: update dma_mask sections")
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-25 07:54:06 +02:00
Ian Rogers
fd7d55172d perf/cgroups: Don't rotate events for cgroups unnecessarily
Currently perf_rotate_context assumes that if the context's nr_events !=
nr_active a rotation is necessary for perf event multiplexing. With
cgroups, nr_events is the total count of events for all cgroups and
nr_active will not include events in a cgroup other than the current
task's. This makes rotation appear necessary for cgroups when it is not.

Add a perf_event_context flag that is set when rotation is necessary.
Clear the flag during sched_out and set it when a flexible sched_in
fails due to resources.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20190601082722.44543-1-irogers@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:30:04 +02:00
Ingo Molnar
b9271f0c65 Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into perf/core, to refresh branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:25:52 +02:00
Patrick Bellasi
af24bde8df sched/uclamp: Add uclamp support to energy_compute()
The Energy Aware Scheduler (EAS) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:

 a) an (active) power consumption defined for each CPU frequency
 b) an estimation of which frequency will be used on each CPU
 c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c).

A CPU is expected to run:

 - on an higher than required frequency, but for a shorter time, in case
   its estimated utilization will be smaller than the minimum utilization
   enforced by uclamp
 - on a smaller than required frequency, but for a longer time, in case
   its estimated utilization is bigger than the maximum utilization
   enforced by uclamp

While compute_energy() already accounts clamping effects on busy time,
the clamping effects on frequency selection are currently ignored.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct* which allows EAS to evaluate the impact on clamp values of
a task being eventually queued in a CPU. Clamp values are applied to the
RT+CFS utilization only when a FREQUENCY_UTIL is required by
compute_energy().

Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
computation of the cpu_util signal implies that we are more likely to
estimate the highest OPP when a RT task is running in another CPU of
the same performance domain. This can have an impact on energy
estimation but:

 - it's not easy to say which approach is better, since it depends on
   the use case
 - the original approach could still be obtained by setting a smaller
   task-specific util_min whenever required

Since we are at that:

 - rename schedutil_freq_util() into schedutil_cpu_util(),
   since it's not only used for frequency selection.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-12-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:49 +02:00
Patrick Bellasi
9d20ad7dfc sched/uclamp: Add uclamp_util_with()
So far uclamp_util() allows to clamp a specified utilization considering
the clamp values requested by RUNNABLE tasks in a CPU. For the Energy
Aware Scheduler (EAS) it is interesting to test how clamp values will
change when a task is becoming RUNNABLE on a given CPU.
For example, EAS is interested in comparing the energy impact of
different scheduling decisions and the clamp values can play a role on
that.

Add uclamp_util_with() which allows to clamp a given utilization by
considering the possible impact on CPU clamp values of a specified task.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-11-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:48 +02:00
Patrick Bellasi
982d9cdc22 sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
Each time a frequency update is required via schedutil, a frequency is
selected to (possibly) satisfy the utilization reported by each
scheduling class and irqs. However, when utilization clamping is in use,
the frequency selection should consider userspace utilization clamping
hints.  This will allow, for example, to:

 - boost tasks which are directly affecting the user experience
   by running them at least at a minimum "requested" frequency

 - cap low priority tasks not directly affecting the user experience
   by running them only up to a maximum "allowed" frequency

These constraints are meant to support a per-task based tuning of the
frequency selection thus supporting a fine grained definition of
performance boosting vs energy saving strategies in kernel space.

Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
within the boundaries defined by their aggregated utilization clamp
constraints.

Do that by considering the max(min_util, max_util) to give boosted tasks
the performance they need even when they happen to be co-scheduled with
other capped tasks.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:48 +02:00
Patrick Bellasi
1a00d99997 sched/uclamp: Set default clamps for RT tasks
By default FAIR tasks start without clamps, i.e. neither boosted nor
capped, and they run at the best frequency matching their utilization
demand.  This default behavior does not fit RT tasks which instead are
expected to run at the maximum available frequency, if not otherwise
required by explicitly capping them.

Enforce the correct behavior for RT tasks by setting util_min to max
whenever:

 1. the task is switched to the RT class and it does not already have a
    user-defined clamp value assigned.

 2. an RT task is forked from a parent with RESET_ON_FORK set.

NOTE: utilization clamp values are cross scheduling class attributes and
thus they are never changed/reset once a value has been explicitly
defined from user-space.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-9-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:47 +02:00
Patrick Bellasi
a87498ace5 sched/uclamp: Reset uclamp values on RESET_ON_FORK
A forked tasks gets the same clamp values of its parent however, when
the RESET_ON_FORK flag is set on parent, e.g. via:

   sys_sched_setattr()
      sched_setattr()
         __sched_setscheduler(attr::SCHED_FLAG_RESET_ON_FORK)

the new forked task is expected to start with all attributes reset to
default values.

Do that for utilization clamp values too by checking the reset request
from the existing uclamp_fork() call which already provides the required
initialization for other uclamp related bits.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-8-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:47 +02:00
Patrick Bellasi
a509a7cd79 sched/uclamp: Extend sched_setattr() to support utilization clamping
The SCHED_DEADLINE scheduling class provides an advanced and formal
model to define tasks requirements that can translate into proper
decisions for both task placements and frequencies selections. Other
classes have a more simplified model based on the POSIX concept of
priorities.

Such a simple priority based model however does not allow to exploit
most advanced features of the Linux scheduler like, for example, driving
frequencies selection via the schedutil cpufreq governor. However, also
for non SCHED_DEADLINE tasks, it's still interesting to define tasks
properties to support scheduler decisions.

Utilization clamping exposes to user-space a new set of per-task
attributes the scheduler can use as hints about the expected/required
utilization for a task. This allows to implement a "proactive" per-task
frequency control policy, a more advanced policy than the current one
based just on "passive" measured task utilization. For example, it's
possible to boost interactive tasks (e.g. to get better performance) or
cap background tasks (e.g. to be more energy/thermal efficient).

Introduce a new API to set utilization clamping values for a specified
task by extending sched_setattr(), a syscall which already allows to
define task specific properties for different scheduling classes. A new
pair of attributes allows to specify a minimum and maximum utilization
the scheduler can consider for a task.

Do that by validating the required clamp values before and then applying
the required changes using _the_ same pattern already in use for
__setscheduler(). This ensures that the task is re-enqueued with the new
clamp values.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:46 +02:00
Patrick Bellasi
1d6362fa0c sched/core: Allow sched_setattr() to use the current policy
The sched_setattr() syscall mandates that a policy is always specified.
This requires to always know which policy a task will have when
attributes are configured and this makes it impossible to add more
generic task attributes valid across different scheduling policies.
Reading the policy before setting generic tasks attributes is racy since
we cannot be sure it is not changed concurrently.

Introduce the required support to change generic task attributes without
affecting the current task policy. This is done by adding an attribute flag
(SCHED_FLAG_KEEP_POLICY) to enforce the usage of the current policy.

Add support for the SETPARAM_POLICY policy, which is already used by the
sched_setparam() POSIX syscall, to the sched_setattr() non-POSIX
syscall.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-6-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:46 +02:00
Patrick Bellasi
e8f14172c6 sched/uclamp: Add system default clamps
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.

Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.

Add a privileged interface to define a system default configuration via:

  /proc/sys/kernel/sched_uclamp_util_{min,max}

which works as an unconditional clamp range restriction for all tasks.

With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.

Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.

The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.

An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:45 +02:00
Patrick Bellasi
e496187da7 sched/uclamp: Enforce last task's UCLAMP_MAX
When a task sleeps it removes its max utilization clamp from its CPU.
However, the blocked utilization on that CPU can be higher than the max
clamp value enforced while the task was running. This allows undesired
CPU frequency increases while a CPU is idle, for example, when another
CPU on the same frequency domain triggers a frequency update, since
schedutil can now see the full not clamped blocked utilization of the
idle CPU.

Fix this by using:

  uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
    uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)

to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
condition.

Don't track any minimum utilization clamps since an idle CPU never
requires a minimum frequency. The decay of the blocked utilization is
good enough to reduce the CPU frequency.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-4-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:45 +02:00
Patrick Bellasi
60daf9c194 sched/uclamp: Add bucket local max tracking
Because of bucketization, different task-specific clamp values are
tracked in the same bucket.  For example, with 20% bucket size and
assuming to have:

  Task1: util_min=25%
  Task2: util_min=35%

both tasks will be refcounted in the [20..39]% bucket and always boosted
only up to 20% thus implementing a simple floor aggregation normally
used in histograms.

In systems with only few and well-defined clamp values, it would be
useful to track the exact clamp value required by a task whenever
possible. For example, if a system requires only 23% and 47% boost
values then it's possible to track the exact boost required by each
task using only 3 buckets of ~33% size each.

Introduce a mechanism to max aggregate the requested clamp values of
RUNNABLE tasks in the same bucket. Keep it simple by resetting the
bucket value to its base value only when a bucket becomes inactive.
Allow a limited and controlled overboosting margin for tasks recounted
in the same bucket.

In systems where the boost values are not known in advance, it is still
possible to control the maximum acceptable overboosting margin by tuning
the number of clamp groups. For example, 20 groups ensure a 5% maximum
overboost.

Remove the rq bucket initialization code since a correct bucket value
is now computed when a task is refcounted into a CPU's rq.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:44 +02:00
Patrick Bellasi
69842cba9a sched/uclamp: Add CPU's clamp buckets refcounting
Utilization clamping allows to clamp the CPU's utilization within a
[util_min, util_max] range, depending on the set of RUNNABLE tasks on
that CPU. Each task references two "clamp buckets" defining its minimum
and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
bucket is active if there is at least one RUNNABLE tasks enqueued on
that CPU and refcounting that bucket.

When a task is {en,de}queued {on,from} a rq, the set of active clamp
buckets on that CPU can change. If the set of active clamp buckets
changes for a CPU a new "aggregated" clamp value is computed for that
CPU. This is because each clamp bucket enforces a different utilization
clamp value.

Clamp values are always MAX aggregated for both util_min and util_max.
This ensures that no task can affect the performance of other
co-scheduled tasks which are more boosted (i.e. with higher util_min
clamp) or less capped (i.e. with higher util_max clamp).

A task has:
   task_struct::uclamp[clamp_id]::bucket_id
to track the "bucket index" of the CPU's clamp bucket it refcounts while
enqueued, for each clamp index (clamp_id).

A runqueue has:
   rq::uclamp[clamp_id]::bucket[bucket_id].tasks
to track how many RUNNABLE tasks on that CPU refcount each
clamp bucket (bucket_id) of a clamp index (clamp_id).
It also has a:
   rq::uclamp[clamp_id]::bucket[bucket_id].value
to track the clamp value of each clamp bucket (bucket_id) of a clamp
index (clamp_id).

The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
needed to find a new MAX aggregated clamp value for a clamp_id. This
operation is required only when it's dequeued the last task of a clamp
bucket tracking the current MAX aggregated clamp value. In this case,
the CPU is either entering IDLE or going to schedule a less boosted or
more clamped task.
The expected number of different clamp values configured at build time
is small enough to fit the full unordered array into a single cache
line, for configurations of up to 7 buckets.

Add to struct rq the basic data structures required to refcount the
number of RUNNABLE tasks for each clamp bucket. Add also the max
aggregation required to update the rq's clamp value at each
enqueue/dequeue event.

Use a simple linear mapping of clamp values into clamp buckets.
Pre-compute and cache bucket_id to avoid integer divisions at
enqueue/dequeue time.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:44 +02:00
Dietmar Eggemann
a3df067974 sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
The term 'weighted' is not needed since there is no 'unweighted' load.
Instead use the term 'runnable' to distinguish 'runnable' load
(avg.runnable_load_avg) used in load balance from load (avg.load_avg)
which is the sum of 'runnable' and 'blocked' load.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/57f27a7f-2775-d832-e965-0f4d51bb1954@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:43 +02:00
Qais Yousef
a056a5bed7 sched/debug: Export the newly added tracepoints
So that external modules can hook into them and extract the info they
need. Since these new tracepoints have no events associated with them
exporting these tracepoints make them useful for external modules to
perform testing and debugging. There's no other way otherwise to access
them.

BPF doesn't have infrastructure to access these bare tracepoints either.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-7-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:43 +02:00
Qais Yousef
f9f240f96e sched/debug: Add sched_overutilized tracepoint
The new tracepoint allows us to track the changes in overutilized
status.

Overutilized status is associated with EAS. It indicates that the system
is in high performance state. EAS is disabled when the system is in this
state since there's not much energy savings while high performance tasks
are pushing the system to the limit and it's better to default to the
spreading behavior of the scheduler.

This tracepoint helps understanding and debugging the conditions under
which this happens.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-6-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:42 +02:00
Qais Yousef
8de6242cca sched/debug: Add new tracepoint to track PELT at se level
The new tracepoint allows tracking PELT signals at sched_entity level.
Which is supported in CFS tasks and taskgroups only.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-5-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:42 +02:00
Qais Yousef
ba19f51fcb sched/debug: Add new tracepoints to track PELT at rq level
The new tracepoints allow tracking PELT signals at rq level for all
scheduling classes + irq.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-4-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:41 +02:00
Qais Yousef
3c93a0c04d sched/debug: Add a new sched_trace_*() helper functions
The new functions allow modules to access internal data structures of
unexported struct cfs_rq and struct rq to extract important information
from the tracepoints to be introduced in later patches.

While at it fix alphabetical order of struct declarations in sched.h

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-3-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:41 +02:00
Qais Yousef
9ba5090aec sched/autogroup: Make autogroup_path() always available
Remove the #ifdef CONFIG_SCHED_DEBUG.

Some of the tracepoints to be introduced in later patches need to access
this function. Hence make it always available since the tracepoints are
not protected by CONFIG_SCHED_DEBUG.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
Link: https://lkml.kernel.org/r/20190604111459.2862-2-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:40 +02:00
Pavel Begunkov
016190a4b5 sched/wait: Deduplicate code with do-while
Statements in the loop's body and before it are identical.
Use do-while to not repeat it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/43ffea6ee2152b90dedf962eac851609e4197218.1560256112.git.asml.silence@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:40 +02:00
Vincent Guittot
8ec59c0f5f sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
unused since commit:

  765d0af19f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")

Remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gregkh@linuxfoundation.org
Cc: linux@armlinux.org.uk
Cc: quentin.perret@arm.com
Cc: rafael@kernel.org
Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:23:39 +02:00
Ingo Molnar
d2abae71eb Linux 5.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Os1seHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGtx4H/j6i482XzcGFKTBm
 A7mBoQpy+kLtoUov4EtBAR62OuwI8rsahW9di37QKndPoQrczWaKBmr3De6LCdPe
 v3pl3O6wBbvH5ru+qBPFX9PdNbDvimEChh7LHxmMxNQq3M+AjZAZVJyfpoiFnx35
 Fbge+LZaH/k8HMwZmkMr5t9Mpkip715qKg2o9Bua6dkH0AqlcpLlC8d9a+HIVw/z
 aAsyGSU8jRwhoAOJsE9bJf0acQ/pZSqmFp0rDKqeFTSDMsbDRKLGq/dgv4nW0RiW
 s7xqsjb/rdcvirRj3rv9+lcTVkOtEqwk0PVdL9WOf7g4iYrb3SOIZh8ZyViaDSeH
 VTS5zps=
 =huBY
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc6' into sched/core, to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:53 +02:00
Kan Liang
e321d02db8 perf/x86: Disable extended registers for non-supported PMUs
The perf fuzzer caused Skylake machine to crash:

[ 9680.085831] Call Trace:
[ 9680.088301]  <IRQ>
[ 9680.090363]  perf_output_sample_regs+0x43/0xa0
[ 9680.094928]  perf_output_sample+0x3aa/0x7a0
[ 9680.099181]  perf_event_output_forward+0x53/0x80
[ 9680.103917]  __perf_event_overflow+0x52/0xf0
[ 9680.108266]  ? perf_trace_run_bpf_submit+0xc0/0xc0
[ 9680.113108]  perf_swevent_hrtimer+0xe2/0x150
[ 9680.117475]  ? check_preempt_wakeup+0x181/0x230
[ 9680.122091]  ? check_preempt_curr+0x62/0x90
[ 9680.126361]  ? ttwu_do_wakeup+0x19/0x140
[ 9680.130355]  ? try_to_wake_up+0x54/0x460
[ 9680.134366]  ? reweight_entity+0x15b/0x1a0
[ 9680.138559]  ? __queue_work+0x103/0x3f0
[ 9680.142472]  ? update_dl_rq_load_avg+0x1cd/0x270
[ 9680.147194]  ? timerqueue_del+0x1e/0x40
[ 9680.151092]  ? __remove_hrtimer+0x35/0x70
[ 9680.155191]  __hrtimer_run_queues+0x100/0x280
[ 9680.159658]  hrtimer_interrupt+0x100/0x220
[ 9680.163835]  smp_apic_timer_interrupt+0x6a/0x140
[ 9680.168555]  apic_timer_interrupt+0xf/0x20
[ 9680.172756]  </IRQ>

The XMM registers can only be collected by PEBS hardware events on the
platforms with PEBS baseline support, e.g. Icelake, not software/probe
events.

Add capabilities flag PERF_PMU_CAP_EXTENDED_REGS to indicate the PMU
which support extended registers. For X86, the extended registers are
XMM registers.

Add has_extended_regs() to check if extended registers are applied.

The generic code define the mask of extended registers as 0 if arch
headers haven't overridden it.

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 878068ea27 ("perf/x86: Support outputting XMM registers")
Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:23 +02:00
Ravi Bangoria
913a90bc5a perf/ioctl: Add check for the sample_period value
perf_event_open() limits the sample_period to 63 bits. See:

  0819b2e30c ("perf: Limit perf_event_attr::sample_period to 63 bits")

Make ioctl() consistent with it.

Also on PowerPC, negative sample_period could cause a recursive
PMIs leading to a hang (reported when running perf-fuzzer).

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: maddy@linux.vnet.ibm.com
Cc: mpe@ellerman.id.au
Fixes: 0819b2e30c ("perf: Limit perf_event_attr::sample_period to 63 bits")
Link: https://lkml.kernel.org/r/20190604042953.914-1-ravi.bangoria@linux.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-24 19:19:22 +02:00
Stanislav Fomichev
e4f0712021 bpf: fix NULL deref in btf_type_is_resolve_source_only
Commit 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
added invocations of btf_type_is_resolve_source_only before
btf_type_nosize_or_null which checks for the NULL pointer.
Swap the order of btf_type_nosize_or_null and
btf_type_is_resolve_source_only to make sure the do the NULL pointer
check first.

Fixes: 1dc9285184 ("bpf: kernel side support for BTF Var and DataSec")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-24 15:53:19 +02:00
Dmitry V. Levin
9014143bab
fork: don't check parent_tidptr with CLONE_PIDFD
Give userspace a cheap and reliable way to tell whether CLONE_PIDFD is
supported by the kernel or not. The easiest way is to pass an invalid
file descriptor value in parent_tidptr, perform the syscall and verify
that parent_tidptr has been changed to a valid file descriptor value.

CLONE_PIDFD uses parent_tidptr to return pidfds. CLONE_PARENT_SETTID
will use parent_tidptr to return the tid of the parent. The two flags
cannot be used together. Old kernels that only support
CLONE_PARENT_SETTID will not verify the value pointed to by
parent_tidptr. This behavior is unchanged even with the introduction of
CLONE_PIDFD.
However, if CLONE_PIDFD is specified the kernel will currently check the
value pointed to by parent_tidptr before placing the pidfd in the memory
pointed to. EINVAL will be returned if the value in parent_tidptr is not
0.

If CLONE_PIDFD is supported and fd 0 is closed, then the returned pidfd
can and likely will be 0 and parent_tidptr will be unchanged. This means
userspace must either check CLONE_PIDFD support beforehand or check that
fd 0 is not closed when invoking CLONE_PIDFD.

The check for pidfd == 0 was introduced during the v5.2 merge window by
commit b3e5838252 ("clone: add CLONE_PIDFD") to ensure that
CLONE_PIDFD could be potentially extended by passing in flags through
the return argument.

However, that extension would look horrible, and with the upcoming
introduction of the clone3 syscall in v5.3 there is no need to extend
legacy clone syscall this way. (Even if it would need to be extended,
CLONE_DETACHED can be reused with CLONE_PIDFD.)

So remove the pidfd == 0 check. Userspace that needs to be portable to
kernels without CLONE_PIDFD support can then be advised to initialize
pidfd to -1 and check the pidfd value returned by CLONE_PIDFD.

Fixes: b3e5838252 ("clone: add CLONE_PIDFD")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-06-24 15:52:54 +02:00
Matthias Schiffer
38b37d631a module: allow arch overrides for .exit section names
Some archs like ARM store unwind information for .exit.text in sections
with unusual names. As this unwind information refers to .exit.text, it
must not be loaded when .exit.text is not loaded (when CONFIG_MODULE_UNLOAD
is unset); otherwise, loading a module can fail due to relocation failures.

Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-24 14:00:32 +02:00
Yang Yingliang
2eef1399a8 modules: fix BUG when load module with rodata=n
When loading a module with rodata=n, it causes an executing
NX-protected page BUG.

[   32.379191] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   32.382917] BUG: unable to handle page fault for address: ffffffffc0005000
[   32.385947] #PF: supervisor instruction fetch in kernel mode
[   32.387662] #PF: error_code(0x0011) - permissions violation
[   32.389352] PGD 240c067 P4D 240c067 PUD 240e067 PMD 421a52067 PTE 8000000421a53063
[   32.391396] Oops: 0011 [#1] SMP PTI
[   32.392478] CPU: 7 PID: 2697 Comm: insmod Tainted: G           O      5.2.0-rc5+ #202
[   32.394588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[   32.398157] RIP: 0010:ko_test_init+0x0/0x1000 [ko_test]
[   32.399662] Code: Bad RIP value.
[   32.400621] RSP: 0018:ffffc900029f3ca8 EFLAGS: 00010246
[   32.402171] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   32.404332] RDX: 00000000000004c7 RSI: 0000000000000cc0 RDI: ffffffffc0005000
[   32.406347] RBP: ffffffffc0005000 R08: ffff88842fbebc40 R09: ffffffff810ede4a
[   32.408392] R10: ffffea00108e3480 R11: 0000000000000000 R12: ffff88842bee21a0
[   32.410472] R13: 0000000000000001 R14: 0000000000000001 R15: ffffc900029f3e78
[   32.412609] FS:  00007fb4f0c0a700(0000) GS:ffff88842fbc0000(0000) knlGS:0000000000000000
[   32.414722] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.416290] CR2: ffffffffc0004fd6 CR3: 0000000421a90004 CR4: 0000000000020ee0
[   32.418471] Call Trace:
[   32.419136]  do_one_initcall+0x41/0x1df
[   32.420199]  ? _cond_resched+0x10/0x40
[   32.421433]  ? kmem_cache_alloc_trace+0x36/0x160
[   32.422827]  do_init_module+0x56/0x1f7
[   32.423946]  load_module+0x1e67/0x2580
[   32.424947]  ? __alloc_pages_nodemask+0x150/0x2c0
[   32.426413]  ? map_vm_area+0x2d/0x40
[   32.427530]  ? __vmalloc_node_range+0x1ef/0x260
[   32.428850]  ? __do_sys_init_module+0x135/0x170
[   32.430060]  ? _cond_resched+0x10/0x40
[   32.431249]  __do_sys_init_module+0x135/0x170
[   32.432547]  do_syscall_64+0x43/0x120
[   32.433853]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Because if rodata=n, set_memory_x() can't be called, fix this by
calling set_memory_x in complete_formation();

Fixes: f2c65fb322 ("x86/modules: Avoid breaking W^X while loading modules")
Suggested-by: Jian Cheng <cj.chengjian@huawei.com>
Reviewed-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-24 12:11:41 +02:00
Muchun Song
8afecaa68d softirq: Use __this_cpu_write() in takeover_tasklets()
The code is executed with interrupts disabled, so it's safe to use
__this_cpu_write().

[ tglx: Massaged changelog ]

Signed-off-by: Muchun Song <smuchun@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: joel@joelfernandes.org
Cc: rostedt@goodmis.org
Cc: frederic@kernel.org
Cc: paulmck@linux.vnet.ibm.com
Cc: alexander.levin@verizon.com
Cc: peterz@infradead.org
Link: https://lkml.kernel.org/r/20190618143305.2038-1-smuchun@gmail.com
2019-06-23 18:14:27 +02:00
Nadav Amit
caa759323c smp: Remove smp_call_function() and on_each_cpu() return values
The return value is fixed. Remove it and amend the callers.

[ tglx: Fixup arm/bL_switcher and powerpc/rtas ]

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20190613064813.8102-2-namit@vmware.com
2019-06-23 14:26:26 +02:00
Nadav Amit
a22793c79d smp: Do not mark call_function_data as shared
cfd_data is marked as shared, but although it hold pointers to shared
data structures, it is private per core.

Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20190613064813.8102-8-namit@vmware.com
2019-06-23 14:26:25 +02:00
Nathan Huckleberry
a9314773a9 timer_list: Guard procfs specific code
With CONFIG_PROC_FS=n the following warning is emitted:

kernel/time/timer_list.c:361:36: warning: unused variable
'timer_list_sops' [-Wunused-const-variable]
   static const struct seq_operations timer_list_sops = {

Add #ifdef guard around procfs specific code.

Signed-off-by: Nathan Huckleberry <nhuck@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Cc: clang-built-linux@googlegroups.com
Link: https://github.com/ClangBuiltLinux/linux/issues/534
Link: https://lkml.kernel.org/r/20190614181604.112297-1-nhuck@google.com
2019-06-23 00:08:52 +02:00
Vincenzo Frascino
44f57d788e timekeeping: Provide a generic update_vsyscall() implementation
The new generic VDSO library allows to unify the update_vsyscall[_tz]()
implementations.

Provide a generic implementation based on the x86 code and the bindings
which need to be implemented in architecture specific code.

[ tglx: Moved it into kernel/time where it belongs. Removed the pointless
  	line breaks in the stub functions. Massaged changelog ]

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shijith Thotton <sthotton@marvell.com>
Tested-by: Andre Przywara <andre.przywara@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Huw Davies <huw@codeweavers.com>
Link: https://lkml.kernel.org/r/20190621095252.32307-4-vincenzo.frascino@arm.com
2019-06-22 21:21:06 +02:00
David S. Miller
92ad6325cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor SPDX change conflict.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 08:59:24 -04:00
Sebastian Andrzej Siewior
7586addb99 posix-timers: Use spin_lock_irq() in itimer_delete()
itimer_delete() uses spin_lock_irqsave() to obtain a `flags' variable
which can then be passed to unlock_timer(). It uses already spin_lock
locking for the structure instead of lock_timer() because it has a timer
which can not be removed by others at this point. The cleanup is always
performed with enabled interrupts.

Use spin_lock_irq() / spin_unlock_irq() so the `flags' variable can be
removed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190621143643.25649-3-bigeasy@linutronix.de
2019-06-22 12:14:22 +02:00
Sebastian Andrzej Siewior
12063d4310 posix-timers: Remove "it_signal = NULL" assignment in itimer_delete()
itimer_delete() is invoked during do_exit(). At this point it is the
last thread in the group dying and doing the clean up.
Since it is the last thread in the group, there can not be any other
task attempting to lock the itimer which means the NULL assignment (which
avoids lookups in __lock_timer()) is not required.

The assignment and comment was copied in commit 0e568881178ff ("[PATCH]
fix posix-timers to have proper per-process scope") from
sys_timer_delete() which was/is the syscall interface and requires the
assignment.

Remove the superfluous ->it_signal = NULL assignment.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190621143643.25649-2-bigeasy@linutronix.de
2019-06-22 12:14:22 +02:00
Jason A. Donenfeld
9285ec4c8b timekeeping: Use proper clock specifier names in functions
This makes boot uniformly boottime and tai uniformly clocktai, to
address the remaining oversights.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com
2019-06-22 12:11:27 +02:00
Jason A. Donenfeld
0354c1a3cd timekeeping: Use proper ktime_add when adding nsecs in coarse offset
While this doesn't actually amount to a real difference, since the macro
evaluates to the same thing, every place else operates on ktime_t using
these functions, so let's not break the pattern.

Fixes: e3ff9c3678 ("timekeeping: Repair ktime_get_coarse*() granularity")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-1-Jason@zx2c4.com
2019-06-22 12:11:27 +02:00
Thomas Gleixner
6808acb57a Merge branch 'linus' into timers/core
Pick up upstream fixes for pending changes.
2019-06-22 12:07:35 +02:00
Miroslav Lichvar
d897a4ab11 ntp: Limit TAI-UTC offset
Don't allow the TAI-UTC offset of the system clock to be set by adjtimex()
to a value larger than 100000 seconds.

This prevents an overflow in the conversion to int, prevents the CLOCK_TAI
clock from getting too far ahead of the CLOCK_REALTIME clock, and it is
still large enough to allow leap seconds to be inserted at the maximum rate
currently supported by the kernel (once per day) for the next ~270 years,
however unlikely it is that someone can survive a catastrophic event which
slowed down the rotation of the Earth so much.

Reported-by: Weikang shi <swkhack@gmail.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190618154713.20929-1-mlichvar@redhat.com
2019-06-22 11:28:53 +02:00
Linus Torvalds
c884d8ac7f SPDX update for 5.2-rc6
Another round of SPDX updates for 5.2-rc6
 
 Here is what I am guessing is going to be the last "big" SPDX update for
 5.2.  It contains all of the remaining GPLv2 and GPLv2+ updates that
 were "easy" to determine by pattern matching.  The ones after this are
 going to be a bit more difficult and the people on the spdx list will be
 discussing them on a case-by-case basis now.
 
 Another 5000+ files are fixed up, so our overall totals are:
 	Files checked:            64545
 	Files with SPDX:          45529
 
 Compared to the 5.1 kernel which was:
 	Files checked:            63848
 	Files with SPDX:          22576
 This is a huge improvement.
 
 Also, we deleted another 20000 lines of boilerplate license crud, always
 nice to see in a diffstat.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv
 kbsjBGEJpLbMRB2krnaU
 =RMcT
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull still more SPDX updates from Greg KH:
 "Another round of SPDX updates for 5.2-rc6

  Here is what I am guessing is going to be the last "big" SPDX update
  for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
  that were "easy" to determine by pattern matching. The ones after this
  are going to be a bit more difficult and the people on the spdx list
  will be discussing them on a case-by-case basis now.

  Another 5000+ files are fixed up, so our overall totals are:
	Files checked:            64545
	Files with SPDX:          45529

  Compared to the 5.1 kernel which was:
	Files checked:            63848
	Files with SPDX:          22576

  This is a huge improvement.

  Also, we deleted another 20000 lines of boilerplate license crud,
  always nice to see in a diffstat"

* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
  ...
2019-06-21 09:58:42 -07:00
Julien Thierry
17ce302f31 arm64: Fix interrupt tracing in the presence of NMIs
In the presence of any form of instrumentation, nmi_enter() should be
done before calling any traceable code and any instrumentation code.

Currently, nmi_enter() is done in handle_domain_nmi(), which is much
too late as instrumentation code might get called before. Move the
nmi_enter/exit() calls to the arch IRQ vector handler.

On arm64, it is not possible to know if the IRQ vector handler was
called because of an NMI before acknowledging the interrupt. However, It
is possible to know whether normal interrupts could be taken in the
interrupted context (i.e. if taking an NMI in that context could
introduce a potential race condition).

When interrupting a context with IRQs disabled, call nmi_enter() as soon
as possible. In contexts with IRQs enabled, defer this to the interrupt
controller, which is in a better position to know if an interrupt taken
is an NMI.

Fixes: bc3c03ccb4 ("arm64: Enable the support of pseudo-NMIs")
Cc: <stable@vger.kernel.org> # 5.1.x-
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-06-21 15:49:58 +01:00
Christoph Hellwig
474a280036 cgroup: export css_next_descendant_pre for bfq
The bfq schedule now uses css_next_descendant_pre directly after
the stats functionality depending on it has been from the core
blk-cgroup code to bfq.  Export the symbol so that bfq can still
be build modular.

Fixes: d6258980da ("bfq-iosched: move bfq_stat_recursive_sum into the only caller")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-21 02:48:34 -06:00
Christian Brauner
d68dbb0c9a
arch: handle arches who do not yet define clone3
This cleanly handles arches who do not yet define clone3.

clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the
assumption that this would cleanly handle all architectures. It does
not.
Architectures such as nios2 or h8300 simply take the asm-generic syscall
definitions and generate their syscall table from it. Since they don't
define __ARCH_WANT_SYS_CLONE the build would fail complaining about
sys_clone3 missing. The reason this doesn't happen for legacy clone is
that nios2 and h8300 provide assembly stubs for sys_clone. This seems to
be done for architectural reasons.

The build failures for nios2 and h8300 were caught int -next luckily.
The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can
add. Additionally, we need a cond_syscall(clone3) for architectures such
as nios2 or h8300 that generate their syscall table in the way I
explained above.

Fixes: 8f3220a806 ("arch: wire-up clone3() syscall")
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
2019-06-21 01:54:53 +02:00
Petr Mladek
ac59a471e9 livepatch: Remove duplicate warning about missing reliable stacktrace support
WARN_ON_ONCE() could not be called safely under rq lock because
of console deadlock issues. Moreover WARN_ON_ONCE() is superfluous in
klp_check_stack(), because stack_trace_save_tsk_reliable() cannot return
-ENOSYS thanks to klp_have_reliable_stack() check in
klp_try_switch_task().

[ mbenes: changelog edited ]
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-06-20 16:19:13 +02:00
Miroslav Benes
67059d65f7 Revert "livepatch: Remove reliable stacktrace check in klp_try_switch_task()"
This reverts commit 1d98a69e5c. Commit
31adf2308f ("livepatch: Convert error about unsupported reliable
stacktrace into a warning") weakened the enforcement for architectures
to have reliable stack traces support. The system only warns now about
it.

It only makes sense to reintroduce the compile time checking in
klp_try_switch_task() again and bail out early.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-06-20 16:08:41 +02:00
Miroslav Benes
380178ef7f stacktrace: Remove weak version of save_stack_trace_tsk_reliable()
Recent rework of stack trace infrastructure introduced a new set of
helpers for common stack trace operations (commit e9b98e162a
("stacktrace: Provide helpers for common stack trace operations") and
related). As a result, save_stack_trace_tsk_reliable() is not directly
called anywhere. Livepatch, currently the only user of the reliable
stack trace feature, now calls stack_trace_save_tsk_reliable().

When CONFIG_HAVE_RELIABLE_STACKTRACE is set and depending on
CONFIG_ARCH_STACKWALK, stack_trace_save_tsk_reliable() calls either
arch_stack_walk_reliable() or mentioned save_stack_trace_tsk_reliable().
x86_64 defines the former, ppc64le the latter. All other architectures
do not have HAVE_RELIABLE_STACKTRACE and include/linux/stacktrace.h
defines -ENOSYS returning version for them.

In short, stack_trace_save_tsk_reliable() returning -ENOSYS defined in
include/linux/stacktrace.h serves the same purpose as the old weak
version of save_stack_trace_tsk_reliable() which is therefore no longer
needed.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-06-20 15:43:31 +02:00
David S. Miller
dca73a65a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-06-19

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin.

2) BTF based map definition, from Andrii.

3) support bpf_map_lookup_elem for xskmap, from Jonathan.

4) bounded loops and scalar precision logic in the verifier, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-20 00:06:27 -04:00
Paul E. McKenney
11ca7a9d54 Merge branches 'consolidate.2019.05.28a', 'doc.2019.05.28a', 'fixes.2019.06.13a', 'srcu.2019.05.28a', 'sync.2019.05.28a' and 'torture.2019.05.28a' into HEAD
consolidate.2019.05.28a: RCU flavor consolidation cleanups and optmizations.
doc.2019.05.28a: Documentation updates.
fixes.2019.06.13a: Miscellaneous fixes.
srcu.2019.05.28a: SRCU updates.
sync.2019.05.28a: RCU-sync flavor consolidation.
torture.2019.05.28a: Torture-test updates.
2019-06-19 09:21:46 -07:00
Jesper Dangaard Brouer
6bf071bf09 xdp: page_pool related fix to cpumap
When converting an xdp_frame into an SKB, and sending this into the network
stack, then the underlying XDP memory model need to release associated
resources, because the network stack don't have callbacks for XDP memory
models.  The only memory model that needs this is page_pool, when a driver
use the DMA-mapping feature.

Introduce page_pool_release_page(), which basically does the same as
page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to
save the function call overhead for others. Have cpumap call
xdp_release_frame() before xdp_scrub_frame().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
David Howells
7743c48e54 keys: Cache result of request_key*() temporarily in task_struct
If a filesystem uses keys to hold authentication tokens, then it needs a
token for each VFS operation that might perform an authentication check -
either by passing it to the server, or using to perform a check based on
authentication data cached locally.

For open files this isn't a problem, since the key should be cached in the
file struct since it represents the subject performing operations on that
file descriptor.

During pathwalk, however, there isn't anywhere to cache the key, except
perhaps in the nameidata struct - but that isn't exposed to the
filesystems.  Further, a pathwalk can incur a lot of operations, calling
one or more of the following, for instance:

	->lookup()
	->permission()
	->d_revalidate()
	->d_automount()
	->get_acl()
	->getxattr()

on each dentry/inode it encounters - and each one may need to call
request_key().  And then, at the end of pathwalk, it will call the actual
operation:

	->mkdir()
	->mknod()
	->getattr()
	->open()
	...

which may need to go and get the token again.

However, it is very likely that all of the operations on a single
dentry/inode - and quite possibly a sequence of them - will all want to use
the same authentication token, which suggests that caching it would be a
good idea.

To this end:

 (1) Make it so that a positive result of request_key() and co. that didn't
     require upcalling to userspace is cached temporarily in task_struct.

 (2) The cache is 1 deep, so a new result displaces the old one.

 (3) The key is released by exit and by notify-resume.

 (4) The cache is cleared in a newly forked process.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-06-19 16:10:15 +01:00
Thomas Gleixner
d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00
Thomas Gleixner
f85d208658 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 451
Based on 1 normalized pattern(s):

  this file is subject to the terms and conditions of version 2 of the
  gnu general public license see the file copying in the main
  directory of the linux distribution for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081200.872755311@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:08 +02:00
Thomas Gleixner
8092f73c51 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248
Based on 1 normalized pattern(s):

  this file is released under the gpl v2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204655.103854853@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:08 +02:00
Thomas Gleixner
40b0b3f8fb treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 230
Based on 2 normalized pattern(s):

  this source code is licensed under the gnu general public license
  version 2 see the file copying for more details

  this source code is licensed under general public license version 2
  see

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 52 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:06 +02:00
Konrad Rzeszutek Wilk
8492101e15 Merge branch 'stable/for-linus-5.2' into devel/for-linus-5.2
* stable/for-linus-5.2:
  swiotlb: fix phys_addr_t overflow warning
2019-06-19 10:29:24 -04:00
Arnd Bergmann
9c106119f6 swiotlb: fix phys_addr_t overflow warning
On architectures that have a larger dma_addr_t than phys_addr_t,
the swiotlb_tbl_map_single() function truncates its return code
in the failure path, making it impossible to identify the error
later, as we compare to the original value:

kernel/dma/swiotlb.c:551:9: error: implicit conversion from 'dma_addr_t' (aka 'unsigned long long') to 'phys_addr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Werror,-Wconstant-conversion]
        return DMA_MAPPING_ERROR;

Use an explicit typecast here to convert it to the narrower type,
and use the same expression in the error handling later.

Fixes: b907e20508 ("swiotlb: remove SWIOTLB_MAP_ERROR")
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-06-19 10:28:54 -04:00
Rafael J. Wysocki
0b385a0c3b PM: suspend: Rename pm_suspend_via_s2idle()
The name of pm_suspend_via_s2idle() is confusing, as it doesn't
reflect the purpose of the function precisely enough and it is
very similar to pm_suspend_via_firmware(), which has a different
purpose, so rename it as pm_suspend_default_s2idle() and update
its only caller, i8042_register_ports(), accordingly.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2019-06-19 11:41:26 +02:00
Alexei Starovoitov
b5dc0163d8 bpf: precise scalar_value tracking
Introduce precision tracking logic that
helps cilium programs the most:
                  old clang  old clang    new clang  new clang
                          with all patches         with all patches
bpf_lb-DLB_L3.o      1838     2283         1923       1863
bpf_lb-DLB_L4.o      3218     2657         3077       2468
bpf_lb-DUNKNOWN.o    1064     545          1062       544
bpf_lxc-DDROP_ALL.o  26935    23045        166729     22629
bpf_lxc-DUNKNOWN.o   34439    35240        174607     28805
bpf_netdev.o         9721     8753         8407       6801
bpf_overlay.o        6184     7901         5420       4754
bpf_lxc_jit.o        39389    50925        39389      50925

Consider code:
654: (85) call bpf_get_hash_recalc#34
655: (bf) r7 = r0
656: (15) if r8 == 0x0 goto pc+29
657: (bf) r2 = r10
658: (07) r2 += -48
659: (18) r1 = 0xffff8881e41e1b00
661: (85) call bpf_map_lookup_elem#1
662: (15) if r0 == 0x0 goto pc+23
663: (69) r1 = *(u16 *)(r0 +0)
664: (15) if r1 == 0x0 goto pc+21
665: (bf) r8 = r7
666: (57) r8 &= 65535
667: (bf) r2 = r8
668: (3f) r2 /= r1
669: (2f) r2 *= r1
670: (bf) r1 = r8
671: (1f) r1 -= r2
672: (57) r1 &= 255
673: (25) if r1 > 0x1e goto pc+12
 R0=map_value(id=0,off=0,ks=20,vs=64,imm=0) R1_w=inv(id=0,umax_value=30,var_off=(0x0; 0x1f))
674: (67) r1 <<= 1
675: (0f) r0 += r1

At this point the verifier will notice that scalar R1 is used in map pointer adjustment.
R1 has to be precise for later operations on R0 to be validated properly.

The verifier will backtrack the above code in the following way:
last_idx 675 first_idx 664
regs=2 stack=0 before 675: (0f) r0 += r1         // started backtracking R1 regs=2 is a bitmask
regs=2 stack=0 before 674: (67) r1 <<= 1
regs=2 stack=0 before 673: (25) if r1 > 0x1e goto pc+12
regs=2 stack=0 before 672: (57) r1 &= 255
regs=2 stack=0 before 671: (1f) r1 -= r2         // now both R1 and R2 has to be precise -> regs=6 mask
regs=6 stack=0 before 670: (bf) r1 = r8          // after this insn R8 and R2 has to be precise
regs=104 stack=0 before 669: (2f) r2 *= r1       // after this one R8, R2, and R1
regs=106 stack=0 before 668: (3f) r2 /= r1
regs=106 stack=0 before 667: (bf) r2 = r8
regs=102 stack=0 before 666: (57) r8 &= 65535
regs=102 stack=0 before 665: (bf) r8 = r7
regs=82 stack=0 before 664: (15) if r1 == 0x0 goto pc+21
 // this is the end of verifier state. The following regs will be marked precised:
 R1_rw=invP(id=0,umax_value=65535,var_off=(0x0; 0xffff)) R7_rw=invP(id=0)
parent didn't have regs=82 stack=0 marks         // so backtracking continues into parent state
last_idx 663 first_idx 655
regs=82 stack=0 before 663: (69) r1 = *(u16 *)(r0 +0)   // R1 was assigned no need to track it further
regs=80 stack=0 before 662: (15) if r0 == 0x0 goto pc+23    // keep tracking R7
regs=80 stack=0 before 661: (85) call bpf_map_lookup_elem#1  // keep tracking R7
regs=80 stack=0 before 659: (18) r1 = 0xffff8881e41e1b00
regs=80 stack=0 before 658: (07) r2 += -48
regs=80 stack=0 before 657: (bf) r2 = r10
regs=80 stack=0 before 656: (15) if r8 == 0x0 goto pc+29
regs=80 stack=0 before 655: (bf) r7 = r0                // here the assignment into R7
 // mark R0 to be precise:
 R0_rw=invP(id=0)
parent didn't have regs=1 stack=0 marks                 // regs=1 -> tracking R0
last_idx 654 first_idx 644
regs=1 stack=0 before 654: (85) call bpf_get_hash_recalc#34 // and in the parent frame it was a return value
  // nothing further to backtrack

Two scalar registers not marked precise are equivalent from state pruning point of view.
More details in the patch comments.

It doesn't support bpf2bpf calls yet and enabled for root only.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:52 +02:00
Alexei Starovoitov
eea1c227b9 bpf: fix callees pruning callers
The commit 7640ead939 partially resolved the issue of callees
incorrectly pruning the callers.
With introduction of bounded loops and jmps_processed heuristic
single verifier state may contain multiple branches and calls.
It's possible that new verifier state (for future pruning) will be
allocated inside callee. Then callee will exit (still within the same
verifier state). It will go back to the caller and there R6-R9 registers
will be read and will trigger mark_reg_read. But the reg->live for all frames
but the top frame is not set to LIVE_NONE. Hence mark_reg_read will fail
to propagate liveness into parent and future walking will incorrectly
conclude that the states are equivalent because LIVE_READ is not set.
In other words the rule for parent/live should be:
whenever register parentage chain is set the reg->live should be set to LIVE_NONE.
is_state_visited logic already follows this rule for spilled registers.

Fixes: 7640ead939 ("bpf: verifier: make sure callees don't prune with caller differences")
Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
Alexei Starovoitov
2589726d12 bpf: introduce bounded loops
Allow the verifier to validate the loops by simulating their execution.
Exisiting programs have used '#pragma unroll' to unroll the loops
by the compiler. Instead let the verifier simulate all iterations
of the loop.
In order to do that introduce parentage chain of bpf_verifier_state and
'branches' counter for the number of branches left to explore.
See more detailed algorithm description in bpf_verifier.h

This algorithm borrows the key idea from Edward Cree approach:
https://patchwork.ozlabs.org/patch/877222/
Additional state pruning heuristics make such brute force loop walk
practical even for large loops.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
Alexei Starovoitov
fb8d251ee2 bpf: extend is_branch_taken to registers
This patch extends is_branch_taken() logic from JMP+K instructions
to JMP+X instructions.
Conditional branches are often done when src and dst registers
contain known scalars. In such case the verifier can follow
the branch that is going to be taken when program executes.
That speeds up the verification and is essential feature to support
bounded loops.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
Alexei Starovoitov
f7cf25b202 bpf: track spill/fill of constants
Compilers often spill induction variables into the stack,
hence it is necessary for the verifier to track scalar values
of the registers through stack slots.

Also few bpf programs were incorrectly rejected in the past,
since the verifier was not able to track such constants while
they were used to compute offsets into packet headers.

Tracking constants through the stack significantly decreases
the chances of state pruning, since two different constants
are considered to be different by state equivalency.
End result that cilium tests suffer serious degradation in the number
of states processed and corresponding verification time increase.

                     before  after
bpf_lb-DLB_L3.o      1838    6441
bpf_lb-DLB_L4.o      3218    5908
bpf_lb-DUNKNOWN.o    1064    1064
bpf_lxc-DDROP_ALL.o  26935   93790
bpf_lxc-DUNKNOWN.o   34439   123886
bpf_netdev.o         9721    31413
bpf_overlay.o        6184    18561
bpf_lxc_jit.o        39389   359445

After further debugging turned out that cillium progs are
getting hurt by clang due to the same constant tracking issue.
Newer clang generates better code by spilling less to the stack.
Instead it keeps more constants in the registers which
hurts state pruning since the verifier already tracks constants
in the registers:
                  old clang  new clang
                         (no spill/fill tracking introduced by this patch)
bpf_lb-DLB_L3.o      1838    1923
bpf_lb-DLB_L4.o      3218    3077
bpf_lb-DUNKNOWN.o    1064    1062
bpf_lxc-DDROP_ALL.o  26935   166729
bpf_lxc-DUNKNOWN.o   34439   174607
bpf_netdev.o         9721    8407
bpf_overlay.o        6184    5420
bpf_lcx_jit.o        39389   39389

The final table is depressing:
                  old clang  old clang    new clang  new clang
                           const spill/fill        const spill/fill
bpf_lb-DLB_L3.o      1838    6441          1923      8128
bpf_lb-DLB_L4.o      3218    5908          3077      6707
bpf_lb-DUNKNOWN.o    1064    1064          1062      1062
bpf_lxc-DDROP_ALL.o  26935   93790         166729    380712
bpf_lxc-DUNKNOWN.o   34439   123886        174607    440652
bpf_netdev.o         9721    31413         8407      31904
bpf_overlay.o        6184    18561         5420      23569
bpf_lxc_jit.o        39389   359445        39389     359445

Tracking constants in the registers hurts state pruning already.
Adding tracking of constants through stack hurts pruning even more.
The later patch address this general constant tracking issue
with coarse/precise logic.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-19 02:22:51 +02:00
David S. Miller
13091aa305 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 20:20:36 -07:00
Gustavo A. R. Silva
f0553dcb97 tracepoint: Use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct tp_probes {
	...
        struct tracepoint_func probes[0];
};

instance = kmalloc(sizeof(sizeof(struct tp_probes) +
			sizeof(struct tracepoint_func) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, probes, count) GFP_KERNEL);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-17 21:13:32 -04:00
Linus Torvalds
da0f382029 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of bug fixes here:

   1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

   2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
      Crispin.

   3) Use after free in psock backlog workqueue, from John Fastabend.

   4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
      Salem.

   5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

   6) Network header needs to be set for packet redirect in nfp, from
      John Hurley.

   7) Fix udp zerocopy refcnt, from Willem de Bruijn.

   8) Don't assume linear buffers in vxlan and geneve error handlers,
      from Stefano Brivio.

   9) Fix TOS matching in mlxsw, from Jiri Pirko.

  10) More SCTP cookie memory leak fixes, from Neil Horman.

  11) Fix VLAN filtering in rtl8366, from Linus Walluij.

  12) Various TCP SACK payload size and fragmentation memory limit fixes
      from Eric Dumazet.

  13) Use after free in pneigh_get_next(), also from Eric Dumazet.

  14) LAPB control block leak fix from Jeremy Sowden"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
  lapb: fixed leak of control-blocks.
  tipc: purge deferredq list for each grp member in tipc_group_delete
  ax25: fix inconsistent lock state in ax25_destroy_timer
  neigh: fix use-after-free read in pneigh_get_next
  tcp: fix compile error if !CONFIG_SYSCTL
  hv_sock: Suppress bogus "may be used uninitialized" warnings
  be2net: Fix number of Rx queues used for flow hashing
  net: handle 802.1P vlan 0 packets properly
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  Revert "net: phylink: set the autoneg state in phylink_phy_change"
  bpf: fix nested bpf tracepoints with per-cpu data
  bpf: Fix out of bounds memory access in bpf_sk_storage
  vsock/virtio: set SOCK_DONE on peer shutdown
  net: dsa: rtl8366: Fix up VLAN filtering
  net: phylink: set the autoneg state in phylink_phy_change
  net: add high_order_alloc_disable sysctl/static key
  tcp: add tcp_tx_skb_cache sysctl
  ...
2019-06-17 15:55:34 -07:00
Peter Zijlstra
8dc2d993cf x86/percpu, sched/fair: Avoid local_clock()
Nadav reported that code-gen changed because of the this_cpu_*()
constraints, avoid this for select_idle_cpu() because that runs with
preemption (and IRQs) disabled anyway.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:43:43 +02:00
Ingo Molnar
bddb363673 Merge branch 'x86/cpu' into perf/core, to pick up dependent changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:29:16 +02:00
Waiman Long
a15ea1a35f locking/rwsem: Guard against making count negative
The upper bits of the count field is used as reader count. When
sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keep on piling up, we may eventually overflow the reader counts.
This is not likely to happen unless the number of bits reserved for
reader count is reduced because those bits are need for other purpose.

To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:11 +02:00
Waiman Long
5cfd92e12e locking/rwsem: Adaptive disabling of reader optimistic spinning
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and set the nospinnable bits, there are two main
reasons for that.

 1) The reader critical section is long, perhaps the task sleeps after
    acquiring the read lock.
 2) There are just too many readers contending the lock causing it to
    take a while to service all of them.

In the former case, long reader critical section will impede the progress
of writers which is usually more important for system performance.
In the later case, reader optimistic spinning tends to make the reader
groups that contain readers that acquire the lock together smaller
leading to more of them. That may hurt performance in some cases. In
other words, the setting of nonspinnable bits indicates that reader
optimistic spinning may not be helpful for those workloads that cause it.

Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.

Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinable bit.

This is to discourage reader optmistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effect of
this feature.

In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.

On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various number of threads. The
number of operations done before reader optimistic spinning patches,
this patch and after this patch were:

  Threads  Before rspin  Before patch  After patch    %change
  -------  ------------  ------------  -----------    -------
    20        5541068      5345484       5455667    -3.5%/ +2.1%
    40       10185150      7292313       9219276   -28.5%/+26.4%
    60        8196733      6460517       7181209   -21.2%/+11.2%
    80        9508864      6739559       8107025   -29.1%/+20.3%

This patch doesn't recover all the lost performance, but it is more
than half. Given the fact that reader optimistic spinning does benefit
some workloads, this is a good compromise.

Using the rwsem locking microbenchmark with very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          4,730        4,969
        4          4,814        4,786
        8          4,866        4,815
       16          4,715        4,511
       32          3,338        3,500
       64          3,212        3,389
       80          3,110        3,044

When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.

Before patch:

  40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
  40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644

After patch:

  40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
  40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:09 +02:00
Waiman Long
7d43f1ce9d locking/rwsem: Enable time-based spinning on reader-owned rwsem
When the rwsem is owned by reader, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for spinning on a reader-owned
rwsem by a writer. It is a time threshold based spinning where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem.

When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer owned again. Not even
readers is allowed to acquire the reader-locked rwsem by optimistic
spinning for fairness.

We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
into two - one for reader and one for writer. When optimistic spinning
is disabled, both bits will be set. When the reader count drop down
to 0, the writer nonspinnable bit will be cleared to allow writers to
spin on the lock, but not the readers. When a writer acquires the lock,
it will write its own task structure pointer into sem->owner and clear
the reader nonspinnable bit in the process.

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

      System                 Time for 16 Iterations
      ------                 ----------------------
  1-socket Skylake                  ~800ns
  4-socket Broadwell                ~300ns
  2-socket ThunderX2 (arm64)        ~250ns

When the lock cacheline is contended, we can see up to almost 10X
increase in elapsed time.  So 25us will be at most 500, 1300 and 1600
iterations for each of the above systems.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        2          1,759        6,684
        4          1,684        6,738
        8          1,074        7,222
       16            900        7,163
       32            458        7,316
       64            208          520
      128            168          425
      240            143          474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:07 +02:00
Waiman Long
94a9717b3c locking/rwsem: Make rwsem->owner an atomic_long_t
The rwsem->owner contains not just the task structure pointer, it also
holds some flags for storing the current state of the rwsem. Some of
the flags may have to be atomically updated. To reflect the new reality,
the owner is now changed to an atomic_long_t type.

New helper functions are added to properly separate out the task
structure pointer and the embedded flags.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:06 +02:00
Waiman Long
cf69482d62 locking/rwsem: Enable readers spinning on writer
This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly.  The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
rwsem_down_read_slowpath() and rwsem_down_write_slowpath().

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBrige-EX system with equal
numbers of readers and writers before and after the patch were as
follows:

   # of Threads  Pre-patch    Post-patch
   ------------  ---------    ----------
        4          1,674        1,684
        8          1,062        1,074
       16            924          900
       32            300          458
       64            195          208
      128            164          168
      240            149          143

The performance change wasn't significant in this case, but this change
is required by a follow-on patch.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:05 +02:00
Waiman Long
02f1082b00 locking/rwsem: Clarify usage of owner's nonspinaable bit
Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.

With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spinned on for a limit period of time. We still need
this bit to indicate a rwsem is nonspinnable, but not setting this
bit loses its meaning that the owner is known. So rename the bit
to RWSEM_NONSPINNABLE to clarify its meaning.

This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:03 +02:00
Waiman Long
d3681e269f locking/rwsem: Wake up almost all readers in wait queue
When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between. Those readers
behind the writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue will be about the same as the readers that are being woken up,
there is really not much additional cost other than the additional
latency due to the wakeup of additional tasks by the waker. Therefore
all the readers up to a maximum of 256 in the queue are woken up when
the first waiter is a reader to improve reader throughput. This is
somewhat similar in concept to a phase-fair R/W lock.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-Patch   Post-patch
   ------------  ---------   ----------
        4          1,641        1,674
        8            731        1,062
       16            564          924
       32             78          300
       64             38          195
      240             50          149

There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:02 +02:00
Waiman Long
990fa7384a locking/rwsem: More optimal RT task handling of null owner
An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that high priority of the RT task may block forward
progress of the lock holder if it happens to reside on the same CPU.
This will lead to deadlock. So we have to make sure that an RT task
will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:

 1) The lock owner is in the process of releasing the lock, sem->owner
    is cleared but the lock has not been released yet.

 2) The lock was free and owner cleared, but another task just comes
    in and acquire the lock before we try to get it. The new owner may
    be a spinnable writer.

So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.

When testing on a 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contentions. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.

    Locking threads     Pre-patch     Post-patch
    ---------------     ---------     -----------
            4             2,753          2,608
            8             2,529          2,520
           16             1,727          1,918
           32             1,263          1,956
           64               889          1,343

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:01 +02:00
Waiman Long
00f3c5a3df locking/rwsem: Always release wait_lock before waking up tasks
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
stale count value.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:28:00 +02:00
Waiman Long
4f23dbc1e6 locking/rwsem: Implement lock handoff to prevent lock starvation
Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely leading to lock starvation.

This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers)
in the queue after at least a 4ms waiting period unless it is a RT
writer task which doesn't need to wait. The waiting period is used to
avoid discouraging lock stealing too much to affect performance.

The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.

A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458.  It can be seen that the rwsem became much
more fair, though there was a drop of about 16% in the mean locking
operations done which was a tradeoff of having better fairness.

Making the waiter set the handoff bit right after the first wakeup can
impact performance especially with a mixed reader/writer workload. With
the same microbenchmark with short critical section and equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:

  40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
  40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081

By making waiter set handoff bit immediately after wakeup:

  40 readers, Iterations Min/Mean/Max = 43/44/46
  40 writers, Iterations Min/Mean/Max = 43/1,263/3,191

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-8-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:59 +02:00
Waiman Long
3f6d517a3e locking/rwsem: Make rwsem_spin_on_owner() return owner state
This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of lock holder which enables us to
make a better decision of what to do next.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-7-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:59 +02:00
Waiman Long
6cef7ff6e4 locking/rwsem: Code cleanup after files merging
After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:

 1) Remove all the EXPORT_SYMBOL() calls for functions that are not
    accessed elsewhere.
 2) Remove all the __visible tags as none of the functions will be
    called from assembly code anymore.
 3) Make all the internal functions static.
 4) Remove some unneeded blank lines.
 5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
    and rename __rwsem_down_{read|write}_failed_common() to
    rwsem_down_{read|write}_slowpath().
 6) Remove "__" prefix of __rwsem_mark_wake().
 7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
 8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
    are not that useful.

That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-6-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:58 +02:00
Waiman Long
5dec94d492 locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c
Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for writer instead. So
the filename rwsem-xadd.c is not strictly correct. Also no one outside
of the rwsem code need to know the internal implementation other than
function prototypes for two internal functions that are called directly
from percpu-rwsem.c.

So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:

  <upper part of rwsem.h>
  <rwsem-xadd.c>
  <lower part of rwsem.h>
  <rwsem.c>

The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().

This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-5-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:57 +02:00
Waiman Long
64489e7800 locking/rwsem: Implement a new locking scheme
The current way of using various reader, writer and waiting biases
in the rwsem code are confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to be optimized.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used.  The atomic long count has the following
bit definitions:

  Bit  0   - writer locked bit
  Bit  1   - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with xadd instruction, so there is no change here.
This scheme will allow up to 16M/64P active readers which should be
more than enough. We can always use some more reserved bits if necessary.

With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) of the benchmark on a 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        30,659   31,341   31,055   31,283
        2         8,909   16,457    9,884   17,659
        4         9,028   15,823    8,933   20,233
        8         8,410   14,212    7,230   17,140
       16         8,217   25,240    7,479   24,607

The locking rates of the benchmark on a Power8 system were as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        12,963   13,647   13,275   13,601
        2         7,570   11,569    7,902   10,829
        4         5,232    5,516    5,466    5,435
        8         5,233    3,386    5,467    3,168

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                  Before Patch      After Patch
   # of Threads  wlock    rlock    wlock    rlock
   ------------  -----    -----    -----    -----
        1        21,495   21,046   21,524   21,074
        2         5,293   10,502    5,333   10,504
        4         5,325   11,463    5,358   11,631
        8         5,391   11,712    5,470   11,680

The performance are roughly the same before and after the patch. There
are run-to-run variations in performance. Runs with higher variances
usually have higher throughput.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-4-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:56 +02:00
Waiman Long
5c1ec49b60 locking/rwsem: Remove rwsem_wake() wakeup optimization
After the following commit:

  59aabfc7e9 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()")

the rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly
acquired and an optimistic spinning locker is present.  This can help performance
by avoiding spinning on the wait_lock when it is contended.

With the later commit:

  133e89ef5e ("locking/rwsem: Enable lockless waiter wakeup(s)")

the performance advantage of the above optimization diminishes as the average
wait_lock hold time become much shorter.

With a later patch that supports rwsem lock handoff, we can no
longer relies on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
rwsem_wake() will be called later on to wake up waiters. This can lead
to missed wakeup and application hang.

So the original 59aabfc7e9 commit has to be reverted.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-3-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:55 +02:00
Waiman Long
c71fd893f6 locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER
The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful in debugging as well as tracing locking related issues when
analyzing crash dump. The owner field may also store state information
that can be important to the operation of the rwsem.

So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:27:54 +02:00
bsegall@google.com
66567fcbae sched/fair: Don't push cfs_bandwith slack timers forward
When a cfs_rq sleeps and returns its quota, we delay for 5ms before
waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
sleep, as this has to be done outside of the rq lock we hold.

The current code waits for 5ms without any sleeps, instead of waiting
for 5ms from the first sleep, which can delay the unthrottle more than
we want. Switch this around so that we can't push this forward forever.

This requires an extra flag rather than using hrtimer_active, since we
need to start a new timer if the current one is in the process of
finishing.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Acked-by: Phil Auld <pauld@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:16:01 +02:00
Peter Zijlstra
aacedf26fb sched/core: Optimize try_to_wake_up() for local wakeups
Jens reported that significant performance can be had on some block
workloads by special casing local wakeups. That is, wakeups on the
current task before it schedules out.

Given something like the normal wait pattern:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);

		if (cond)
			break;

		schedule();
	}
	__set_current_state(TASK_RUNNING);

Any wakeup (on this CPU) after set_current_state() and before
schedule() would benefit from this.

Normal wakeups take p->pi_lock, which serializes wakeups to the same
task. By eliding that we gain concurrency on:

 - ttwu_stat(); we already had concurrency on rq stats, this now also
   brings it to task stats. -ENOCARE

 - tracepoints; it is now possible to get multiple instances of
   trace_sched_waking() (and possibly trace_sched_wakeup()) for the
   same task. Tracers will have to learn to cope.

Furthermore, p->pi_lock is used by set_special_state(), to order
against TASK_RUNNING stores from other CPUs. But since this is
strictly CPU local, we don't need the lock, and set_special_state()'s
disabling of IRQs is sufficient.

After the normal wakeup takes p->pi_lock it issues
smp_mb__after_spinlock(), in order to ensure the woken task must
observe prior stores before we observe the p->state. If this is CPU
local, this will be satisfied with a compiler barrier, and we rely on
try_to_wake_up() being a funcation call, which implies such.

Since, when 'p == current', 'p->on_rq' must be true, the normal wakeup
would continue into the ttwu_remote() branch, which normally is
concerned with exactly this wakeup scenario, except from a remote CPU.
IOW we're waking a task that is still running. In this case, we can
trivially avoid taking rq->lock, all that's left from this is to set
p->state.

This then yields an extremely simple and fast path for 'p == current'.

Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: gkohli@codeaurora.org
Cc: hch@lst.de
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:59 +02:00
Qian Cai
509466b7d4 sched/fair: Fix "runnable_avg_yN_inv" not used warnings
runnable_avg_yN_inv[] is only used in kernel/sched/pelt.c but was
included in several other places because they need other macros all
came from kernel/sched/sched-pelt.h which was generated by
Documentation/scheduler/sched-pelt. As the result, it causes compilation
a lot of warnings,

  kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
  kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
  kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
  ...

Silence it by appending the __maybe_unused attribute for it, so all
generated variables and macros can still be kept in the same file.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1559596304-31581-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:58 +02:00
Valentin Schneider
b0c7922441 sched/fair: Clean up definition of NOHZ blocked load functions
cfs_rq_has_blocked() and others_have_blocked() are only used within
update_blocked_averages(). The !CONFIG_FAIR_GROUP_SCHED version of the
latter calls them within a #define CONFIG_NO_HZ_COMMON block, whereas
the CONFIG_FAIR_GROUP_SCHED one calls them unconditionnally.

As reported by Qian, the above leads to this warning in
!CONFIG_NO_HZ_COMMON configs:

  kernel/sched/fair.c: In function 'update_blocked_averages':
  kernel/sched/fair.c:7750:7: warning: variable 'done' set but not used [-Wunused-but-set-variable]

It wouldn't be wrong to keep cfs_rq_has_blocked() and
others_have_blocked() as they are, but since their only current use is
to figure out when we can stop calling update_blocked_averages() on
fully decayed NOHZ idle CPUs, we can give them a new definition for
!CONFIG_NO_HZ_COMMON.

Change the definition of cfs_rq_has_blocked() and
others_have_blocked() for !CONFIG_NO_HZ_COMMON so that the
NOHZ-specific blocks of update_blocked_averages() become no-ops and
the 'done' variable gets optimised out.

While at it, remove the CONFIG_NO_HZ_COMMON block from the
!CONFIG_FAIR_GROUP_SCHED definition of update_blocked_averages() by
using the newly-introduced update_blocked_load_status() helper.

No change in functionality intended.

[ Additions by Peter Zijlstra. ]

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190603115424.7951-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:57 +02:00
Gao Xiang
e3b929b0a1 sched/core: Add __sched tag for io_schedule()
Non-inline io_schedule() was introduced in:

  commit 10ab56434f ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()")

Keep in line with io_schedule_timeout(), otherwise "/proc/<pid>/wchan" will
report io_schedule() rather than its callers when waiting for IO.

Reported-by: Jilong Kou <koujilong@huawei.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 10ab56434f ("sched/core: Separate out io_schedule_prepare() and io_schedule_finish()")
Link: https://lkml.kernel.org/r/20190603091338.2695-1-gaoxiang25@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:15:56 +02:00
Ingo Molnar
23da766ab1 Linux 5.2-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Gj1MeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGctkH/0At3+SQPY2JJSy8
 i6+TDeytFx9OggeGLPHChRfehkAlvMb/kd34QHnuEvDqUuCAMU6HZQJFKoK9mvFI
 sDJVayPGDSqpm+iv8qLpMBPShiCXYVnGZeVfOdv36jUswL0k6wHV1pz4avFkDeZa
 1F4pmI6O2XRkNTYQawbUaFkAngWUCBG9ECLnHJnuIY6ohShBvjI4+E2JUaht+8gO
 M2h2b9ieddWmjxV3LTKgsK1v+347RljxdZTWnJ62SCDSEVZvsgSA9W2wnebVhBkJ
 drSmrFLxNiM+W45mkbUFmQixRSmjv++oRR096fxAnodBxMw0TDxE1RiMQWE6rVvG
 N6MC6xA=
 =+B0P
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc5' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:12:27 +02:00
Peter Zijlstra
085ebfe937 perf/core: Fix perf_sample_regs_user() mm check
perf_sample_regs_user() uses 'current->mm' to test for the presence of
userspace, but this is insufficient, consider use_mm().

A better test is: '!(current->flags & PF_KTHREAD)', exec() clears
PF_KTHREAD after it sets the new ->mm but before it drops to userspace
for the first time.

Possibly obsoletes: bf05fc25f2 ("powerpc/perf: Fix oops when kthread execs user process")

Reported-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reported-by: Young Xiao <92siuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4018994f3d ("perf: Add ability to attach user level registers dump to sample")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:11:58 +02:00
Kobe Wu
dd471efe34 locking/lockdep: Remove unnecessary DEBUG_LOCKS_WARN_ON()
DEBUG_LOCKS_WARN_ON() will turn off debug_locks and
makes print_unlock_imbalance_bug() return directly.

Remove a redundant whitespace.

Signed-off-by: Kobe Wu <kobe-cp.wu@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Eason Lin <eason-yh.lin@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/1559217575-30298-1-git-send-email-kobe-cp.wu@mediatek.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:37 +02:00
Daniel Bristot de Oliveira
c2ba8a15f3 jump_label: Batch updates if arch supports it
If the architecture supports the batching of jump label updates, use it!

An easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------
  #!/bin/sh
  while [ true ]; do
      sysctl -w kernel.sched_schedstats=1
      sleep 2
      sysctl -w kernel.sched_schedstats=0
      sleep 2
  done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
  # watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see +- 168 IPIs each 2 seconds,
while with this patch the number of IPIs goes to 3 each 2 seconds.

Regarding the performance impact of this patch set, I made two measurements:

    The time to update a key (the task that is causing the change)
    The time to run the int3 handler (the side effect on a thread that
                                      hits the code being changed)

The schedstats static key was chosen as the key to being switched on and off.
The reason being is that it is used in more than 56 places, in a hot path. The
change in the schedstats static key will be done with the following command:

while [ true ]; do
    sysctl -w kernel.sched_schedstats=1
    usleep 500000
    sysctl -w kernel.sched_schedstats=0
    usleep 500000
done

In this way, they key will be updated twice per second. To force the hit of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
@2.27GHz.

Regarding the update part, on average, the regular kernel takes 57 ms to update
the schedstats key, while the kernel with the batch updates takes just 1.4 ms
on average. Although it seems to be too good to be true, it makes sense: the
schedstats key is used in 56 places, so it was expected that it would take
around 56 times to update the keys with the current implementation, as the
IPIs are the most expensive part of the update.

Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, it seems to be a high
value. But it is not, considering that it is doing 56 updates, rather than one!
It is taking four times more, only. This gain is possible because the patch
uses a binary search in the vector: log2(56)=5.8. So, it was expected to have
an overhead within four times.

(voice of tv propaganda) But, that is not all! As the int3 handler keeps on for
a shorter period (because the update part is on for a shorter time), the number
of hits in the int3 handler decreased by 10%.

The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?

Considering that, in this test case, we are saving the handling of 53 IPIs,
that takes more than these 135 ns, it seems to be a meager price to be paid.
Moreover, the test case was forcing the hit of the int3, in practice, it
does not take that often. While the IPI takes place on all CPUs, hitting
the int3 handler or not!

For instance, in an isolated CPU with a process running in user-space
(nohz_full use-case), the chances of hitting the int3 handler is barely zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
a lot this scenario.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/acc891dbc2dbc9fd616dd680529a2337b1d1274c.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:22 +02:00
Daniel Bristot de Oliveira
0f133021bd jump_label: Sort entries of the same key by the code
In the batching mode, all the entries of a given key are updated at once.
During the update of a key, a hit in the int3 handler will check if the
hitting code address belongs to one of these keys.

To optimize the search of a given code in the vector of entries being
updated, a binary search is used. The binary search relies on the order
of the entries of a key by its code. Hence the keys need to be sorted
by the code too, so sort the entries of a given key by the code.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/f57ae83e0592418ba269866bb7ade570fc8632e0.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:21 +02:00
Daniel Bristot de Oliveira
e1aacb3f4a jump_label: Add a jump_label_can_update() helper
Move the check if a jump_entry is valid to a function. No functional
change.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/56b69bd3f8e644ed64f2dbde7c088030b8cbe76b.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:09:19 +02:00
Ingo Molnar
410df0c574 Linux 5.2-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0Gj1MeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGctkH/0At3+SQPY2JJSy8
 i6+TDeytFx9OggeGLPHChRfehkAlvMb/kd34QHnuEvDqUuCAMU6HZQJFKoK9mvFI
 sDJVayPGDSqpm+iv8qLpMBPShiCXYVnGZeVfOdv36jUswL0k6wHV1pz4avFkDeZa
 1F4pmI6O2XRkNTYQawbUaFkAngWUCBG9ECLnHJnuIY6ohShBvjI4+E2JUaht+8gO
 M2h2b9ieddWmjxV3LTKgsK1v+347RljxdZTWnJ62SCDSEVZvsgSA9W2wnebVhBkJ
 drSmrFLxNiM+W45mkbUFmQixRSmjv++oRR096fxAnodBxMw0TDxE1RiMQWE6rVvG
 N6MC6xA=
 =+B0P
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-17 12:06:34 +02:00
Linus Torvalds
efba92d58f Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A set of small fixes:

   - Repair the ktime_get_coarse() functions so they actually deliver
     what they are supposed to: tick granular time stamps. The current
     code missed to add the accumulated nanoseconds part of the
     timekeeper so the resulting granularity was 1 second.

   - Prevent the tracer from infinitely recursing into time getter
     functions in the arm architectured timer by marking these functions
     notrace

   - Fix a trivial compiler warning caused by wrong qualifier ordering"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Repair ktime_get_coarse*() granularity
  clocksource/drivers/arm_arch_timer: Don't trace count reader functions
  clocksource/drivers/timer-ti-dm: Change to new style declaration
2019-06-16 07:22:56 -10:00
David S. Miller
1eb4169c1e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-06-15

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix stack layout of JITed x64 bpf code, from Alexei.

2) fix out of bounds memory access in bpf_sk_storage, from Arthur.

3) fix lpm trie walk, from Jonathan.

4) fix nested bpf_perf_event_output, from Matt.

5) and several other fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:19:47 -07:00
Matt Mullins
9594dc3c7e bpf: fix nested bpf tracepoints with per-cpu data
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.

This enables three levels of nesting, to support
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to call, and
  - another one in nmi context
(at most one of which may be a kprobe or perf event).

Fixes: 20b9d7ac48 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-15 16:33:35 -07:00
Linus Torvalds
6a71398c6a This includes the following fixes:
- Out of range read of stack trace output
  - Fix for NULL pointer dereference in trace_uprobe_create()
  - Fix to a livepatching / ftrace permission race in the module code
  - Fix for NULL pointer dereference in free_ftrace_func_mapper()
  - A couple of build warning clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXQToxhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qusmAP4/mmJPgsDchnu5ui0wB8BByzJlsPn8
 luXFDuqI4f34zgD+JCmeYbj5LLh98D9XkaaEgP4yz3yKsWeSdwWPCU0vTgo=
 =M/+E
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Out of range read of stack trace output

 - Fix for NULL pointer dereference in trace_uprobe_create()

 - Fix to a livepatching / ftrace permission race in the module code

 - Fix for NULL pointer dereference in free_ftrace_func_mapper()

 - A couple of build warning clean ups

* tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
  module: Fix livepatch/ftrace module text permissions race
  tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
  tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
  tracing: Make two symbols static
  tracing: avoid build warning with HAVE_NOP_MCOUNT
  tracing: Fix out-of-range read in trace_stack_print()
2019-06-15 07:24:11 -10:00
Heiko Carstens
4ecf0a43e7 processor: get rid of cpu_relax_yield
stop_machine is the only user left of cpu_relax_yield. Given that it
now has special semantics which are tied to stop_machine introduce a
weak stop_machine_yield function which architectures can override, and
get rid of the generic cpu_relax_yield implementation.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-06-15 12:25:55 +02:00
Martin Schwidefsky
38f2c691a4 s390: improve wait logic of stop_machine
The stop_machine loop to advance the state machine and to wait for all
affected CPUs to check-in calls cpu_relax_yield in a tight loop until
the last missing CPUs acknowledged the state transition.

On a virtual system where not all logical CPUs are backed by real CPUs
all the time it can take a while for all CPUs to check-in. With the
current definition of cpu_relax_yield a diagnose 0x44 is done which
tells the hypervisor to schedule *some* other CPU. That can be any
CPU and not necessarily one of the CPUs that need to run in order to
advance the state machine. This can lead to a pretty bad diagnose 0x44
storm until the last missing CPU finally checked-in.

Replace the undirected cpu_relax_yield based on diagnose 0x44 with a
directed yield. Each CPU in the wait loop will pick up the next CPU
in the cpumask of stop_machine. The diagnose 0x9c is used to tell the
hypervisor to run this next CPU instead of the current one. If there
is only a limited number of real CPUs backing the virtual CPUs we
end up with the real CPUs passed around in a round-robin fashion.

[heiko.carstens@de.ibm.com]:
    Use cpumask_next_wrap as suggested by Peter Zijlstra.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-06-15 12:25:52 +02:00
Linus Torvalds
0011572c88 Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This has an unusually high density of tricky fixes:

   - task_get_css() could deadlock when it races against a dying cgroup.

   - cgroup.procs didn't list thread group leaders with live threads.

     This could mislead readers to think that a cgroup is empty when
     it's not. Fixed by making PROCS iterator include dead tasks. I made
     a couple mistakes making this change and this pull request contains
     a couple follow-up patches.

   - When cpusets run out of online cpus, it updates cpusmasks of member
     tasks in bizarre ways. Joel improved the behavior significantly"

* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: restore sanity to cpuset_cpus_allowed_fallback()
  cgroup: Fix css_task_iter_advance_css_set() cset skip condition
  cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
  cgroup: Include dying leaders with live threads in PROCS iterations
  cgroup: Implement css_task_iter_skip()
  cgroup: Call cgroup_release() before __exit_signal()
  docs cgroups: add another example size for hugetlb
  cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
2019-06-14 17:46:14 -10:00
Eric Dumazet
a8e11e5c56 sysctl: define proc_do_static_key()
Convert proc_dointvec_minmax_bpf_stats() into a more generic
helper, since we are going to use jump labels more often.

Note that sysctl_bpf_stats_enabled is removed, since
it is no longer needed/used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:27 -07:00
Toshiaki Makita
86723c8640 bpf, devmap: Add missing RCU read lock on flush
.ndo_xdp_xmit() assumes it is called under RCU. For example virtio_net
uses RCU to detect it has setup the resources for tx. The assumption
accidentally broke when introducing bulk queue in devmap.

Fixes: 5d053f9da4 ("bpf: devmap prepare xdp frames for bulking")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 00:58:51 +02:00
Toshiaki Makita
edabf4d9dd bpf, devmap: Add missing bulk queue free
dev_map_free() forgot to free bulk queue when freeing its entries.

Fixes: 5d053f9da4 ("bpf: devmap prepare xdp frames for bulking")
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 00:58:47 +02:00
Toshiaki Makita
d4dd153d55 bpf, devmap: Fix premature entry free on destroying map
dev_map_free() waits for flush_needed bitmap to be empty in order to
ensure all flush operations have completed before freeing its entries.
However the corresponding clear_bit() was called before using the
entries, so the entries could be used after free.

All access to the entries needs to be done before clearing the bit.
It seems commit a5e2da6e97 ("bpf: netdev is never null in
__dev_map_flush") accidentally changed the clear_bit() and memory access
order.

Note that the problem happens only in __dev_map_flush(), not in
dev_map_flush_old(). dev_map_flush_old() is called only after nulling
out the corresponding netdev_map entry, so dev_map_free() never frees
the entry thus no such race happens there.

Fixes: a5e2da6e97 ("bpf: netdev is never null in __dev_map_flush")
Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 00:58:42 +02:00
Wei Li
04e03d9a61 ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
The mapper may be NULL when called from register_ftrace_function_probe()
with probe->data == NULL.

This issue can be reproduced as follow (it may be covered by compiler
optimization sometime):

/ # cat /sys/kernel/debug/tracing/set_ftrace_filter
#### all functions enabled ####
/ # echo foo_bar:dump > /sys/kernel/debug/tracing/set_ftrace_filter
[  206.949100] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  206.952402] Mem abort info:
[  206.952819]   ESR = 0x96000006
[  206.955326]   Exception class = DABT (current EL), IL = 32 bits
[  206.955844]   SET = 0, FnV = 0
[  206.956272]   EA = 0, S1PTW = 0
[  206.956652] Data abort info:
[  206.957320]   ISV = 0, ISS = 0x00000006
[  206.959271]   CM = 0, WnR = 0
[  206.959938] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000419f3a000
[  206.960483] [0000000000000000] pgd=0000000411a87003, pud=0000000411a83003, pmd=0000000000000000
[  206.964953] Internal error: Oops: 96000006 [#1] SMP
[  206.971122] Dumping ftrace buffer:
[  206.973677]    (ftrace buffer empty)
[  206.975258] Modules linked in:
[  206.976631] Process sh (pid: 281, stack limit = 0x(____ptrval____))
[  206.978449] CPU: 10 PID: 281 Comm: sh Not tainted 5.2.0-rc1+ #17
[  206.978955] Hardware name: linux,dummy-virt (DT)
[  206.979883] pstate: 60000005 (nZCv daif -PAN -UAO)
[  206.980499] pc : free_ftrace_func_mapper+0x2c/0x118
[  206.980874] lr : ftrace_count_free+0x68/0x80
[  206.982539] sp : ffff0000182f3ab0
[  206.983102] x29: ffff0000182f3ab0 x28: ffff8003d0ec1700
[  206.983632] x27: ffff000013054b40 x26: 0000000000000001
[  206.984000] x25: ffff00001385f000 x24: 0000000000000000
[  206.984394] x23: ffff000013453000 x22: ffff000013054000
[  206.984775] x21: 0000000000000000 x20: ffff00001385fe28
[  206.986575] x19: ffff000013872c30 x18: 0000000000000000
[  206.987111] x17: 0000000000000000 x16: 0000000000000000
[  206.987491] x15: ffffffffffffffb0 x14: 0000000000000000
[  206.987850] x13: 000000000017430e x12: 0000000000000580
[  206.988251] x11: 0000000000000000 x10: cccccccccccccccc
[  206.988740] x9 : 0000000000000000 x8 : ffff000013917550
[  206.990198] x7 : ffff000012fac2e8 x6 : ffff000012fac000
[  206.991008] x5 : ffff0000103da588 x4 : 0000000000000001
[  206.991395] x3 : 0000000000000001 x2 : ffff000013872a28
[  206.991771] x1 : 0000000000000000 x0 : 0000000000000000
[  206.992557] Call trace:
[  206.993101]  free_ftrace_func_mapper+0x2c/0x118
[  206.994827]  ftrace_count_free+0x68/0x80
[  206.995238]  release_probe+0xfc/0x1d0
[  206.995555]  register_ftrace_function_probe+0x4a8/0x868
[  206.995923]  ftrace_trace_probe_callback.isra.4+0xb8/0x180
[  206.996330]  ftrace_dump_callback+0x50/0x70
[  206.996663]  ftrace_regex_write.isra.29+0x290/0x3a8
[  206.997157]  ftrace_filter_write+0x44/0x60
[  206.998971]  __vfs_write+0x64/0xf0
[  206.999285]  vfs_write+0x14c/0x2f0
[  206.999591]  ksys_write+0xbc/0x1b0
[  206.999888]  __arm64_sys_write+0x3c/0x58
[  207.000246]  el0_svc_common.constprop.0+0x408/0x5f0
[  207.000607]  el0_svc_handler+0x144/0x1c8
[  207.000916]  el0_svc+0x8/0xc
[  207.003699] Code: aa0003f8 a9025bf5 aa0103f5 f946ea80 (f9400303)
[  207.008388] ---[ end trace 7b6d11b5f542bdf1 ]---
[  207.010126] Kernel panic - not syncing: Fatal exception
[  207.011322] SMP: stopping secondary CPUs
[  207.013956] Dumping ftrace buffer:
[  207.014595]    (ftrace buffer empty)
[  207.015632] Kernel Offset: disabled
[  207.017187] CPU features: 0x002,20006008
[  207.017985] Memory Limit: none
[  207.019825] ---[ end Kernel panic - not syncing: Fatal exception ]---

Link: http://lkml.kernel.org/r/20190606031754.10798-1-liwei391@huawei.com

Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 17:40:21 -04:00
Mauro Carvalho Chehab
151f4e2bdc docs: power: convert docs to ReST and rename to *.rst
Convert the PM documents to ReST, in order to allow them to
build with Sphinx.

The conversion is actually:
  - add blank lines and indentation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
2019-06-14 16:08:36 -05:00
Josh Poimboeuf
9f255b632b module: Fix livepatch/ftrace module text permissions race
It's possible for livepatch and ftrace to be toggling a module's text
permissions at the same time, resulting in the following panic:

  BUG: unable to handle page fault for address: ffffffffc005b1d9
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0003) - permissions violation
  PGD 3ea0c067 P4D 3ea0c067 PUD 3ea0e067 PMD 3cc13067 PTE 3b8a1061
  Oops: 0003 [#1] PREEMPT SMP PTI
  CPU: 1 PID: 453 Comm: insmod Tainted: G           O  K   5.2.0-rc1-a188339ca5 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
  RIP: 0010:apply_relocate_add+0xbe/0x14c
  Code: fa 0b 74 21 48 83 fa 18 74 38 48 83 fa 0a 75 40 eb 08 48 83 38 00 74 33 eb 53 83 38 00 75 4e 89 08 89 c8 eb 0a 83 38 00 75 43 <89> 08 48 63 c1 48 39 c8 74 2e eb 48 83 38 00 75 32 48 29 c1 89 08
  RSP: 0018:ffffb223c00dbb10 EFLAGS: 00010246
  RAX: ffffffffc005b1d9 RBX: 0000000000000000 RCX: ffffffff8b200060
  RDX: 000000000000000b RSI: 0000004b0000000b RDI: ffff96bdfcd33000
  RBP: ffffb223c00dbb38 R08: ffffffffc005d040 R09: ffffffffc005c1f0
  R10: ffff96bdfcd33c40 R11: ffff96bdfcd33b80 R12: 0000000000000018
  R13: ffffffffc005c1f0 R14: ffffffffc005e708 R15: ffffffff8b2fbc74
  FS:  00007f5f447beba8(0000) GS:ffff96bdff900000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffc005b1d9 CR3: 000000003cedc002 CR4: 0000000000360ea0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   klp_init_object_loaded+0x10f/0x219
   ? preempt_latency_start+0x21/0x57
   klp_enable_patch+0x662/0x809
   ? virt_to_head_page+0x3a/0x3c
   ? kfree+0x8c/0x126
   patch_init+0x2ed/0x1000 [livepatch_test02]
   ? 0xffffffffc0060000
   do_one_initcall+0x9f/0x1c5
   ? kmem_cache_alloc_trace+0xc4/0xd4
   ? do_init_module+0x27/0x210
   do_init_module+0x5f/0x210
   load_module+0x1c41/0x2290
   ? fsnotify_path+0x3b/0x42
   ? strstarts+0x2b/0x2b
   ? kernel_read+0x58/0x65
   __do_sys_finit_module+0x9f/0xc3
   ? __do_sys_finit_module+0x9f/0xc3
   __x64_sys_finit_module+0x1a/0x1c
   do_syscall_64+0x52/0x61
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

The above panic occurs when loading two modules at the same time with
ftrace enabled, where at least one of the modules is a livepatch module:

CPU0					CPU1
klp_enable_patch()
  klp_init_object_loaded()
    module_disable_ro()
    					ftrace_module_enable()
					  ftrace_arch_code_modify_post_process()
				    	    set_all_modules_text_ro()
      klp_write_object_relocations()
        apply_relocate_add()
	  *patches read-only code* - BOOM

A similar race exists when toggling ftrace while loading a livepatch
module.

Fix it by ensuring that the livepatch and ftrace code patching
operations -- and their respective permissions changes -- are protected
by the text_mutex.

Link: http://lkml.kernel.org/r/ab43d56ab909469ac5d2520c5d944ad6d4abd476.1560474114.git.jpoimboe@redhat.com

Reported-by: Johannes Erdfelt <johannes@erdfelt.com>
Fixes: 444d13ff10 ("modules: add ro_after_init support")
Acked-by: Jessica Yu <jeyu@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 17:01:50 -04:00
Eiichi Tsukata
a4158345ec tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
Commit 0597c49c69 ("tracing/uprobes: Use dyn_event framework for
uprobe events") cleaned up the usage of trace_uprobe_create(), and the
function has been no longer used for removing uprobe/uretprobe.

Link: http://lkml.kernel.org/r/20190614074026.8045-2-devel@etsukata.com

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 17:00:36 -04:00
Eiichi Tsukata
f01098c74b tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
Just like the case of commit 8b05a3a750 ("tracing/kprobes: Fix NULL
pointer dereference in trace_kprobe_create()"), writing an incorrectly
formatted string to uprobe_events can trigger NULL pointer dereference.

Reporeducer:

  # echo r > /sys/kernel/debug/tracing/uprobe_events

dmesg:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 8000000079d12067 P4D 8000000079d12067 PUD 7b7ab067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 1903 Comm: bash Not tainted 5.2.0-rc3+ #15
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  RIP: 0010:strchr+0x0/0x30
  Code: c0 eb 0d 84 c9 74 18 48 83 c0 01 48 39 d0 74 0f 0f b6 0c 07 3a 0c 06 74 ea 19 c0 83 c8 01 c3 31 c0 c3 0f 1f 84 00 00 00 00 00 <0f> b6 07 89 f2 40 38 f0 75 0e eb 13 0f b6 47 01 48 83 c
  RSP: 0018:ffffb55fc0403d10 EFLAGS: 00010293

  RAX: ffff993ffb793400 RBX: 0000000000000000 RCX: ffffffffa4852625
  RDX: 0000000000000000 RSI: 000000000000002f RDI: 0000000000000000
  RBP: ffffb55fc0403dd0 R08: ffff993ffb793400 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff993ff9cc1668 R14: 0000000000000001 R15: 0000000000000000
  FS:  00007f30c5147700(0000) GS:ffff993ffda00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000007b628000 CR4: 00000000000006f0
  Call Trace:
   trace_uprobe_create+0xe6/0xb10
   ? __kmalloc_track_caller+0xe6/0x1c0
   ? __kmalloc+0xf0/0x1d0
   ? trace_uprobe_create+0xb10/0xb10
   create_or_delete_trace_uprobe+0x35/0x90
   ? trace_uprobe_create+0xb10/0xb10
   trace_run_command+0x9c/0xb0
   trace_parse_run_command+0xf9/0x1eb
   ? probes_open+0x80/0x80
   __vfs_write+0x43/0x90
   vfs_write+0x14a/0x2a0
   ksys_write+0xa2/0x170
   do_syscall_64+0x7f/0x200
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/20190614074026.8045-1-devel@etsukata.com

Cc: stable@vger.kernel.org
Fixes: 0597c49c69 ("tracing/uprobes: Use dyn_event framework for uprobe events")
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:51:14 -04:00
YueHaibing
ff585c5b9a tracing: Make two symbols static
Fix sparse warnings:

kernel/trace/trace.c:6927:24: warning:
 symbol 'get_tracing_log_err' was not declared. Should it be static?
kernel/trace/trace.c:8196:15: warning:
 symbol 'trace_instance_dir' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20190614153210.24424-1-yuehaibing@huawei.com

Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:49:26 -04:00
Vasily Gorbik
cbdaeaf050 tracing: avoid build warning with HAVE_NOP_MCOUNT
Selecting HAVE_NOP_MCOUNT enables -mnop-mcount (if gcc supports it)
and sets CC_USING_NOP_MCOUNT. Reuse __is_defined (which is suitable for
testing CC_USING_* defines) to avoid conditional compilation and fix
the following gcc 9 warning on s390:

kernel/trace/ftrace.c:2514:1: warning: ‘ftrace_code_disable’ defined
but not used [-Wunused-function]

Link: http://lkml.kernel.org/r/patch.git-1a82d13f33ac.your-ad-here.call-01559732716-ext-6629@work.hours

Fixes: 2f4df0017b ("tracing: Add -mcount-nop option support")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:34:57 -04:00
Mauro Carvalho Chehab
d6a3b24762 docs: scheduler: convert docs to ReST and rename to *.rst
In order to prepare to add them to the Kernel API book,
convert the files to ReST format.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-14 14:32:18 -06:00
Mauro Carvalho Chehab
99c8b231ae docs: cgroup-v1: convert docs to ReST and rename to *.rst
Convert the cgroup-v1 files to ReST format, in order to
allow a later addition to the admin-guide.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-14 13:29:54 -07:00
Eiichi Tsukata
becf33f694 tracing: Fix out-of-range read in trace_stack_print()
Puts range check before dereferencing the pointer.

Reproducer:

  # echo stacktrace > trace_options
  # echo 1 > events/enable
  # cat trace > /dev/null

KASAN report:

  ==================================================================
  BUG: KASAN: use-after-free in trace_stack_print+0x26b/0x2c0
  Read of size 8 at addr ffff888069d20000 by task cat/1953

  CPU: 0 PID: 1953 Comm: cat Not tainted 5.2.0-rc3+ #5
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  Call Trace:
   dump_stack+0x8a/0xce
   print_address_description+0x60/0x224
   ? trace_stack_print+0x26b/0x2c0
   ? trace_stack_print+0x26b/0x2c0
   __kasan_report.cold+0x1a/0x3e
   ? trace_stack_print+0x26b/0x2c0
   kasan_report+0xe/0x20
   trace_stack_print+0x26b/0x2c0
   print_trace_line+0x6ea/0x14d0
   ? tracing_buffers_read+0x700/0x700
   ? trace_find_next_entry_inc+0x158/0x1d0
   s_show+0xea/0x310
   seq_read+0xaa7/0x10e0
   ? seq_escape+0x230/0x230
   __vfs_read+0x7c/0x100
   vfs_read+0x16c/0x3a0
   ksys_read+0x121/0x240
   ? kernel_write+0x110/0x110
   ? perf_trace_sys_enter+0x8a0/0x8a0
   ? syscall_slow_exit_work+0xa9/0x410
   do_syscall_64+0xb7/0x390
   ? prepare_exit_to_usermode+0x165/0x200
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f867681f910
  Code: b6 fe ff ff 48 8d 3d 0f be 08 00 48 83 ec 08 e8 06 db 01 00 66 0f 1f 44 00 00 83 3d f9 2d 2c 00 00 75 10 b8 00 00 00 00 04
  RSP: 002b:00007ffdabf23488 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f867681f910
  RDX: 0000000000020000 RSI: 00007f8676cde000 RDI: 0000000000000003
  RBP: 00007f8676cde000 R08: ffffffffffffffff R09: 0000000000000000
  R10: 0000000000000871 R11: 0000000000000246 R12: 00007f8676cde000
  R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000000ec0

  Allocated by task 1214:
   save_stack+0x1b/0x80
   __kasan_kmalloc.constprop.0+0xc2/0xd0
   kmem_cache_alloc+0xaf/0x1a0
   getname_flags+0xd2/0x5b0
   do_sys_open+0x277/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 1214:
   save_stack+0x1b/0x80
   __kasan_slab_free+0x12c/0x170
   kmem_cache_free+0x8a/0x1c0
   putname+0xe1/0x120
   do_sys_open+0x2c5/0x5a0
   do_syscall_64+0xb7/0x390
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff888069d20000
   which belongs to the cache names_cache of size 4096
  The buggy address is located 0 bytes inside of
   4096-byte region [ffff888069d20000, ffff888069d21000)
  The buggy address belongs to the page:
  page:ffffea0001a74800 refcount:1 mapcount:0 mapping:ffff88806ccd1380 index:0x0 compound_mapcount: 0
  flags: 0x100000000010200(slab|head)
  raw: 0100000000010200 dead000000000100 dead000000000200 ffff88806ccd1380
  raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff888069d1ff00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   ffff888069d1ff80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  >ffff888069d20000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                     ^
   ffff888069d20080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff888069d20100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Link: http://lkml.kernel.org/r/20190610040016.5598-1-devel@etsukata.com

Fixes: 4285f2fcef ("tracing: Remove the ULONG_MAX stack trace hackery")
Signed-off-by: Eiichi Tsukata <devel@etsukata.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-06-14 16:28:42 -04:00
Tejun Heo
38cf3a687f cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
a5e112e642 ("cgroup: add cgroup_parse_float()") accidentally added
cgroup_parse_float() inside CONFIG_SYSFS block.  Move it outside so
that it doesn't cause failures on !CONFIG_SYSFS builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: a5e112e642 ("cgroup: add cgroup_parse_float()")
2019-06-14 10:14:44 -07:00
Yangtao Li
141e1ecda3 alarmtimer: Fix kerneldoc comment for alarmtimer_suspend()
This brings the kernel doc in line with the function signature.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Cc: sboyd@kernel.org
Link: https://lkml.kernel.org/r/20190525183925.18963-1-tiny.windzz@gmail.com
2019-06-14 17:04:04 +02:00
Mathieu Malaterre
0f48b41f59 clocksource: Move inline keyword to the beginning of function declarations
The inline keyword was not at the beginning of the function declarations.
Fix the following warnings triggered when using W=1:

  kernel/time/clocksource.c:108:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
  kernel/time/clocksource.c:113:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: trivial@kernel.org
Cc: kernel-janitors@vger.kernel.org
Cc: John Stultz <john.stultz@linaro.org>
Cc: Stephen Boyd <sboyd@kernel.org>
Link: https://lkml.kernel.org/r/20190524103339.28787-1-malat@debian.org
2019-06-14 17:04:03 +02:00
Florian Fainelli
4b4b077cbd dma-remap: Avoid de-referencing NULL atomic_pool
With architectures allowing the kernel to be placed almost arbitrarily
in memory (e.g.: ARM64), it is possible to have the kernel resides at
physical addresses above 4GB, resulting in neither the default CMA area,
nor the atomic pool from successfully allocating. This does not prevent
specific peripherals from working though, one example is XHCI, which
still operates correctly.

Trouble comes when the XHCI driver gets suspended and resumed, since we
can now trigger the following NPD:

[   12.664170] usb usb1: root hub lost power or was reset
[   12.669387] usb usb2: root hub lost power or was reset
[   12.674662] Unable to handle kernel NULL pointer dereference at virtual address 00000008
[   12.682896] pgd = ffffffc1365a7000
[   12.686386] [00000008] *pgd=0000000136500003, *pud=0000000136500003, *pmd=0000000000000000
[   12.694897] Internal error: Oops: 96000006 [#1] SMP
[   12.699843] Modules linked in:
[   12.702980] CPU: 0 PID: 1499 Comm: pml Not tainted 4.9.135-1.13pre #51
[   12.709577] Hardware name: BCM97268DV (DT)
[   12.713736] task: ffffffc136bb6540 task.stack: ffffffc1366cc000
[   12.719740] PC is at addr_in_gen_pool+0x4/0x48
[   12.724253] LR is at __dma_free+0x64/0xbc
[   12.728325] pc : [<ffffff80083c0df8>] lr : [<ffffff80080979e0>] pstate: 60000145
[   12.735825] sp : ffffffc1366cf990
[   12.739196] x29: ffffffc1366cf990 x28: ffffffc1366cc000
[   12.744608] x27: 0000000000000000 x26: ffffffc13a8568c8
[   12.750020] x25: 0000000000000000 x24: ffffff80098f9000
[   12.755433] x23: 000000013a5ff000 x22: ffffff8009c57000
[   12.760844] x21: ffffffc13a856810 x20: 0000000000000000
[   12.766255] x19: 0000000000001000 x18: 000000000000000a
[   12.771667] x17: 0000007f917553e0 x16: 0000000000001002
[   12.777078] x15: 00000000000a36cb x14: ffffff80898feb77
[   12.782490] x13: ffffffffffffffff x12: 0000000000000030
[   12.787899] x11: 00000000fffffffe x10: ffffff80098feb7f
[   12.793311] x9 : 0000000005f5e0ff x8 : 65776f702074736f
[   12.798723] x7 : 6c2062756820746f x6 : ffffff80098febb1
[   12.804134] x5 : ffffff800809797c x4 : 0000000000000000
[   12.809545] x3 : 000000013a5ff000 x2 : 0000000000000fff
[   12.814955] x1 : ffffff8009c57000 x0 : 0000000000000000
[   12.820363]
[   12.821907] Process pml (pid: 1499, stack limit = 0xffffffc1366cc020)
[   12.828421] Stack: (0xffffffc1366cf990 to 0xffffffc1366d0000)
[   12.834240] f980:                                   ffffffc1366cf9e0 ffffff80086004d0
[   12.842186] f9a0: ffffffc13ab08238 0000000000000010 ffffff80097c2218 ffffffc13a856810
[   12.850131] f9c0: ffffff8009c57000 000000013a5ff000 0000000000000008 000000013a5ff000
[   12.858076] f9e0: ffffffc1366cfa50 ffffff80085f9250 ffffffc13ab08238 0000000000000004
[   12.866021] fa00: ffffffc13ab08000 ffffff80097b6000 ffffffc13ab08130 0000000000000001
[   12.873966] fa20: 0000000000000008 ffffffc13a8568c8 0000000000000000 ffffffc1366cc000
[   12.881911] fa40: ffffffc13ab08130 0000000000000001 ffffffc1366cfa90 ffffff80085e3de8
[   12.889856] fa60: ffffffc13ab08238 0000000000000000 ffffffc136b75b00 0000000000000000
[   12.897801] fa80: 0000000000000010 ffffff80089ccb92 ffffffc1366cfac0 ffffff80084ad040
[   12.905746] faa0: ffffffc13a856810 0000000000000000 ffffff80084ad004 ffffff80084b91a8
[   12.913691] fac0: ffffffc1366cfae0 ffffff80084b91b4 ffffffc13a856810 ffffff80080db5cc
[   12.921636] fae0: ffffffc1366cfb20 ffffff80084b96bc ffffffc13a856810 0000000000000010
[   12.929581] fb00: ffffffc13a856870 0000000000000000 ffffffc13a856810 ffffff800984d2b8
[   12.937526] fb20: ffffffc1366cfb50 ffffff80084baa70 ffffff8009932ad0 ffffff800984d260
[   12.945471] fb40: 0000000000000010 00000002eff0a065 ffffffc1366cfbb0 ffffff80084bafbc
[   12.953415] fb60: 0000000000000010 0000000000000003 ffffff80098fe000 0000000000000000
[   12.961360] fb80: ffffff80097b6000 ffffff80097b6dc8 ffffff80098c12b8 ffffff80098c12f8
[   12.969306] fba0: ffffff8008842000 ffffff80097b6dc8 ffffffc1366cfbd0 ffffff80080e0d88
[   12.977251] fbc0: 00000000fffffffb ffffff80080e10bc ffffffc1366cfc60 ffffff80080e16a8
[   12.985196] fbe0: 0000000000000000 0000000000000003 ffffff80097b6000 ffffff80098fe9f0
[   12.993140] fc00: ffffff80097d4000 ffffff8008983802 0000000000000123 0000000000000040
[   13.001085] fc20: ffffff8008842000 ffffffc1366cc000 ffffff80089803c2 00000000ffffffff
[   13.009029] fc40: 0000000000000000 0000000000000000 ffffffc1366cfc60 0000000000040987
[   13.016974] fc60: ffffffc1366cfcc0 ffffff80080dfd08 0000000000000003 0000000000000004
[   13.024919] fc80: 0000000000000003 ffffff80098fea08 ffffffc136577ec0 ffffff80089803c2
[   13.032864] fca0: 0000000000000123 0000000000000001 0000000500000002 0000000000040987
[   13.040809] fcc0: ffffffc1366cfd00 ffffff80083a89d4 0000000000000004 ffffffc136577ec0
[   13.048754] fce0: ffffffc136610cc0 ffffffffffffffea ffffffc1366cfeb0 ffffffc136610cd8
[   13.056700] fd00: ffffffc1366cfd10 ffffff800822a614 ffffffc1366cfd40 ffffff80082295d4
[   13.064645] fd20: 0000000000000004 ffffffc136577ec0 ffffffc136610cc0 0000000021670570
[   13.072590] fd40: ffffffc1366cfd80 ffffff80081b5d10 ffffff80097b6000 ffffffc13aae4200
[   13.080536] fd60: ffffffc1366cfeb0 0000000000000004 0000000021670570 0000000000000004
[   13.088481] fd80: ffffffc1366cfe30 ffffff80081b6b20 ffffffc13aae4200 0000000000000000
[   13.096427] fda0: 0000000000000004 0000000021670570 ffffffc1366cfeb0 ffffffc13a838200
[   13.104371] fdc0: 0000000000000000 000000000000000a ffffff80097b6000 0000000000040987
[   13.112316] fde0: ffffffc1366cfe20 ffffff80081b3af0 ffffffc13a838200 0000000000000000
[   13.120261] fe00: ffffffc1366cfe30 ffffff80081b6b0c ffffffc13aae4200 0000000000000000
[   13.128206] fe20: 0000000000000004 0000000000040987 ffffffc1366cfe70 ffffff80081b7dd8
[   13.136151] fe40: ffffff80097b6000 ffffffc13aae4200 ffffffc13aae4200 fffffffffffffff7
[   13.144096] fe60: 0000000021670570 ffffffc13a8c63c0 0000000000000000 ffffff8008083180
[   13.152042] fe80: ffffffffffffff1d 0000000021670570 ffffffffffffffff 0000007f917ad9b8
[   13.159986] fea0: 0000000020000000 0000000000000015 0000000000000000 0000000000040987
[   13.167930] fec0: 0000000000000001 0000000021670570 0000000000000004 0000000000000000
[   13.175874] fee0: 0000000000000888 0000440110000000 000000000000006d 0000000000000003
[   13.183819] ff00: 0000000000000040 ffffff80ffffffc8 0000000000000000 0000000000000020
[   13.191762] ff20: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
[   13.199707] ff40: 0000000000000000 0000007f917553e0 0000000000000000 0000000000000004
[   13.207651] ff60: 0000000021670570 0000007f91835480 0000000000000004 0000007f91831638
[   13.215595] ff80: 0000000000000004 00000000004b0de0 00000000004b0000 0000000000000000
[   13.223539] ffa0: 0000000000000000 0000007fc92ac8c0 0000007f9175d178 0000007fc92ac8c0
[   13.231483] ffc0: 0000007f917ad9b8 0000000020000000 0000000000000001 0000000000000040
[   13.239427] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[   13.247360] Call trace:
[   13.249866] Exception stack(0xffffffc1366cf7a0 to 0xffffffc1366cf8d0)
[   13.256386] f7a0: 0000000000001000 0000007fffffffff ffffffc1366cf990 ffffff80083c0df8
[   13.264331] f7c0: 0000000060000145 ffffff80089b5001 ffffffc13ab08130 0000000000000001
[   13.272275] f7e0: 0000000000000008 ffffffc13a8568c8 0000000000000000 0000000000000000
[   13.280220] f800: ffffffc1366cf960 ffffffc1366cf960 ffffffc1366cf930 00000000ffffffd8
[   13.288165] f820: ffffff8009931ac0 4554535953425553 4544006273753d4d 3831633d45434956
[   13.296110] f840: ffff003832313a39 ffffff800845926c ffffffc1366cf880 0000000000040987
[   13.304054] f860: 0000000000000000 ffffff8009c57000 0000000000000fff 000000013a5ff000
[   13.311999] f880: 0000000000000000 ffffff800809797c ffffff80098febb1 6c2062756820746f
[   13.319944] f8a0: 65776f702074736f 0000000005f5e0ff ffffff80098feb7f 00000000fffffffe
[   13.327884] f8c0: 0000000000000030 ffffffffffffffff
[   13.332835] [<ffffff80083c0df8>] addr_in_gen_pool+0x4/0x48
[   13.338398] [<ffffff80086004d0>] xhci_mem_cleanup+0xc8/0x51c
[   13.344137] [<ffffff80085f9250>] xhci_resume+0x308/0x65c
[   13.349524] [<ffffff80085e3de8>] xhci_brcm_resume+0x84/0x8c
[   13.355174] [<ffffff80084ad040>] platform_pm_resume+0x3c/0x64
[   13.360997] [<ffffff80084b91b4>] dpm_run_callback+0x5c/0x15c
[   13.366732] [<ffffff80084b96bc>] device_resume+0xc0/0x190
[   13.372205] [<ffffff80084baa70>] dpm_resume+0x144/0x2cc
[   13.377504] [<ffffff80084bafbc>] dpm_resume_end+0x20/0x34
[   13.382980] [<ffffff80080e0d88>] suspend_devices_and_enter+0x104/0x704
[   13.389585] [<ffffff80080e16a8>] pm_suspend+0x320/0x53c
[   13.394881] [<ffffff80080dfd08>] state_store+0xbc/0xe0
[   13.400094] [<ffffff80083a89d4>] kobj_attr_store+0x14/0x24
[   13.405655] [<ffffff800822a614>] sysfs_kf_write+0x60/0x70
[   13.411128] [<ffffff80082295d4>] kernfs_fop_write+0x130/0x194
[   13.416954] [<ffffff80081b5d10>] __vfs_write+0x60/0x150
[   13.422254] [<ffffff80081b6b20>] vfs_write+0xc8/0x164
[   13.427376] [<ffffff80081b7dd8>] SyS_write+0x70/0xc8
[   13.432412] [<ffffff8008083180>] el0_svc_naked+0x34/0x38
[   13.437800] Code: 92800173 97f6fb9e 17fffff5 d1000442 (f8408c03)
[   13.444033] ---[ end trace 2effe12f909ce205 ]---

The call path leading to this problem is xhci_mem_cleanup() ->
dma_free_coherent() -> dma_free_from_pool() -> addr_in_gen_pool. If the
atomic_pool is NULL, we can't possibly have the address in the atomic
pool anyway, so guard against that.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-14 14:30:21 +02:00
Thomas Gleixner
e3ff9c3678 timekeeping: Repair ktime_get_coarse*() granularity
Jason reported that the coarse ktime based time getters advance only once
per second and not once per tick as advertised.

The code reads only the monotonic base time, which advances once per
second. The nanoseconds are accumulated on every tick in xtime_nsec up to
a second and the regular time getters take this nanoseconds offset into
account, but the ktime_get_coarse*() implementation fails to do so.

Add the accumulated xtime_nsec value to the monotonic base time to get the
proper per tick advancing coarse tinme.

Fixes: b9ff604cff ("timekeeping: Add ktime_get_coarse_with_offset")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Sultan Alsawaf <sultan@kerneltoast.com>
Cc: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906132136280.1791@nanos.tec.linutronix.de
2019-06-14 11:51:44 +02:00
Mathieu Malaterre
1ec0cd8286 PM: hibernate: powerpc: Expose pfn_is_nosave() prototype
The declaration for pfn_is_nosave is only available in
kernel/power/power.h. Since this function can be override in arch,
expose it globally. Having a prototype will make sure to avoid warning
(sometime treated as error with W=1) such as:

  arch/powerpc/kernel/suspend.c:18:5: error: no previous prototype for 'pfn_is_nosave' [-Werror=missing-prototypes]

This moves the declaration into a globally visible header file and add
missing include to avoid a warning on powerpc.

Also remove the duplicated prototypes since not required anymore.

Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-14 10:48:56 +02:00
YueHaibing
bc6f2a757d kernel/module: Fix mem leak in module_add_modinfo_attrs
In module_add_modinfo_attrs if sysfs_create_file
fails, we forget to free allocated modinfo_attrs
and roll back the sysfs files.

Fixes: 03e88ae1b1 ("[PATCH] fix module sysfs files reference counting")
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-14 09:31:09 +02:00
Dan Williams
50f44ee724 mm/devm_memremap_pages: fix final page put race
Logan noticed that devm_memremap_pages_release() kills the percpu_ref
drops all the page references that were acquired at init and then
immediately proceeds to unplug, arch_remove_memory(), the backing pages
for the pagemap.  If for some reason device shutdown actually collides
with a busy / elevated-ref-count page then arch_remove_memory() should
be deferred until after that reference is dropped.

As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.

Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback.  Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.

Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 41e94a8513 ("add devm_memremap_pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Dan Williams
2e3f139e8e mm/devm_memremap_pages: introduce devm_memunmap_pages
Use the new devm_release_action() facility to allow
devm_memremap_pages_release() to be manually triggered.

Link: http://lkml.kernel.org/r/155727337088.292046.5774214552136776763.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Paul E. McKenney
96050c68be rcu: Upgrade sync_exp_work_done() to smp_mb()
The sync_exp_work_done() function uses smp_mb__before_atomic(), but
there is no obvious atomic in the ensuing code.  The ordering is
absolutely required for grace periods to work correctly, so this
commit upgrades the smp_mb__before_atomic() to smp_mb().

Fixes: 6fba2b3767 ("rcu: Remove deprecated RCU debugfs tracing code")
Reported-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-06-13 15:33:19 -07:00
Joel Savitz
d477f8c202 cpuset: restore sanity to cpuset_cpus_allowed_fallback()
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,fffffff
	Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

	# taskset -p 4 $$
	pid 2272's current affinity mask: ffffffffffffffff
	pid 2272's new affinity mask: 4

	# echo off > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -3
	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
	[ 2195.872700] IRQ 114: no longer affine to CPU2
	[ 2195.879128] smpboot: CPU 2 is now offline

	# echo on > /sys/devices/system/cpu/cpu2/online
	# dmesg | tail -1
	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:   ffffffff,fffffffb
	Cpus_allowed_list:      0-1,3-63

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

	# taskset -p f000000000 $$
	# grep -i cpu /proc/$$/status
	Cpus_allowed:	000000f0,00000000
	Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

	# echo off > /sys/devices/system/cpu/smt/control
	# echo on > /sys/devices/system/cpu/smt/control
	# grep -i cpus /proc/$$/status
	Cpus_allowed:	ffffff00,ffffffff
	Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

	# grep -i cpu /proc/$$/status
	Cpus_allowed:	ffffffff,ffffffff
	Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long <longman@redhat.com>
Suggested-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Acked-by: Phil Auld <pauld@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-06-12 11:00:08 -07:00
Valdis Klētnieks
aee450cbe4 bpf: silence warning messages in core
Compiling kernel/bpf/core.c with W=1 causes a flood of warnings:

kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init]
 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
      |                                                                 ^~~~
kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
 1087 |  INSN_3(ALU, ADD,  X),   \
      |  ^~~~~~
kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
 1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
      |   ^~~~~~~~~~~~
kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]')
 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
      |                                                                 ^~~~
kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
 1087 |  INSN_3(ALU, ADD,  X),   \
      |  ^~~~~~
kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
 1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
      |   ^~~~~~~~~~~~

98 copies of the above.

The attached patch silences the warnings, because we *know* we're overwriting
the default initializer. That leaves bpf/core.c with only 6 other warnings,
which become more visible in comparison.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-12 16:51:02 +02:00
Pavankumar Kondeti
a66d955e91 cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending
When "deep" suspend is enabled, all CPUs except the primary CPU are frozen
via CPU hotplug one by one. After all secondary CPUs are unplugged the
wakeup pending condition is evaluated and if pending the suspend operation
is aborted and the secondary CPUs are brought up again.

CPU hotplug is a slow operation, so it makes sense to check for wakeup
pending in the freezer loop before bringing down the next CPU. This
improves the system suspend abort latency significantly.

[ tglx: Massaged changelog and improved printk message ]

Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: iri Kosina <jkosina@suse.cz>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: linux-pm@vger.kernel.org
Link: https://lkml.kernel.org/r/1559536263-16472-1-git-send-email-pkondeti@codeaurora.org
2019-06-12 11:03:05 +02:00
Minwoo Im
0e51833042 genirq/affinity: Remove unused argument from [__]irq_build_affinity_masks()
The *affd argument is neither used in irq_build_affinity_masks() nor
__irq_build_affinity_masks(). Remove it.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Minwoo Im <minwoo.im@samsung.com>
Cc: linux-block@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602112117.31839-1-minwoo.im.dev@gmail.com
2019-06-12 10:52:45 +02:00
Daniel Lezcano
699785f5d8 genirq/timings: Add selftest for next event computation
The circular buffers are now validated with selftests. The next interrupt
index algorithm which is the hardest part to validate needs extra coverage.

Add a selftest which uses the intervals stored in the arrays and insert all
the values except the last one. The next event computation must return the
same value as the last element which was not inserted.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-9-daniel.lezcano@linaro.org
2019-06-12 10:47:05 +02:00
Daniel Lezcano
f52da98d90 genirq/timings: Add selftest for irqs circular buffer
After testing the per cpu interrupt circular event, make sure the per
interrupt circular buffer usage is correct.

Add tests to validate the interrupt circular buffer.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-8-daniel.lezcano@linaro.org
2019-06-12 10:47:05 +02:00
Daniel Lezcano
6aed82de71 genirq/timings: Add selftest for circular array
Due to the complexity of the code and the difficulty to debug it, add some
selftests to the framework in order to spot issues or regression at boot
time when the runtime testing is enabled for this subsystem.

This tests the circular buffer at the limits and validates:
 - the encoding / decoding of the values
 - the macro to browse the irq timings circular buffer
 - the function to push data in the circular buffer

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-7-daniel.lezcano@linaro.org
2019-06-12 10:47:04 +02:00
Daniel Lezcano
23aa3b9a6b genirq/timings: Encapsulate storing function
For the next patches providing the selftest, it is required to insert
interval values directly in the buffer in order to check the correctness of
the code. Encapsulate the code doing that in a always inline function in
order to reuse it in the test code.

No functional changes.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-6-daniel.lezcano@linaro.org
2019-06-12 10:47:04 +02:00
Daniel Lezcano
df025e47e4 genirq/timings: Encapsulate timings push
For the next patches providing the selftest, it is required to artificially
insert timings value in the circular buffer in order to check the
correctness of the code. Encapsulate the common code between the future
test code and the current code with an always-inline tag.

No functional change.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-5-daniel.lezcano@linaro.org
2019-06-12 10:47:04 +02:00
Daniel Lezcano
3c2e79f4ce genirq/timings: Optimize the period detection speed
With a minimal period and if there is a period which is a multiple of it
but lesser than the max period then it will be detected before and the
minimal period will be never reached.

 1 2 1 2 1 2 1 2 1 2 1 2
 <-----> <-----> <----->
 <-> <-> <-> <-> <-> <->

In that case, the minimum period is 2 and the maximum period is 5. That
means all repeating pattern of 2 will be detected as repeating pattern of
4, it is pointless to go up to 2 when searching for the period as it will
always fail.

Remove one loop iteration by increasing the minimal period to 3.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-4-daniel.lezcano@linaro.org
2019-06-12 10:47:03 +02:00
Daniel Lezcano
2840eef051 genirq/timings: Fix timings buffer inspection
It appears the index beginning computation is not correct, the current
code does:

     i = (irqts->count & IRQ_TIMINGS_MASK) - 1

If irqts->count is equal to zero, we end up with an index equal to -1,
but that does not happen because the function checks against zero
before and returns in such case.

However, if irqts->count is a multiple of IRQ_TIMINGS_SIZE, the
resulting & bit op will be zero and leads also to a -1 index.

Re-introduce the iteration loop belonging to the previous variance
code which was correct.

Fixes: bbba0e7c5c "genirq/timings: Add array suffix computation code"
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-3-daniel.lezcano@linaro.org
2019-06-12 10:47:03 +02:00
Daniel Lezcano
619c1baa91 genirq/timings: Fix next event index function
The current code is luckily working with most of the interval samples
testing but actually it fails to correctly detect pattern repetition
breaking at the end of the buffer.

Narrowing down the bug has been a real pain because of the pointers,
so the routine is rewrittne by using indexes instead.

Fixes: bbba0e7c5c "genirq/timings: Add array suffix computation code"
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20190527205521.12091-2-daniel.lezcano@linaro.org
2019-06-12 10:47:03 +02:00
Yangtao Li
0e5aa23282 hrtimer: Remove unused header include
seq_file.h does not need to be included, so remove it.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190607174253.27403-1-tiny.windzz@gmail.com
2019-06-12 10:21:17 +02:00
Linus Torvalds
aa7235483a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace fixes from Eric Biederman:
 "This is just two very minor fixes:

   - prevent ptrace from reading unitialized kernel memory found twice
     by syzkaller

   - restore a missing smp_rmb in ptrace_may_access and add comment tp
     it so it is not removed by accident again.

  Apologies for being a little slow about getting this to you, I am
  still figuring out how to develop with a little baby in the house"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ptrace: restore smp_rmb() in __ptrace_may_access()
  signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO
2019-06-11 15:44:45 -10:00
Jann Horn
f6581f5b55 ptrace: restore smp_rmb() in __ptrace_may_access()
Restore the read memory barrier in __ptrace_may_access() that was deleted
a couple years ago. Also add comments on this barrier and the one it pairs
with to explain why they're there (as far as I understand).

Fixes: bfedb58925 ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks")
Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2019-06-11 15:08:28 -05:00
Florian Fainelli
4aa095ea32 swiotlb: Return consistent SWIOTLB segments/nr_tbl
With a specifically contrived memory layout where there is no physical
memory available to the kernel below the 4GB boundary, we will fail to
perform the initial swiotlb_init() call and set no_iotlb_memory to true.

There are drivers out there that call into swiotlb_nr_tbl() to determine
whether they can use the SWIOTLB. With the right DMA_BIT_MASK() value
for these drivers (say 64-bit), they won't ever need to hit
swiotlb_tbl_map_single() so this can go unoticed and we would be
possibly lying about those drivers.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-06-11 14:36:33 -04:00
Florian Fainelli
0bfaffbf4c swiotlb: Group identical cleanup in swiotlb_cleanup()
Avoid repeating the zeroing of global swiotlb variables in two locations
and introduce swiotlb_cleanup() to do that.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-06-11 14:36:33 -04:00
Jonathan Lemon
da2577fdd0 bpf: lpm_trie: check left child of last leftmost node for NULL
If the leftmost parent node of the tree has does not have a child
on the left side, then trie_get_next_key (and bpftool map dump) will
not look at the child on the right.  This leads to the traversal
missing elements.

Lookup is not affected.

Update selftest to handle this case.

Reproducer:

 bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
     value 1 entries 256 name test_lpm flags 1
 bpftool map update pinned /sys/fs/bpf/lpm key  8 0 0 0  0   0 value 1
 bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0  0 128 value 2
 bpftool map dump   pinned /sys/fs/bpf/lpm

Returns only 1 element. (2 expected)

Fixes: b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-11 13:52:37 +02:00
Jonathan Lemon
fada7fdc83 bpf: Allow bpf_map_lookup_elem() on an xskmap
Currently, the AF_XDP code uses a separate map in order to
determine if an xsk is bound to a queue.  Instead of doing this,
have bpf_map_lookup_elem() return a xdp_sock.

Rearrange some xdp_sock members to eliminate structure holes.

Remove selftest - will be added back in later patch.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-10 23:31:26 -07:00
Tejun Heo
11dc8b4011 Merge branch 'for-5.2-fixes' into for-5.3 2019-06-10 09:12:39 -07:00
Tejun Heo
c596687a00 cgroup: Fix css_task_iter_advance_css_set() cset skip condition
While adding handling for dying task group leaders c03cd7738a
("cgroup: Include dying leaders with live threads in PROCS
iterations") added an inverted cset skip condition to
css_task_iter_advance_css_set().  It should skip cset if it's
completely empty but was incorrectly testing for the inverse condition
for the dying_tasks list.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: c03cd7738a ("cgroup: Include dying leaders with live threads in PROCS iterations")
Reported-by: syzbot+d4bba5ccd4f9a2a68681@syzkaller.appspotmail.com
2019-06-10 09:08:27 -07:00
Jason Gunthorpe
c8a53b2db0 mm/hmm: Hold a mmgrab from hmm to mm
So long as a struct hmm pointer exists, so should the struct mm it is
linked too. Hold the mmgrab() as soon as a hmm is created, and mmdrop() it
once the hmm refcount goes to zero.

Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL
mm->hmm delete the hmm_hmm_destroy().

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
2019-06-10 10:10:33 -03:00
Jens Axboe
cf8929885d cgroup/bfq: revert bfq.weight symlink change
There's some discussion on how to do this the best, and Tejun prefers
that BFQ just create the file itself instead of having cgroups support
a symlink feature.

Hence revert commit 54b7b868e8 and 19e9da9e86 for 5.2, and this
can be done properly for 5.3.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-10 03:35:41 -06:00
Christian Brauner
7f192e3cd3
fork: add clone3
This adds the clone3 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone3().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recyclable as we have identified at least two large or widely used
codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
blocked. clone3() has the advantage that it will unblock this patchset
again. In general, clone3() is extensible and allows for the implementation
of new features.

The idea is to keep clone3() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward;
independent of spawn().
The following requirements guided clone3():
- bump the number of available flags
- move arguments that are currently passed as separate arguments
  in clone() into a dedicated struct clone_args
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
  - use a separate kernel internal struct kernel_clone_args that is
    properly typed according to current kernel conventions in fork.c and is
    different from  the uapi struct clone_args
- port _do_fork() to use kernel_clone_args so that all process creation
  syscalls such as fork(), vfork(), clone(), and clone3() behave identical
  (Arnd suggested, that we can probably also port do_fork() itself in a
   separate patchset.)
- ease of transition for userspace from clone() to clone3()
  This very much means that we do *not* remove functionality that userspace
  currently relies on as the latter is a good way of creating a syscall
  that won't be adopted.
- do not try to be clever or complex: keep clone3() as dumb as possible

In accordance with Linus suggestions (cf. [11]), clone3() has the following
signature:

/* uapi */
struct clone_args {
        __aligned_u64 flags;
        __aligned_u64 pidfd;
        __aligned_u64 child_tid;
        __aligned_u64 parent_tid;
        __aligned_u64 exit_signal;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

/* kernel internal */
struct kernel_clone_args {
        u64 flags;
        int __user *pidfd;
        int __user *child_tid;
        int __user *parent_tid;
        int exit_signal;
        unsigned long stack;
        unsigned long stack_size;
        unsigned long tls;
};

long sys_clone3(struct clone_args __user *uargs, size_t size)

clone3() cleanly supports all of the supported flags from clone() and thus
all legacy workloads.
The advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone3() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone3().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone3() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* User visible differences to legacy clone() */
- CLONE_DETACHED will cause EINVAL with clone3()
- CSIGNAL is deprecated
  It is superseeded by a dedicated "exit_signal" argument in struct
  clone_args freeing up space for additional flags.
  This is based on a suggestion from Andrei and Linus (cf. [9] and [10])

/* References */
[1]: b3e5838252
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
[9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
[10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
[11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2019-06-09 09:29:28 +02:00
Linus Torvalds
9331b6740f SPDX update for 5.2-rc4
Another round of SPDX header file fixes for 5.2-rc4
 
 These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
 added, based on the text in the files.  We are slowly chipping away at
 the 700+ different ways people tried to write the license text.  All of
 these were reviewed on the spdx mailing list by a number of different
 people.
 
 We now have over 60% of the kernel files covered with SPDX tags:
 	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
 	Files checked:            64533
 	Files with SPDX:          40392
 	Files with errors:            0
 
 I think the majority of the "easy" fixups are now done, it's now the
 start of the longer-tail of crazy variants to wade through.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPuGTg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykBvQCg2SG+HmDH+tlwKLT/q7jZcLMPQigAoMpt9Uuy
 sxVEiFZo8ZU9v1IoRb1I
 =qU++
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull yet more SPDX updates from Greg KH:
 "Another round of SPDX header file fixes for 5.2-rc4

  These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
  added, based on the text in the files. We are slowly chipping away at
  the 700+ different ways people tried to write the license text. All of
  these were reviewed on the spdx mailing list by a number of different
  people.

  We now have over 60% of the kernel files covered with SPDX tags:
	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
	Files checked:            64533
	Files with SPDX:          40392
	Files with errors:            0

  I think the majority of the "easy" fixups are now done, it's now the
  start of the longer-tail of crazy variants to wade through"

* tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429
  ...
2019-06-08 12:52:42 -07:00
Linus Torvalds
1ce2c85137 Char/Misc driver fixes for 5.2-rc4
Here are some small char and misc driver fixes for 5.2-rc4 to resolve a
 number of reported issues.
 
 The most "notable" one here is the kernel headers in proc^Wsysfs fixes.
 Those changes move the header file info into sysfs and fixes the build
 issues that you reported.
 
 Other than that, a bunch of small habanalabs driver fixes, some fpga
 driver fixes, and a few other tiny driver fixes.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPuEVg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yma0QCfVPa7r1rHqljz1UgvjKJTzVg8g9wAn1W1mddx
 MIlG+0+ZnBdaPzyzoY1O
 =0LJD
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are some small char and misc driver fixes for 5.2-rc4 to resolve
  a number of reported issues.

  The most "notable" one here is the kernel headers in proc^Wsysfs
  fixes. Those changes move the header file info into sysfs and fixes
  the build issues that you reported.

  Other than that, a bunch of small habanalabs driver fixes, some fpga
  driver fixes, and a few other tiny driver fixes.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  habanalabs: Read upper bits of trace buffer from RWPHI
  habanalabs: Fix virtual address access via debugfs for 2MB pages
  fpga: zynqmp-fpga: Correctly handle error pointer
  habanalabs: fix bug in checking huge page optimization
  habanalabs: Avoid using a non-initialized MMU cache mutex
  habanalabs: fix debugfs code
  uapi/habanalabs: add opcode for enable/disable device debug mode
  habanalabs: halt debug engines on user process close
  test_firmware: Use correct snprintf() limit
  genwqe: Prevent an integer overflow in the ioctl
  parport: Fix mem leak in parport_register_dev_model
  fpga: dfl: expand minor range when registering chrdev region
  fpga: dfl: Add lockdep classes for pdata->lock
  fpga: dfl: afu: Pass the correct device to dma_mapping_error()
  fpga: stratix10-soc: fix use-after-free on s10_init()
  w1: ds2408: Fix typo after 49695ac468 (reset on output_write retry with readback)
  kheaders: Do not regenerate archive if config is not changed
  kheaders: Move from proc to sysfs
  lkdtm/bugs: Adjust recursion test to avoid elision
  lkdtm/usercopy: Moves the KERNEL_DS test to non-canonical
2019-06-08 12:50:36 -07:00
Linus Torvalds
8d72e5bd86 for-linus-20190608
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlz7bn4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiYlEADG5BaOEqmhRlwFB5slR8b4aOcrQQzjKclZ
 cgPxOCnINPKRWarR1LSjkCotMZDSZtnsOflbkdKorWebPVqFnmRo8EzK/cPCHhbF
 w3y/xwMZNU0KWz3KzSOegMgb7cJIj5ryN/4xmTg/4kYWVwMWyuEPaqF83NtGnujR
 TQmecOjk93IoaOl+YRqYnxqvztqHyRRQdzn/qazkblg5JM3WfrICqDSGdKNRoHzE
 oOLqVRkLDUO9JWnpqA6n3ZNcksSe80vLbt/syWLqt/XmJJzHJQAjxh8ikVR9cX7R
 LLyFg/s5cuDVhlZPtIfVyYoGvxenaLMB839UOwt5/PDw4wSLMSnVpw49VM4pz8WJ
 GMYXBsSzZgpKztf8hzax+3SOX7B5FaV7GV/Hqryt2PxxDJr521Njj29RQz0lwYEe
 R38zn9VjKABofiC1kGDUYrZ7LVsvcT/dKcZsyIICpzfkKE1OHAWAqgyOHXp+V8uc
 b4Z4dQONuXL///DcrT7FiZjyq4P3an4wmEuMqEsvH6XO6zo6ndCkw4OXxOzzihXI
 SYDmKQs92MkTuNxJJFBnEGrfKTOIy/MJDpzrqdFIy/JM6DOtG4pKahIhqD1dwWmw
 7a6MZ5AZWZpZ7P7+uQbtl56aC5vby974wax5pfcf061ICNFztgV1ws6wjVnsskRF
 fPnBMPeYVQ==
 =vG32
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190608' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Allow symlink from the bfq.weight cgroup parameter to the general
   weight (Angelo)

 - Damien is new skd maintainer (Bart)

 - NVMe pull request from Sagi, with a few small fixes.

 - Ensure we set DMA segment size properly, dma-debug is now tripping on
   these (Christoph)

 - Remove useless debugfs_create() return check (Greg)

 - Remove redundant unlikely() check on IS_ERR() (Kefeng)

 - Fixup request freeing on exit (Ming)

* tag 'for-linus-20190608' of git://git.kernel.dk/linux-block:
  block, bfq: add weight symlink to the bfq.weight cgroup parameter
  cgroup: let a symlink too be created with a cftype file
  block: free sched's request pool in blk_cleanup_queue
  nvme-rdma: use dynamic dma mapping per command
  nvme: Fix u32 overflow in the number of namespace list calculation
  mmc: also set max_segment_size in the device
  mtip32xx: also set max_segment_size in the device
  rsxx: don't call dma_set_max_seg_size
  nvme-pci: don't limit DMA segement size
  block: Drop unlikely before IS_ERR(_OR_NULL)
  block: aoe: no need to check return value of debugfs_create functions
  nvmet: fix data_len to 0 for bdev-backed write_zeroes
  MAINTAINERS: Hand over skd maintainership
  nvme-tcp: fix queue mapping when queue count is limited
  nvme-rdma: fix queue mapping when queue count is limited
2019-06-08 12:12:11 -07:00
David S. Miller
38e406f600 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-06-07

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix several bugs in riscv64 JIT code emission which forgot to clear high
   32-bits for alu32 ops, from Björn and Luke with selftests covering all
   relevant BPF alu ops from Björn and Jiong.

2) Two fixes for UDP BPF reuseport that avoid calling the program in case of
   __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption
   that skb->data is pointing to transport header, from Martin.

3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog
   workqueue, and a missing restore of sk_write_space when psock gets dropped,
   from Jakub and John.

4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it
   breaks standard applications like DNS if reverse NAT is not performed upon
   receive, from Daniel.

5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6
   fails to verify that the length of the tuple is long enough, from Lorenz.

6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for
   {un,}successful probe) as that is expected to be propagated as an fd to
   load_sk_storage_btf() and thus closing the wrong descriptor otherwise,
   from Michal.

7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir.

8) Minor misc fixes in docs, samples and selftests, from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 14:46:47 -07:00
Linus Torvalds
a373ec23ab Power management fixes for 5.2-rc4
- Fix a crash that occurs when a kernel with 'nosmt' in the command
    line is used to resume the system from hibernation (as the "restore"
    kernel), because memory mapping differences between the restore and
    image kernels cause SMT siblings to be woken up from idle states
    and subsequently they try to fetch instructions from incorrect
    memory locations (Jiri Kosina).
 
  - Cause the new Performance and Energy Bias Hint (EPB) code to be
    built only if CONFIG_PM is set, because that code is not really
    necessary otherwise (Rafael Wysocki).
 
  - Add kerneldoc comments to documents some helper functions related
    to system-wide suspend to avoid possible confusion regarding their
    purpose (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAlz6Kf8SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxcxkP/05X18Js4yZb7aUc9p6k0rAhI2kwE8OM
 MiF68H2b57qBinR5Amc/tCtMLvdScieVThFmti8nxO1Fvm4Ld+ZUfRPXZUmCc7VJ
 rQ7r1uEeWXC+vb5b2587jFVzRzARzQ9xF590tneI/T4p3yBvWaj6W/t0pxU4+Lok
 hAUeeMNIyBhNHkCZVeX3ZxNM03fFWbZBU1GIFhmDxMaXRAsYt5N0dtes/mjbhAFw
 /iXDuPv2FL4JqU/vuVWR5/icd35IZVDq1bCz5ChRlUKweoj+KhK9uxoKTYrAh3zo
 jhXE6RBZgVbVUzSWjhkWHEN2RkylIE2uhUvVpAwWFuqoHUfFOdXfF/PcRwkNC9tx
 XPNYHMzWKNXS/OIzBHqqv3AhNelou7onh78OROpIsw0AkbA2afN4RjywnybdzIwX
 Lb3q7B1DQt6Nv/sf7zyOzUuE0kJLJ4FGm4mCOkXb+CLQ9X/kkePlaOWi4/UBvT5x
 L07M4nTQla65UVIL1nnru8yTNSy2dESOj9QHlKJnkwOTJK1RfxLMmeUbao3EZkzh
 5woshADVHlRNCt/2fo2bllzJeDNjeXfiOjOOVsnf+vjiQ034d1CD9IXlVgW/QC32
 V/DjiWCai1HvHq5j5KurLDJOyMmDl8O9tqCLNSYAzOkvMJsuLN7rhzJossZtYCT4
 kTpy2BLIOrMI
 =r5bv
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a crash during resume from hibernation introduced during the
  4.19 cycle, cause the new Performance and Energy Bias Hint (EPB) code
  to be built only if CONFIG_PM is set and add a few missing kerneldoc
  comments.

  Specifics:

   - Fix a crash that occurs when a kernel with 'nosmt' in the command
     line is used to resume the system from hibernation (as the
     "restore" kernel), because memory mapping differences between the
     restore and image kernels cause SMT siblings to be woken up from
     idle states and subsequently they try to fetch instructions from
     incorrect memory locations (Jiri Kosina).

   - Cause the new Performance and Energy Bias Hint (EPB) code to be
     built only if CONFIG_PM is set, because that code is not really
     necessary otherwise (Rafael Wysocki).

   - Add kerneldoc comments to documents some helper functions related
     to system-wide suspend to avoid possible confusion regarding their
     purpose (Rafael Wysocki)"

* tag 'pm-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  x86/power: Fix 'nosmt' vs hibernation triple fault during resume
  PM: sleep: Add kerneldoc comments to some functions
  x86: intel_epb: Do not build when CONFIG_PM is unset
2019-06-07 11:36:17 -07:00
David S. Miller
a6cdeeb16b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 11:00:14 -07:00
Gustavo A. R. Silva
8d1b73dd25 kernel: module: Use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct module_sect_attrs {
	...
        struct module_sect_attr attrs[0];
};

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

So, replace the following form:

sizeof(*sect_attrs) + nloaded * sizeof(sect_attrs->attrs[0]

with:

struct_size(sect_attrs, attrs, nloaded)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-07 10:56:32 +02:00
Rafael J. Wysocki
a964d23c94 Merge branch 'pm-x86'
* pm-x86:
  x86/power: Fix 'nosmt' vs hibernation triple fault during resume
  x86: intel_epb: Do not build when CONFIG_PM is unset
2019-06-07 10:48:57 +02:00
Angelo Ruocco
54b7b868e8 cgroup: let a symlink too be created with a cftype file
This commit enables a cftype to have a symlink (of any name) that
points to the file associated with the cftype.

Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-07 01:29:39 -06:00
Daniel Borkmann
983695fa67 bpf: fix unconnected udp hooks
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
to applications as also stated in original motivation in 7828f20e37 ("Merge
branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
two hooks into Cilium to enable host based load-balancing with Kubernetes,
I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
typically sets up DNS as a service and is thus subject to load-balancing.

Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
is currently insufficient and thus not usable as-is for standard applications
shipped with most distros. To break down the issue we ran into with a simple
example:

  # cat /etc/resolv.conf
  nameserver 147.75.207.207
  nameserver 147.75.207.208

For the purpose of a simple test, we set up above IPs as service IPs and
transparently redirect traffic to a different DNS backend server for that
node:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

The attached BPF program is basically selecting one of the backends if the
service IP/port matches on the cgroup hook. DNS breaks here, because the
hooks are not transparent enough to applications which have built-in msg_name
address checks:

  # nslookup 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]
  ;; connection timed out; no servers could be reached

  # dig 1.1.1.1
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
  ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
  [...]

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; connection timed out; no servers could be reached

For comparison, if none of the service IPs is used, and we tell nslookup
to use 8.8.8.8 directly it works just fine, of course:

  # nslookup 1.1.1.1 8.8.8.8
  1.1.1.1.in-addr.arpa	name = one.one.one.one.

In order to fix this and thus act more transparent to the application,
this needs reverse translation on recvmsg() side. A minimal fix for this
API is to add similar recvmsg() hooks behind the BPF cgroups static key
such that the program can track state and replace the current sockaddr_in{,6}
with the original service IP. From BPF side, this basically tracks the
service tuple plus socket cookie in an LRU map where the reverse NAT can
then be retrieved via map value as one example. Side-note: the BPF cgroups
static key should be converted to a per-hook static key in future.

Same example after this fix:

  # cilium service list
  ID   Frontend            Backend
  1    147.75.207.207:53   1 => 8.8.8.8:53
  2    147.75.207.208:53   1 => 8.8.8.8:53

Lookups work fine now:

  # nslookup 1.1.1.1
  1.1.1.1.in-addr.arpa    name = one.one.one.one.

  Authoritative answers can be found from:

  # dig 1.1.1.1

  ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
  ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ;; QUESTION SECTION:
  ;1.1.1.1.                       IN      A

  ;; AUTHORITY SECTION:
  .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400

  ;; Query time: 17 msec
  ;; SERVER: 147.75.207.207#53(147.75.207.207)
  ;; WHEN: Tue May 21 12:59:38 UTC 2019
  ;; MSG SIZE  rcvd: 111

And from an actual packet level it shows that we're using the back end
server when talking via 147.75.207.20{7,8} front end:

  # tcpdump -i any udp
  [...]
  12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
  [...]

In order to be flexible and to have same semantics as in sendmsg BPF
programs, we only allow return codes in [1,1] range. In the sendmsg case
the program is called if msg->msg_name is present which can be the case
in both, connected and unconnected UDP.

The former only relies on the sockaddr_in{,6} passed via connect(2) if
passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
way to call into the BPF program whenever a non-NULL msg->msg_name was
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d2 ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martynas Pumputis <m@lambda.lt>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06 16:53:12 -07:00
Tejun Heo
cee0c33c54 cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
b636fd38dc ("cgroup: Implement css_task_iter_skip()") introduced
css_task_iter_skip() which is used to fix task iterations skipping
dying threadgroup leaders with live threads.  Skipping is implemented
as a subportion of full advancing but css_task_iter_next() forgot to
fully advance a skipped iterator before determining the next task to
visit causing it to return invalid task pointers.

Fix it by making css_task_iter_next() fully advance the iterator if it
has been skipped since the previous iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot
Link: http://lkml.kernel.org/r/00000000000097025d058a7fd785@google.com
Fixes: b636fd38dc ("cgroup: Implement css_task_iter_skip()")
2019-06-05 09:54:34 -07:00
Sudeep Holla
15532fd6f5 ptrace: move clearing of TIF_SYSCALL_EMU flag to core
While the TIF_SYSCALL_EMU is set in ptrace_resume independent of any
architecture, currently only powerpc and x86 unset the TIF_SYSCALL_EMU
flag in ptrace_disable which gets called from ptrace_detach.

Let's move the clearing of TIF_SYSCALL_EMU flag to __ptrace_unlink
which gets executed from ptrace_detach and also keep it along with
or close to clearing of TIF_SYSCALL_TRACE.

Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-06-05 17:51:17 +01:00
Thomas Gleixner
b886d83c5b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:17 +02:00
Thomas Gleixner
3e45610181 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436
Based on 1 normalized pattern(s):

  distributed under the terms of the gnu gpl version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 2 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.032570679@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:17 +02:00
Thomas Gleixner
767a67b0b3 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430
Based on 1 normalized pattern(s):

  distribute under gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 8 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.475576622@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:16 +02:00
Thomas Gleixner
55716d2643 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428
Based on 1 normalized pattern(s):

  this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:16 +02:00
Thomas Gleixner
ddc64d0ac9 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 363
Based on 1 normalized pattern(s):

  released under terms in gpl version 2 see copying

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081035.689962394@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:09 +02:00
Thomas Gleixner
4505153954 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 59 temple place suite 330 boston ma 02111
  1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 136 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:06 +02:00
Thomas Gleixner
5b497af42f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 64 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:38 +02:00
Thomas Gleixner
9c92ab6191 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 282
Based on 1 normalized pattern(s):

  this software is licensed under the terms of the gnu general public
  license version 2 as published by the free software foundation and
  may be copied distributed and modified under those terms this
  program is distributed in the hope that it will be useful but
  without any warranty without even the implied warranty of
  merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 285 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141900.642774971@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:37 +02:00
Petr Mladek
f36e664516 livepatch: Use static buffer for debugging messages under rq lock
The err_buf array uses 128 bytes of stack space.  Move it off the stack
by making it static.  It's safe to use a shared buffer because
klp_try_switch_task() is called under klp_mutex.

Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-06-05 16:35:47 +02:00
Christian Brauner
c732327f04
signal: improve comments
Improve the comments for pidfd_send_signal().
First, the comment still referred to a file descriptor for a process as a
"task file descriptor" which stems from way back at the beginning of the
discussion. Replace this with "pidfd" for consistency.
Second, the wording for the explanation of the arguments to the syscall
was a bit inconsistent, e.g. some used the past tense some used present
tense. Make the wording more consistent.

Signed-off-by: Christian Brauner <christian@brauner.io>
2019-06-05 15:06:07 +02:00
Prarit Bhargava
6e6de3dee5 kernel/module.c: Only return -EEXIST for modules that have finished loading
Microsoft HyperV disables the X86_FEATURE_SMCA bit on AMD systems, and
linux guests boot with repeated errors:

amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2)
amd64_edac_mod: Unknown symbol amd_unregister_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_register_ecc_decoder (err -2)
amd64_edac_mod: Unknown symbol amd_report_gart_errors (err -2)

The warnings occur because the module code erroneously returns -EEXIST
for modules that have failed to load and are in the process of being
removed from the module list.

module amd64_edac_mod has a dependency on module edac_mce_amd.  Using
modules.dep, systemd will load edac_mce_amd for every request of
amd64_edac_mod.  When the edac_mce_amd module loads, the module has
state MODULE_STATE_UNFORMED and once the module load fails and the state
becomes MODULE_STATE_GOING.  Another request for edac_mce_amd module
executes and add_unformed_module() will erroneously return -EEXIST even
though the previous instance of edac_mce_amd has MODULE_STATE_GOING.
Upon receiving -EEXIST, systemd attempts to load amd64_edac_mod, which
fails because of unknown symbols from edac_mce_amd.

add_unformed_module() must wait to return for any case other than
MODULE_STATE_LIVE to prevent a race between multiple loads of
dependent modules.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Barret Rhoden <brho@google.com>
Cc: David Arcari <darcari@redhat.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-06-05 09:53:24 +02:00
Colin Ian King
6685699e4e bpf: remove redundant assignment to err
The variable err is assigned with the value -EINVAL that is never
read and it is re-assigned a new value later on.  The assignment is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-04 16:57:07 +02:00
Greg Kroah-Hartman
1c769fc41a gcov: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Also delete the dentry variable as it is never needed.

Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 16:18:12 +02:00
Christoph Hellwig
c30700db9e dma-direct: provide generic support for uncached kernel segments
A few architectures support uncached kernel segments.  In that case we get
an uncached mapping for a given physica address by using an offset in the
uncached segement.  Implement support for this scheme in the generic
dma-direct code instead of duplicating it in arch hooks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-03 16:00:08 +02:00
Nicolin Chen
bd2e75633c dma-contiguous: use fallback alloc_pages for single pages
The addresses within a single page are always contiguous, so it's
not so necessary to always allocate one single page from CMA area.
Since the CMA area has a limited predefined size of space, it may
run out of space in heavy use cases, where there might be quite a
lot CMA pages being allocated for single pages.

However, there is also a concern that a device might care where a
page comes from -- it might expect the page from CMA area and act
differently if the page doesn't.

This patch tries to use the fallback alloc_pages path, instead of
one-page size allocations from the global CMA area in case that a
device does not have its own CMA area. This'd save resources from
the CMA global area for more CMA allocations, and also reduce CMA
fragmentations resulted from trivial allocations.

Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-03 16:00:07 +02:00
Nicolin Chen
b1d2dc009d dma-contiguous: add dma_{alloc,free}_contiguous() helpers
Both dma_alloc_from_contiguous() and dma_release_from_contiguous() are
very simply implemented, but requiring callers to pass certain
parameters like count and align, and taking a boolean parameter to check
__GFP_NOWARN in the allocation flags. So every function call duplicates
similar work:

	unsigned long order = get_order(size);
	size_t count = size >> PAGE_SHIFT;

	page = dma_alloc_from_contiguous(dev, count, order,
			gfp & __GFP_NOWARN);

	[...]

	dma_release_from_contiguous(dev, page, size >> PAGE_SHIFT);

Additionally, as CMA can be used only in the context which permits
sleeping, most of callers do a gfpflags_allow_blocking() check and a
corresponding fallback allocation of normal pages upon any false result:

	if (gfpflags_allow_blocking(flag))
		page = dma_alloc_from_contiguous();
	if (!page)
		page = alloc_pages();

	[...]

	if (!dma_release_from_contiguous(dev, page, count))
		__free_pages(page, get_order(size));

So this patch simplifies those function calls by abstracting these
operations into the two new functions: dma_{alloc,free}_contiguous.

As some callers of dma_{alloc,release}_from_contiguous() might be
complicated, this patch just implements these two new functions to
kernel/dma/direct.c only as an initial step.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
Tested-by: dann frazier <dann.frazier@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-03 16:00:07 +02:00
Greg Kroah-Hartman
8c0fd1fa64 kprobes: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:49:07 +02:00
Greg Kroah-Hartman
4aa3b1f67d fail_function: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Kees Cook <keescook@chromium.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:49:06 +02:00
Greg Kroah-Hartman
3e6f176f30 blktrace: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:39:39 +02:00
Greg Kroah-Hartman
6a54cd872f trace: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:39:39 +02:00
Peter Zijlstra
24811637db locking/lock_events: Use raw_cpu_{add,inc}() for stats
Instead of playing silly games with CONFIG_DEBUG_PREEMPT toggling
between this_cpu_*() and __this_cpu_*() use raw_cpu_*(), which is
exactly what we want here.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190527082326.GP2623@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 12:32:56 +02:00
Imre Deak
d9349850e1 locking/lockdep: Fix merging of hlocks with non-zero references
The sequence

	static DEFINE_WW_CLASS(test_ww_class);

	struct ww_acquire_ctx ww_ctx;
	struct ww_mutex ww_lock_a;
	struct ww_mutex ww_lock_b;
	struct ww_mutex ww_lock_c;
	struct mutex lock_c;

	ww_acquire_init(&ww_ctx, &test_ww_class);

	ww_mutex_init(&ww_lock_a, &test_ww_class);
	ww_mutex_init(&ww_lock_b, &test_ww_class);
	ww_mutex_init(&ww_lock_c, &test_ww_class);

	mutex_init(&lock_c);

	ww_mutex_lock(&ww_lock_a, &ww_ctx);

	mutex_lock(&lock_c);

	ww_mutex_lock(&ww_lock_b, &ww_ctx);
	ww_mutex_lock(&ww_lock_c, &ww_ctx);

	mutex_unlock(&lock_c);	(*)

	ww_mutex_unlock(&ww_lock_c);
	ww_mutex_unlock(&ww_lock_b);
	ww_mutex_unlock(&ww_lock_a);

	ww_acquire_fini(&ww_ctx); (**)

will trigger the following error in __lock_release() when calling
mutex_release() at **:

	DEBUG_LOCKS_WARN_ON(depth <= 0)

The problem is that the hlock merging happening at * updates the
references for test_ww_class incorrectly to 3 whereas it should've
updated it to 4 (representing all the instances for ww_ctx and
ww_lock_[abc]).

Fix this by updating the references during merging correctly taking into
account that we can have non-zero references (both for the hlock that we
merge into another hlock or for the hlock we are merging into).

Signed-off-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: =?UTF-8?q?Ville=20Syrj=C3=A4l=C3=A4?= <ville.syrjala@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190524201509.9199-2-imre.deak@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 12:32:56 +02:00
Imre Deak
8c8889d8ea locking/lockdep: Fix OOO unlock when hlocks need merging
The sequence

	static DEFINE_WW_CLASS(test_ww_class);

	struct ww_acquire_ctx ww_ctx;
	struct ww_mutex ww_lock_a;
	struct ww_mutex ww_lock_b;
	struct mutex lock_c;
	struct mutex lock_d;

	ww_acquire_init(&ww_ctx, &test_ww_class);

	ww_mutex_init(&ww_lock_a, &test_ww_class);
	ww_mutex_init(&ww_lock_b, &test_ww_class);

	mutex_init(&lock_c);

	ww_mutex_lock(&ww_lock_a, &ww_ctx);

	mutex_lock(&lock_c);

	ww_mutex_lock(&ww_lock_b, &ww_ctx);

	mutex_unlock(&lock_c);		(*)

	ww_mutex_unlock(&ww_lock_b);
	ww_mutex_unlock(&ww_lock_a);

	ww_acquire_fini(&ww_ctx);

triggers the following WARN in __lock_release() when doing the unlock at *:

	DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth - 1);

The problem is that the WARN check doesn't take into account the merging
of ww_lock_a and ww_lock_b which results in decreasing curr->lockdep_depth
by 2 not only 1.

Note that the following sequence doesn't trigger the WARN, since there
won't be any hlock merging.

	ww_acquire_init(&ww_ctx, &test_ww_class);

	ww_mutex_init(&ww_lock_a, &test_ww_class);
	ww_mutex_init(&ww_lock_b, &test_ww_class);

	mutex_init(&lock_c);
	mutex_init(&lock_d);

	ww_mutex_lock(&ww_lock_a, &ww_ctx);

	mutex_lock(&lock_c);
	mutex_lock(&lock_d);

	ww_mutex_lock(&ww_lock_b, &ww_ctx);

	mutex_unlock(&lock_d);

	ww_mutex_unlock(&ww_lock_b);
	ww_mutex_unlock(&ww_lock_a);

	mutex_unlock(&lock_c);

	ww_acquire_fini(&ww_ctx);

In general both of the above two sequences are valid and shouldn't
trigger any lockdep warning.

Fix this by taking the decrement due to the hlock merging into account
during lock release and hlock class re-setting. Merging can't happen
during lock downgrading since there won't be a new possibility to merge
hlocks in that case, so add a WARN if merging still happens then.

Signed-off-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: ville.syrjala@linux.intel.com
Link: https://lkml.kernel.org/r/20190524201509.9199-1-imre.deak@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 12:32:29 +02:00
Jiri Kosina
ec527c3180 x86/power: Fix 'nosmt' vs hibernation triple fault during resume
As explained in

	0cc3cd2165 ("cpu/hotplug: Boot HT siblings at least once")

we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid first MCE bringing the system to its knees.

That means that whenever 'nosmt' is supplied on the kernel command-line,
all the HT siblings are as a result sitting in mwait or cpudile after
going through the online-offline cycle at least once.

This causes a serious issue though when a kernel, which saw 'nosmt' on its
commandline, is going to perform resume from hibernation: if the resume
from the hibernated image is successful, cr3 is flipped in order to point
to the address space of the kernel that is being resumed, which in turn
means that all the HT siblings are all of a sudden mwaiting on address
which is no longer valid.

That results in triple fault shortly after cr3 is switched, and machine
reboots.

Fix this by always waking up all the SMT siblings before initiating the
'restore from hibernation' process; this guarantees that all the HT
siblings will be properly carried over to the resumed kernel waiting in
resume_play_dead(), and acted upon accordingly afterwards, based on the
target kernel configuration.

Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
again in case it has SMT disabled; this means it has to online all
the siblings when resuming (so that they come out of hlt) and offline
them again to let them reach mwait.

Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
Debugged-by: Thomas Gleixner <tglx@linutronix.de>
Fixes: 0cc3cd2165 ("cpu/hotplug: Boot HT siblings at least once")
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-03 12:02:03 +02:00
Jiri Olsa
f3a3a8257e perf/core: Add attr_groups_update into struct pmu
Adding attr_update attribute group into pmu, to allow
having multiple attribute groups for same group name.

This will allow us to update "events" or "format"
directories with attributes that depend on various
HW conditions.

For example having group_format_extra group that updates
"format" directory only if pmu version is 2 and higher:

  static umode_t
  exra_is_visible(struct kobject *kobj, struct attribute *attr, int i)
  {
         return x86_pmu.version >= 2 ? attr->mode : 0;
  }

  static struct attribute_group group_format_extra = {
         .name       = "format",
         .is_visible = exra_is_visible,
  };

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190512155518.21468-3-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:58:21 +02:00
Song Liu
9fd2e48b9a perf/core: Allow non-privileged uprobe for user processes
Currently, non-privileged user could only use uprobe with

    kernel.perf_event_paranoid = -1

However, setting perf_event_paranoid to -1 leaks other users' processes to
non-privileged uprobes.

To introduce proper permission control of uprobes, we are building the
following system:

  A daemon with CAP_SYS_ADMIN is in charge to create uprobes via tracefs;
  Users asks the daemon to create uprobes;
  Then user can attach uprobe only to processes owned by the user.

This patch allows non-privileged user to attach uprobe to processes owned
by the user.

The following example shows how to use uprobe with non-privileged user.
This is based on Brendan's blog post [1]

1. Create uprobe with root:

  sudo perf probe -x 'readline%return +0($retval):string'

2. Then non-root user can use the uprobe as:

  perf record -vvv -e probe_bash:readline__return -p <pid> sleep 20
  perf script

[1] http://www.brendangregg.com/blog/2015-06-28/linux-ftrace-uprobe.html

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <kernel-team@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190507161545.788381-1-songliubraving@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:58:18 +02:00
Yuyang Du
bf998b98f5 locking/lockdep: Remove !dir in lock irq usage check
In mark_lock_irq(), the following checks are performed:

   ----------------------------------
  |   ->      | unsafe | read unsafe |
  |----------------------------------|
  | safe      |  F  B  |    F* B*    |
  |----------------------------------|
  | read safe |  F? B* |      -      |
   ----------------------------------

Where:
F: check_usage_forwards
B: check_usage_backwards
*: check enabled by STRICT_READ_CHECKS
?: check enabled by the !dir condition

From checking point of view, the special F? case does not make sense,
whereas it perhaps is made for peroformance concern. As later patch will
address this issue, remove this exception, which makes the checks
consistent later.

With STRICT_READ_CHECKS = 1 which is default, there is no functional
change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-24-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:53 +02:00
Yuyang Du
4d56330df2 locking/lockdep: Adjust new bit cases in mark_lock
The new bit can be any possible lock usage except it is garbage, so the
cases in switch can be made simpler. Warn early on if wrong usage bit is
passed without taking locks. No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-23-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:52 +02:00
Yuyang Du
0918065151 locking/lockdep: Consolidate lock usage bit initialization
Lock usage bit initialization is consolidated into one function
mark_usage(). Trivial readability improvement. No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-22-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:51 +02:00
Yuyang Du
68e9dc29f8 locking/lockdep: Check redundant dependency only when CONFIG_LOCKDEP_SMALL
As Peter has put it all sound and complete for the cause, I simply quote:

"It (check_redundant) was added for cross-release (which has since been
reverted) which would generate a lot of redundant links (IIRC) but
having it makes the reports more convoluted -- basically, if we had an
A-B-C relation, then A-C will not be added to the graph because it is
already covered. This then means any report will include B, even though
a shorter cycle might have been possible."

This would increase the number of direct dependencies. For a simple workload
(make clean; reboot; make vmlinux -j8), the data looks like this:

 CONFIG_LOCKDEP_SMALL: direct dependencies:                  6926

!CONFIG_LOCKDEP_SMALL: direct dependencies:                  9052    (+30.7%)

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-21-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:50 +02:00
Yuyang Du
8c2c2b449a locking/lockdep: Refactorize check_noncircular and check_redundant
These two functions now handle different check results themselves. A new
check_path function is added to check whether there is a path in the
dependency graph. No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-20-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:50 +02:00
Yuyang Du
b4adfe8e05 locking/lockdep: Remove unused argument in __lock_release
The @nested is not used in __release_lock so remove it despite that it
is not used in lock_release in the first place.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-19-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:49 +02:00
Yuyang Du
4609c4f963 locking/lockdep: Remove redundant argument in check_deadlock
In check_deadlock(), the third argument read comes from the second
argument hlock so that it can be removed. No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-18-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:49 +02:00
Yuyang Du
154f185e9c locking/lockdep: Update comments on dependency search
The breadth-first search is implemented as flat-out non-recursive now, but
the comments are still describing it as recursive, update the comments in
that regard.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-16-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:47 +02:00
Yuyang Du
77a806922c locking/lockdep: Avoid constant checks in __bfs by using offset reference
In search of a dependency in the lock graph, there is contant checks for
forward or backward search. Directly reference the field offset of the
struct that differentiates the type of search to avoid those checks.

No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-15-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:46 +02:00
Yuyang Du
c166132559 locking/lockdep: Change the return type of __cq_dequeue()
With the change, we can slightly adjust the code to iterate the queue in BFS
search, which simplifies the code. No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-14-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:46 +02:00
Yuyang Du
aa4807719e locking/lockdep: Change type of the element field in circular_queue
The element field is an array in struct circular_queue to keep track of locks
in the search. Making it the same type as the locks avoids type cast. Also
fix a typo and elaborate the comment above struct circular_queue.

No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-13-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:45 +02:00
Yuyang Du
31a490e5c5 locking/lockdep: Update comment
A leftover comment is removed. While at it, add more explanatory
comments. Such a trivial patch!

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-12-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:44 +02:00
Yuyang Du
0b9fc8ecfa locking/lockdep: Remove unused argument in validate_chain() and check_deadlock()
The lockdep_map argument in them is not used, remove it.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-11-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:44 +02:00
Yuyang Du
01bb6f0af9 locking/lockdep: Change the range of class_idx in held_lock struct
held_lock->class_idx is used to point to the class of the held lock. The
index is shifted by 1 to make index 0 mean no class, which results in class
index shifting back and forth but is not worth doing so.

The reason is: (1) there will be no "no-class" held_lock to begin with, and
(2) index 0 seems to be used for error checking, but if something wrong
indeed happened, the index can't be counted on to distinguish it as that
something won't set the class_idx to 0 on purpose to tell us it is wrong.

Therefore, change the index to start from 0. This saves a lot of
back-and-forth shifts and a class slot back to lock_classes.

Since index 0 is now used for lock class, we change the initial chain key to
-1 to avoid key collision, which is due to the fact that __jhash_mix(0, 0, 0) = 0.
Actually, the initial chain key can be any arbitrary value other than 0.

In addition, a bitmap is maintained to keep track of the used lock classes,
and we check the validity of the held lock against that bitmap.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-10-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:43 +02:00
Yuyang Du
f6ec8829ac locking/lockdep: Define INITIAL_CHAIN_KEY for chain keys to start with
Chain keys are computed using Jenkins hash function, which needs an initial
hash to start with. Dedicate a macro to make this clear and configurable. A
later patch changes this initial chain key.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-9-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:43 +02:00
Yuyang Du
e196e479a3 locking/lockdep: Use lockdep_init_task for task initiation consistently
Despite that there is a lockdep_init_task() which does nothing, lockdep
initiates tasks by assigning lockdep fields and does so inconsistently. Fix
this by using lockdep_init_task().

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-8-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:42 +02:00
Yuyang Du
834494b280 locking/lockdep: Print the right depth for chain key collision
Since chains are separated by IRQ context, so when printing a chain the
depth should be consistent with it.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-6-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:36 +02:00
Yuyang Du
e7a38f63ba locking/lockdep: Remove useless conditional macro
Since #defined(CONFIG_PROVE_LOCKING) is used in the scope of #ifdef
CONFIG_PROVE_LOCKING, it can be removed.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-5-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:35 +02:00
Yuyang Du
c52478f4f3 locking/lockdep: Adjust lock usage bit character checks
The lock usage bit characters are defined and determined with tricks.
Add some explanation to make it a bit clearer, then adjust the logic to
check the usage, which optimizes the code a bit.

No functional change.

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-4-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:35 +02:00
Yuyang Du
f7c1c6b36a locking/lockdep: Change all print_*() return type to void
Since none of the print_*() function's return value is necessary, change
their return type to void. No functional change.

In cases where an invariable return value is used, this change slightly
improves readability, i.e.:

	print_x();
	return 0;

is definitely better than:

	return print_x(); /* where print_x() always returns 0 */

Signed-off-by: Yuyang Du <duyuyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bvanassche@acm.org
Cc: frederic@kernel.org
Cc: ming.lei@redhat.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190506081939.74287-2-duyuyang@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:55:32 +02:00
Ingo Molnar
26b73da360 Linux 5.2-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlz0N88eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG3kIH/2uP/+A3STjoURBh
 nCZVThVUXryD+9eughto97PfkBsVs6Wfylx/WX4Qhi4zi8PnIM8DnY9MuCdfhT5+
 7WN76MQrCxagHOtHfGf2yXYtYP4wfNmbttWPxsxtEsWVNMzboCMILTGeSpZlwD04
 bb5qdRVeAcULO3A0xAJXS/sSAvX9mFDLDfOV24G2ksRbmrzDs8KPRVJBoSicem+Z
 Rz0wktu+G3GAb8j3mBu2DcDe66pLGLCbQ3VxwpbCN0+ZyEXUkiY7khGCFEX0SxLH
 1+SICNVbdJWMvhQf4p0eEUX/5NhIhtZyUFMiXX/vHnglECTRk4AQ9LQaVuYXDey9
 wsnlA9o=
 =KXpG
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc3' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:50:18 +02:00
Dietmar Eggemann
af75d1a9a9 sched/fair: Remove sgs->sum_weighted_load
Since sg_lb_stats::sum_weighted_load is now identical with
sg_lb_stats::group_load remove it and replace its use case
(calculating load per task) with the latter.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20190527062116.11512-7-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:41 +02:00
Dietmar Eggemann
0e1fef63d9 sched/core: Remove sd->*_idx
The sched domain per rq load index files also disappear from the
/proc/sys/kernel/sched_domain/cpuX/domainY directories.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:40 +02:00
Dietmar Eggemann
55627e3cd2 sched/core: Remove rq->cpu_load[]
The per rq load array values also disappear from the cpu#X sections in
/proc/sched_debug.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-5-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:40 +02:00
Dietmar Eggemann
3d8d535544 sched/debug: Remove sd->*_idx range on sysctl
This reverts:

  commit 201c373e8e ("sched/debug: Limit sd->*_idx range on sysctl")

Load indexes (sd->*_idx) are no longer needed without rq->cpu_load[].
The range check for load indexes can be removed as well. Get rid of it
before the rq->cpu_load[] since it uses CPU_LOAD_IDX_MAX.

At the same time, fix the following coding style issues detected by
scripts/checkpatch.pl:

  ERROR: space prohibited before that ','
  ERROR: space prohibited before that close parenthesis ')'

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-4-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:39 +02:00
Dietmar Eggemann
1c1b8a7b03 sched/fair: Replace source_load() & target_load() with weighted_cpuload()
With LB_BIAS disabled, source_load() & target_load() return
weighted_cpuload(). Replace both with calls to weighted_cpuload().

The function to obtain the load index (sd->*_idx) for an sd,
get_sd_load_idx(), can be removed as well.

Finally, get rid of the sched feature LB_BIAS.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-3-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:39 +02:00
Dietmar Eggemann
5e83eafbfd sched/fair: Remove the rq->cpu_load[] update code
With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx]
any more.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:38 +02:00
Dietmar Eggemann
f2bedc4705 sched/fair: Remove rq->load
The CFS class is the only one maintaining and using the CPU wide load
(rq->load(.weight)). The last use case of the CPU wide load in CFS's
set_next_entity() can be replaced by using the load of the CFS class
(rq->cfs.load(.weight)) instead.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190424084556.604-1-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:37 +02:00
Sebastian Andrzej Siewior
3bd3706251 sched/core: Provide a pointer to the valid CPU mask
In commit:

  4b53a3412d ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper")

the tsk_nr_cpus_allowed() wrapper was removed. There was not
much difference in !RT but in RT we used this to implement
migrate_disable(). Within a migrate_disable() section the CPU mask is
restricted to single CPU while the "normal" CPU mask remains untouched.

As an alternative implementation Ingo suggested to use:

	struct task_struct {
		const cpumask_t		*cpus_ptr;
		cpumask_t		cpus_mask;
        };
with
	t->cpus_ptr = &t->cpus_mask;

In -RT we then can switch the cpus_ptr to:

	t->cpus_ptr = &cpumask_of(task_cpu(p));

in a migration disabled region. The rules are simple:

 - Code that 'uses' ->cpus_allowed would use the pointer.
 - Code that 'modifies' ->cpus_allowed would use the direct mask.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:49:37 +02:00
Rafael J. Wysocki
a613734761 PM: sleep: Add kerneldoc comments to some functions
Add kerneldoc comments to pm_suspend_via_firmware(),
pm_resume_via_firmware() and pm_suspend_via_s2idle()
to explain what they do.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-06-03 10:44:21 +02:00
Linus Torvalds
6751b8d91a Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "On the kernel side there's a bunch of ring-buffer ordering fixes for a
  reproducible bug, plus a PEBS constraints regression fix.

  Plus tooling fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools headers UAPI: Sync kvm.h headers with the kernel sources
  perf record: Fix s390 missing module symbol and warning for non-root users
  perf machine: Read also the end of the kernel
  perf test vmlinux-kallsyms: Ignore aliases to _etext when searching on kallsyms
  perf session: Add missing swap ops for namespace events
  perf namespace: Protect reading thread's namespace
  tools headers UAPI: Sync drm/drm.h with the kernel
  tools headers UAPI: Sync drm/i915_drm.h with the kernel
  tools headers UAPI: Sync linux/fs.h with the kernel
  tools headers UAPI: Sync linux/sched.h with the kernel
  tools arch x86: Sync asm/cpufeatures.h with the with the kernel
  tools include UAPI: Update copy of files related to new fspick, fsmount, fsconfig, fsopen, move_mount and open_tree syscalls
  perf arm64: Fix mksyscalltbl when system kernel headers are ahead of the kernel
  perf data: Fix 'strncat may truncate' build failure with recent gcc
  perf/ring-buffer: Use regular variables for nesting
  perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
  perf/ring_buffer: Add ordering to rb->nest increment
  perf/ring_buffer: Fix exposing a temporarily decreased data_head
  perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints
2019-06-02 11:08:12 -07:00
Linus Torvalds
4fb5741c7c Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stacktrace fix from Ingo Molnar:
 "Fix a stack_trace_save_tsk_reliable() regression"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  stacktrace: Unbreak stack_trace_save_tsk_reliable()
2019-06-02 11:04:42 -07:00
Linus Torvalds
7b3064f0e8 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "Various fixes and followups"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, compaction: make sure we isolate a valid PFN
  include/linux/generic-radix-tree.h: fix kerneldoc comment
  kernel/signal.c: trace_signal_deliver when signal_group_exit
  drivers/iommu/intel-iommu.c: fix variable 'iommu' set but not used
  spdxcheck.py: fix directory structures
  kasan: initialize tag to 0xff in __kasan_kmalloc
  z3fold: fix sheduling while atomic
  scripts/gdb: fix invocation when CONFIG_COMMON_CLK is not set
  mm/gup: continue VM_FAULT_RETRY processing even for pre-faults
  ocfs2: fix error path kobject memory leak
  memcg: make it work on sparse non-0-node systems
  mm, memcg: consider subtrees in memory.events
  prctl_set_mm: downgrade mmap_sem to read lock
  prctl_set_mm: refactor checks from validate_prctl_map
  kernel/fork.c: make max_threads symbol static
  arch/arm/boot/compressed/decompress.c: fix build error due to lz4 changes
  arch/parisc/configs/c8000_defconfig: remove obsoleted CONFIG_DEBUG_SLAB_LEAK
  mm/vmalloc.c: fix typo in comment
  lib/sort.c: fix kernel-doc notation warnings
  mm: fix Documentation/vm/hmm.rst Sphinx warnings
2019-06-02 08:51:30 -07:00
Zhenliang Wei
98af37d624 kernel/signal.c: trace_signal_deliver when signal_group_exit
In the fixes commit, removing SIGKILL from each thread signal mask and
executing "goto fatal" directly will skip the call to
"trace_signal_deliver".  At this point, the delivery tracking of the
SIGKILL signal will be inaccurate.

Therefore, we need to add trace_signal_deliver before "goto fatal" after
executing sigdelset.

Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.

Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com
Fixes: cf43a757fd ("signal: Restore the stop PTRACE_EVENT_EXIT")
Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
Reviewed-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ivan Delalande <colona@arista.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:32 -07:00
Chris Down
9852ae3fe5 mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.

The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files.  For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour.  The same confusion applies to other counters in this file when
debugging issues.

Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.

After this patch, events are propagated up the hierarchy:

    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 0
    oom 0
    oom_kill 0
    [root@ktst ~]# systemd-run -p MemoryMax=1 true
    Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
    [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
    low 0
    high 0
    max 7
    oom 1
    oom_kill 1

As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set.  However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.

akpm: this is a behaviour change, so Cc:stable.  THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.

Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
Michal Koutný
bc81426f5b prctl_set_mm: downgrade mmap_sem to read lock
The commit a3b609ef9f ("proc read mm's {arg,env}_{start,end} with mmap
semaphore taken.") added synchronization of reading argument/environment
boundaries under mmap_sem.  Later commit 88aa7cc688 ("mm: introduce
arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
the coarse use of mmap_sem in similar situations.  But there still
remained two places that (mis)use mmap_sem.

get_cmdline should also use arg_lock instead of mmap_sem when it reads the
boundaries.

The second place that should use arg_lock is in prctl_set_mm.  By
protecting the boundaries fields with the arg_lock, we can downgrade
mmap_sem to reader lock (analogous to what we already do in
prctl_set_mm_map).

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
Fixes: 88aa7cc688 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Co-developed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
Michal Koutný
11bbd8b416 prctl_set_mm: refactor checks from validate_prctl_map
Despite comment of validate_prctl_map claims there are no capability
checks, it is not completely true since commit 4d28df6152 ("prctl: Allow
local CAP_SYS_ADMIN changing exe_file").  Extract the check out of the
function and make the function perform purely arithmetic checks.

This patch should not change any behavior, it is mere refactoring for
following patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190502125203.24014-2-mkoutny@suse.com
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
Kefeng Wang
8856ae4df3 kernel/fork.c: make max_threads symbol static
Fix build warning,
kernel/fork.c:125:5: warning: symbol 'max_threads' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20190516015118.140561-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
David S. Miller
0462eaacee Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-05-31

The following pull-request contains BPF updates for your *net-next* tree.

Lots of exciting new features in the first PR of this developement cycle!
The main changes are:

1) misc verifier improvements, from Alexei.

2) bpftool can now convert btf to valid C, from Andrii.

3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs.
   This feature greatly improves BPF speed on 32-bit architectures. From Jiong.

4) cgroups will now auto-detach bpf programs. This fixes issue of thousands
   bpf programs got stuck in dying cgroups. From Roman.

5) new bpf_send_signal() helper, from Yonghong.

6) cgroup inet skb programs can signal CN to the stack, from Lawrence.

7) miscellaneous cleanups, from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 21:21:18 -07:00
Roman Gushchin
c85d69135a bpf: move memory size checks to bpf_map_charge_init()
Most bpf map types doing similar checks and bytes to pages
conversion during memory allocation and charging.

Let's unify these checks by moving them into bpf_map_charge_init().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
Roman Gushchin
b936ca643a bpf: rework memlock-based memory accounting for maps
In order to unify the existing memlock charging code with the
memcg-based memory accounting, which will be added later, let's
rework the current scheme.

Currently the following design is used:
  1) .alloc() callback optionally checks if the allocation will likely
     succeed using bpf_map_precharge_memlock()
  2) .alloc() performs actual allocations
  3) .alloc() callback calculates map cost and sets map.memory.pages
  4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
     and performs actual charging; in case of failure the map is
     destroyed
  <map is in use>
  1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
     performs uncharge and releases the user
  2) .map_free() callback releases the memory

The scheme can be simplified and made more robust:
  1) .alloc() calculates map cost and calls bpf_map_charge_init()
  2) bpf_map_charge_init() sets map.memory.user and performs actual
    charge
  3) .alloc() performs actual allocations
  <map is in use>
  1) .map_free() callback releases the memory
  2) bpf_map_charge_finish() performs uncharge and releases the user

The new scheme also allows to reuse bpf_map_charge_init()/finish()
functions for memcg-based accounting. Because charges are performed
before actual allocations and uncharges after freeing the memory,
no bogus memory pressure can be created.

In cases when the map structure is not available (e.g. it's not
created yet, or is already destroyed), on-stack bpf_map_memory
structure is used. The charge can be transferred with the
bpf_map_charge_move() function.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
Roman Gushchin
3539b96e04 bpf: group memory related fields in struct bpf_map_memory
Group "user" and "pages" fields of bpf_map into the bpf_map_memory
structure. Later it can be extended with "memcg" and other related
information.

The main reason for a such change (beside cosmetics) is to pass
bpf_map_memory structure to charging functions before the actual
allocation of bpf_map.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
Roman Gushchin
ffc8b144d5 bpf: add memlock precharge check for cgroup_local_storage
Cgroup local storage maps lack the memlock precharge check,
which is performed before the memory allocation for
most other bpf map types.

Let's add it in order to unify all map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:52:56 -07:00
brakmo
e7a3160d09 bpf: Update __cgroup_bpf_run_filter_skb with cn
For egress packets, __cgroup_bpf_fun_filter_skb() will now call
BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY()
in order to propagate congestion notifications (cn) requests to TCP
callers.

For egress packets, this function can return:
   NET_XMIT_SUCCESS    (0)    - continue with packet output
   NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
   NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                to call cwr
   -EPERM                     - drop packet

For ingress packets, this function will return -EPERM if any attached
program was found and if it returned != 1 during execution. Otherwise 0
is returned.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:41:29 -07:00
brakmo
5cf1e91456 bpf: cgroup inet skb programs can return 0 to 3
Allows cgroup inet skb programs to return values in the range [0, 3].
The second bit is used to deterine if congestion occurred and higher
level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr()

The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
at load time if it uses the new return values (i.e. 2 or 3).

The expected_attach_type is currently not enforced for
BPF_PROG_TYPE_CGROUP_SKB.  e.g Meaning the current bpf_prog with
expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to
BPF_CGROUP_INET_INGRESS.  Blindly enforcing expected_attach_type will
break backward compatibility.

This patch adds a enforce_expected_attach_type bit to only
enforce the expected_attach_type when it uses the new
return value.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31 16:41:29 -07:00
Tejun Heo
a5e112e642 cgroup: add cgroup_parse_float()
cgroup already uses floating point for percent[ile] numbers and there
are several controllers which want to take them as input.  Add a
generic parse helper to handle inputs.

Update the interface convention documentation about the use of
percentage numbers.  While at it, also clarify the default time unit.

Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-31 11:48:40 -07:00
Tejun Heo
c03cd7738a cgroup: Include dying leaders with live threads in PROCS iterations
CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
this means that a process with dying leader and live threads will be
skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
which is confusing to say the least.

Fix it by making cset track dying tasks and include dying leaders with
live threads in PROCS iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
2019-05-31 10:38:58 -07:00
Tejun Heo
b636fd38dc cgroup: Implement css_task_iter_skip()
When a task is moved out of a cset, task iterators pointing to the
task are advanced using the normal css_task_iter_advance() call.  This
is fine but we'll be tracking dying tasks on csets and thus moving
tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
remove a task from cset->tasks, if we advance the iterators, they may
move over to the next cset before we had the chance to add the task
back on the dying list, which can allow the task to escape iteration.

This patch separates out skipping from advancing.  Skipping only moves
the affected iterators to the next pointer rather than fully advancing
it and the following advancing will recognize that the cursor has
already been moved forward and do the rest of advancing.  This ensures
that when a task moves from one list to another in its cset, as long
as it moves in the right direction, it's always visible to iteration.

This doesn't cause any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
2019-05-31 10:38:58 -07:00
Tejun Heo
6b115bf58e cgroup: Call cgroup_release() before __exit_signal()
cgroup_release() calls cgroup_subsys->release() which is used by the
pids controller to uncharge its pid.  We want to use it to manage
iteration of dying tasks which requires putting it before
__unhash_process().  Move cgroup_release() above __exit_signal().
While this makes it uncharge before the pid is freed, pid is RCU freed
anyway and the window is very narrow.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
2019-05-31 10:38:57 -07:00
Linus Torvalds
702c31e856 Power management fixes for 5.2-rc3
- Modify the PCI bus type's PM code to avoid putting devices left
    by their drivers in D0 on purpose during suspend to idle into
    low-power states as doing that may confuse the system resume
    callbacks of the drivers in question (Rafael Wysocki).
 
  - Avoid checking ACPI wakeup configuration during system-wide
    suspend for suspended devices that do not use ACPI-based wakeup
    to allow them to stay in suspend more often (Rafael Wysocki).
 
  - The last phase of hibernation is analogous to system-wide suspend
    also because on platforms with ACPI it passes control to the
    platform firmware to complete the transision, so make it indicate
    that by calling pm_set_suspend_via_firmware() to allow the drivers
    that care about this to do the right thing (Rafael Wysocki).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAlzw7b0SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxeCAQAJQKrXMdguELnhSS5pEC4+WsFTk4nwAN
 DczNIbhogFSmzw4jOWsjQARqbZkDXhmMSoWSKXLlujONDtYOm56Cun3R90DLL3wp
 jIu41LA2M3jM8/hmog9elhr/eFQwbnO3qm7nanysBqWdsQlqmz1d0/BhlGf+rufc
 nFVXbxmBUk8M9o6guhwPl5YULIkxOFR9b8mjfOvpsxPcMBz9+ZRSM7/KbVVbJCY4
 Bkbbu/IUAGywAO2PFjl0lBvdYT7Rbaf+/UOUhOF+3AUcEgoBcJl0+2eamngwra8U
 OVzip+vKIcYCdrzmpCw1X4pesAV7Lq8AdXhWDGMGn0QUss/j5nBmesXrXKzecPum
 7ett/9ZAQ0UncULnHWmXu4352r+RcKZix/ul3k3uR+flBENK7rR7M8PUJdIWqaaB
 +qxcLz8MgCBcRw53fWjGy8gJd+IqKx+wcmqQ9tnHpVC0HK51KR6uCaF1VI8SDT/0
 t+vbmtlKb9Fi5Th4tAytzZS49uoREmm7hs+rnFxNe+Ms4kLCc3/ZpUyQeAC6jBB0
 Ul6RfWhHbzuZoqYmxXiEVkcrnQryiWbqmS1AbHQyUnjKepn+z00Rt7Ye94A7C8n/
 fxCWOpCZtBvWOT9MudNIVh/YIuEEsPnZWn2mAkAs8gDGlOgcpsl/LbAhJLYb4IOv
 9w0RNwXVKsVk
 =Ddo8
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix three issues in the system-wide suspend and hibernation area
  related to PCI device PM handling by suspend-to-idle, device wakeup
  optimizations and arbitrary differences between suspend and
  hiberantion.

  Specifics:

   - Modify the PCI bus type's PM code to avoid putting devices left by
     their drivers in D0 on purpose during suspend to idle into
     low-power states as doing that may confuse the system resume
     callbacks of the drivers in question (Rafael Wysocki).

   - Avoid checking ACPI wakeup configuration during system-wide suspend
     for suspended devices that do not use ACPI-based wakeup to allow
     them to stay in suspend more often (Rafael Wysocki).

   - The last phase of hibernation is analogous to system-wide suspend
     also because on platforms with ACPI it passes control to the
     platform firmware to complete the transision, so make it indicate
     that by calling pm_set_suspend_via_firmware() to allow the drivers
     that care about this to do the right thing (Rafael Wysocki)"

* tag 'pm-5.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PCI: PM: Avoid possible suspend-to-idle issue
  ACPI: PM: Call pm_set_suspend_via_firmware() during hibernation
  ACPI/PCI: PM: Add missing wakeup.flags.valid checks
2019-05-31 10:38:35 -07:00
Linus Torvalds
2f4c533499 SPDX update for 5.2-rc3, round 1
Here is another set of reviewed patches that adds SPDX tags to different
 kernel files, based on a set of rules that are being used to parse the
 comments to try to determine that the license of the file is
 "GPL-2.0-or-later" or "GPL-2.0-only".  Only the "obvious" versions of
 these matches are included here, a number of "non-obvious" variants of
 text have been found but those have been postponed for later review and
 analysis.
 
 There is also a patch in here to add the proper SPDX header to a bunch
 of Kbuild files that we have missed in the past due to new files being
 added and forgetting that Kbuild uses two different file names for
 Makefiles.  This issue was reported by the Kbuild maintainer.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPCHLg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykxyACgql6ktH+Tv8Ho1747kKPiFca1Jq0AoK5HORXI
 yB0DSTXYNjMtH41ypnsZ
 =x2f8
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull yet more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later" or "GPL-2.0-only". Only the "obvious" versions of
  these matches are included here, a number of "non-obvious" variants of
  text have been found but those have been postponed for later review
  and analysis.

  There is also a patch in here to add the proper SPDX header to a bunch
  of Kbuild files that we have missed in the past due to new files being
  added and forgetting that Kbuild uses two different file names for
  Makefiles. This issue was reported by the Kbuild maintainer.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (82 commits)
  treewide: Add SPDX license identifier - Kbuild
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 225
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 224
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 223
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 222
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 221
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 220
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 218
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 217
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 216
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 214
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 213
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 211
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 210
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 209
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 207
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 203
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201
  ...
2019-05-31 08:34:32 -07:00
Thomas Gleixner
25763b3c86 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 107 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528171438.615055994@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:53 -07:00
Thomas Gleixner
468e15fdc2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 170
Based on 1 normalized pattern(s):

  this file is release under the gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070034.216732358@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:39 -07:00
Thomas Gleixner
c942fddf87 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157
Based on 3 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version [author] [kishon] [vijay] [abraham]
  [i] [kishon]@[ti] [com] this program is distributed in the hope that
  it will be useful but without any warranty without even the implied
  warranty of merchantability or fitness for a particular purpose see
  the gnu general public license for more details

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version [author] [graeme] [gregory]
  [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
  [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
  [hk] [hemahk]@[ti] [com] this program is distributed in the hope
  that it will be useful but without any warranty without even the
  implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1105 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:37 -07:00
Thomas Gleixner
1a59d1b8e0 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not write to the free software foundation inc
  59 temple place suite 330 boston ma 02111 1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1334 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:35 -07:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Paul Moore
839d05e413 audit: remove the BUG() calls in the audit rule comparison functions
The audit_data_to_entry() function ensures that the operator is valid
so we can get rid of these BUG() calls.  We keep the "return 0" just
so the system behaves in a sane-ish manner should something go
horribly wrong.

Signed-off-by: Paul Moore <paul@paul-moore.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
2019-05-30 12:53:42 -04:00
Eric W. Biederman
f6e2aa91a4 signal/ptrace: Don't leak unitialized kernel memory with PTRACE_PEEK_SIGINFO
Recently syzbot in conjunction with KMSAN reported that
ptrace_peek_siginfo can copy an uninitialized siginfo to userspace.
Inspecting ptrace_peek_siginfo confirms this.

The problem is that off when initialized from args.off can be
initialized to a negaive value.  At which point the "if (off >= 0)"
test to see if off became negative fails because off started off
negative.

Prevent the core problem by adding a variable found that is only true
if a siginfo is found and copied to a temporary in preparation for
being copied to userspace.

Prevent args.off from being truncated when being assigned to off by
testing that off is <= the maximum possible value of off.  Convert off
to an unsigned long so that we should not have to truncate args.off,
we have well defined overflow behavior so if we add another check we
won't risk fighting undefined compiler behavior, and so that we have a
type whose maximum value is easy to test for.

Cc: Andrei Vagin <avagin@gmail.com>
Cc: stable@vger.kernel.org
Reported-by: syzbot+0d602a1b0d8c95bdf299@syzkaller.appspotmail.com
Fixes: 84c751bd4a ("ptrace: add ability to retrieve signals without removing from a queue (v4)")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-30 06:38:29 -05:00
Linus Torvalds
9e82b4a91d This fixes a memory leak from the error path in the event filter logic.
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXO303xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlb7AP4179SWPy9RFxvxyjTmpPFPL9oR0Q26
 sOTyIBN99MUTsgEA0FNWz7/FWtFDa1wbh0tEVreaTQlKEeoIYF96dkN0iwE=
 =BJgg
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "This fixes a memory leak from the error path in the event filter
  logic"

* tag 'trace-v5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Avoid memory leak in predicate_parse()
2019-05-29 11:26:40 -07:00
Eric W. Biederman
a89e9b8abf signal: Remove the signal number and task parameters from force_sig_info
force_sig_info always delivers to the current task and the signal
parameter always matches info.si_signo.  So remove those parameters to
make it a simpler less error prone interface, and to make it clear
that none of the callers are doing anything clever.

This guarantees that force_sig_info will not grow any new buggy
callers that attempt to call force_sig on a non-current task, or that
pass an signal number that does not match info.si_signo.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:44 -05:00
Eric W. Biederman
59c0e696a6 signal: Factor force_sig_info_to_task out of force_sig_info
All callers of force_sig_info pass info.si_signo in for the signal
by definition as well as in practice.

Further all callers of force_sig_info except force_sig_fault_to_task
pass current as the target task to force_sig_info.

Factor out a static force_sig_info_to_task that
force_sig_fault_to_task can call.

This prepares the way for force_sig_info to have it's task and signal
parameters removed.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:44 -05:00
Eric W. Biederman
ffafd23b2c signal: Generate the siginfo in force_sig
In preparation for removing the special case in force_sig_info for
only having a signal number generate an appropriate siginfo in
force_sig the last caller of force_sig_info that does not
pass a filled out siginfo.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:44 -05:00
Eric W. Biederman
8ad23dea80 signal: Move the computation of force into send_signal and correct it.
Forcing a signal or not allowing a pid namespace init to ignore
SIGKILL or SIGSTOP is more cleanly computed in send_signal.

There are two cases where we don't allow a pid namespace init
to ignore SIGKILL or SIGSTOP.  If the sending process is
from an ancestor pid namespace and as such is effectively
the god to the target process, and if the it is the kernel
that is sending the signal, not another application.

It is known that a process is from an ancestor pid namespace if
it can see it's target but it's target does not have a pid for
the sender in it's pid namespace.

It is know that a signal is sent from the kernel if si_code is set to
SI_KERNEL or info is SEND_SIG_PRIV (which ultimately generates
a signal with si_code == SI_KERNEL).

The only signals that matter are SIGKILL and SIGSTOP neither of
which can really be caught, and both of which always have a siginfo
layout that includes si_uid and si_pid.  Therefore we never need
to worry about forcing a signal when si_pid and si_uid are absent.

So handle the two special cases of info and the case when si_pid and
si_uid are present.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:43 -05:00
Eric W. Biederman
8917bef336 signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
Any time siginfo is not stored in the signal queue information is
lost.  Therefore set TRACE_SIGNAL_LOSE_INFO every time the code does
not allocate a signal queue entry, and a queue overflow abort is not
triggered.

Fixes: ba005e1f41 ("tracepoint: Add signal loss events")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:43 -05:00
Eric W. Biederman
2e1661d267 signal: Remove the task parameter from force_sig_fault
As synchronous exceptions really only make sense against the current
task (otherwise how are you synchronous) remove the task parameter
from from force_sig_fault to make it explicit that is what is going
on.

The two known exceptions that deliver a synchronous exception to a
stopped ptraced task have already been changed to
force_sig_fault_to_task.

The callers have been changed with the following emacs regular expression
(with obvious variations on the architectures that take more arguments)
to avoid typos:

force_sig_fault[(]\([^,]+\)[,]\([^,]+\)[,]\([^,]+\)[,]\W+current[)]
->
force_sig_fault(\1,\2,\3)

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:43 -05:00
Eric W. Biederman
91ca180dbd signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
In preparation for removing the task parameter from force_sig_fault
introduce force_sig_fault_to_task and use it for the two cases where
it matters.

On mips force_fcr31_sig calls force_sig_fault and is called on either
the current task, or a task that is suspended and is being switched to
by the scheduler.  This is safe because the task being switched to by
the scheduler is guaranteed to be suspended.  This ensures that
task->sighand is stable while the signal is delivered to it.

On parisc user_enable_single_step calls force_sig_fault and is in turn
called by ptrace_request.  The function ptrace_request always calls
user_enable_single_step on a child that is stopped for tracing.  The
child being traced and not reaped ensures that child->sighand is not
NULL, and that the child will not change child->sighand.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-29 09:31:43 -05:00
Stanislav Fomichev
e672db03ab bpf: tracing: properly use bpf_prog_array api
Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-29 15:17:35 +02:00
Stanislav Fomichev
dbcc1ba26e bpf: cgroup: properly use bpf_prog_array api
Now that we don't have __rcu markers on the bpf_prog_array helpers,
let's use proper rcu_dereference_protected to obtain array pointer
under mutex.

We also don't need __rcu annotations on cgroup_bpf.inactive since
it's not read/updated concurrently.

v4:
* drop cgroup_rcu_xyz wrappers and use rcu APIs directly; presumably
  should be more clear to understand which mutex/refcount protects
  each particular place

v3:
* amend cgroup_rcu_dereference to include percpu_ref_is_dying;
  cgroup_bpf is now reference counted and we don't hold cgroup_mutex
  anymore in cgroup_bpf_release

v2:
* replace xchg with rcu_swap_protected

Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-29 15:17:35 +02:00
Stanislav Fomichev
54e9c9d4b5 bpf: remove __rcu annotations from bpf_prog_array
Drop __rcu annotations and rcu read sections from bpf_prog_array
helper functions. They are not needed since all existing callers
call those helpers from the rcu update side while holding a mutex.
This guarantees that use-after-free could not happen.

In the next patches I'll fix the callers with missing
rcu_dereference_protected to make sparse/lockdep happy, the proper
way to use these helpers is:

	struct bpf_prog_array __rcu *progs = ...;
	struct bpf_prog_array *p;

	mutex_lock(&mtx);
	p = rcu_dereference_protected(progs, lockdep_is_held(&mtx));
	bpf_prog_array_length(p);
	bpf_prog_array_copy_to_user(p, ...);
	bpf_prog_array_delete_safe(p, ...);
	bpf_prog_array_copy_info(p, ...);
	bpf_prog_array_copy(p, ...);
	bpf_prog_array_free(p);
	mutex_unlock(&mtx);

No functional changes! rcu_dereference_protected with lockdep_is_held
should catch any cases where we update prog array without a mutex
(I've looked at existing call sites and I think we hold a mutex
everywhere).

Motivation is to fix sparse warnings:
kernel/bpf/core.c:1803:9: warning: incorrect type in argument 1 (different address spaces)
kernel/bpf/core.c:1803:9:    expected struct callback_head *head
kernel/bpf/core.c:1803:9:    got struct callback_head [noderef] <asn:4> *
kernel/bpf/core.c:1877:44: warning: incorrect type in initializer (different address spaces)
kernel/bpf/core.c:1877:44:    expected struct bpf_prog_array_item *item
kernel/bpf/core.c:1877:44:    got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1901:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1901:26:    expected struct bpf_prog_array_item *existing
kernel/bpf/core.c:1901:26:    got struct bpf_prog_array_item [noderef] <asn:4> *
kernel/bpf/core.c:1935:26: warning: incorrect type in assignment (different address spaces)
kernel/bpf/core.c:1935:26:    expected struct bpf_prog_array_item *[assigned] existing
kernel/bpf/core.c:1935:26:    got struct bpf_prog_array_item [noderef] <asn:4> *

v2:
* remove comment about potential race; that can't happen
  because all callers are in rcu-update section

Cc: Roman Gushchin <guro@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-29 15:17:35 +02:00
Richard Guy Briggs
0223fad3c9 audit: enforce op for string fields
The field operator is ignored on several string fields.  WATCH, DIR,
PERM and FILETYPE field operators are completely ignored and meaningless
since the op is not referenced in audit_filter_rules().  Range and
bitwise operators are already addressed in ghak73.

Honour the operator for WATCH, DIR, PERM, FILETYPE fields as is done in
the EXE field.

Please see github issue
https://github.com/linux-audit/audit-kernel/issues/114

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-05-28 17:46:43 -04:00
Ingo Molnar
849e96f300 perf/urgent fixes:
BPF:
 
   Jiri Olsa:
 
   - Fixup determination of end of kernel map, to avoid having BPF programs,
     that are after the kernel headers and just before module texts mixed up in
     the kernel map.
 
 tools UAPI header copies:
 
   Arnaldo Carvalho de Melo:
 
   - Update copy of files related to new fspick, fsmount, fsconfig, fsopen,
     move_mount and open_tree syscalls.
 
   - Sync cpufeatures.h, sched.h, fs.h, drm.h, i915_drm.h and kvm.h headers.
 
 Namespaces:
 
   Namhyung Kim:
 
   - Add missing byte swap ops for namespace events when processing records from
     perf.data files that could have been recorded in a arch with a different
     endianness.
 
   - Fix access to the thread namespaces list by using the namespaces_lock.
 
 perf data:
 
   Shawn Landden:
 
   - Fix 'strncat may truncate' build failure with recent gcc.
 
 s/390
 
   Thomas Richter:
 
   - Fix s390 missing module symbol and warning for non-root users in 'perf record'.
 
 arm64:
 
   Vitaly Chikunov:
 
   - Fix mksyscalltbl when system kernel headers are ahead of the kernel.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXO1vsQAKCRCyPKLppCJ+
 J5MrAQCrxsTz1Lc6GrStrMMX72BqmoEPzoCkmONCukVJCcXeEQEAzdz4I4/CNG3g
 phtc030+Njnc8X5qpkR9kqSQuaPjWAk=
 =1Fbq
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-for-mingo-5.2-20190528' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/urgent fixes:

BPF:

  Jiri Olsa:

  - Fixup determination of end of kernel map, to avoid having BPF programs,
    that are after the kernel headers and just before module texts mixed up in
    the kernel map.

tools UAPI header copies:

  Arnaldo Carvalho de Melo:

  - Update copy of files related to new fspick, fsmount, fsconfig, fsopen,
    move_mount and open_tree syscalls.

  - Sync cpufeatures.h, sched.h, fs.h, drm.h, i915_drm.h and kvm.h headers.

Namespaces:

  Namhyung Kim:

  - Add missing byte swap ops for namespace events when processing records from
    perf.data files that could have been recorded in a arch with a different
    endianness.

  - Fix access to the thread namespaces list by using the namespaces_lock.

perf data:

  Shawn Landden:

  - Fix 'strncat may truncate' build failure with recent gcc.

s/390

  Thomas Richter:

  - Fix s390 missing module symbol and warning for non-root users in 'perf record'.

arm64:

  Vitaly Chikunov:

  - Fix mksyscalltbl when system kernel headers are ahead of the kernel.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-28 23:16:22 +02:00
Tomas Bortoli
dfb4a6f219 tracing: Avoid memory leak in predicate_parse()
In case of errors, predicate_parse() goes to the out_free label
to free memory and to return an error code.

However, predicate_parse() does not free the predicates of the
temporary prog_stack array, thence leaking them.

Link: http://lkml.kernel.org/r/20190528154338.29976-1-tomasbortoli@gmail.com

Cc: stable@vger.kernel.org
Fixes: 80765597bc ("tracing: Rewrite filter logic to be simpler and faster")
Reported-by: syzbot+6b8e0fb820e570c59e19@syzkaller.appspotmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
[ Added protection around freeing prog_stack[i].pred ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-28 16:27:58 -04:00
Geert Uytterhoeven
43b98d876f genirq/irqdomain: Remove WARN_ON() on out-of-memory condition
There is no need to print a backtrace when memory allocation fails, as
the memory allocation core already takes care of that.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: https://lkml.kernel.org/r/20190527115742.2693-1-geert+renesas@glider.be
2019-05-28 13:10:55 -07:00
Jiri Kosina
f560201102 cpu/hotplug: Fix notify_cpu_starting() reference in bringup_wait_for_ap()
bringup_wait_for_ap() comment references cpu_notify_starting(), but the 
function is actually called notify_cpu_starting(). Fix that.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1905282128100.1962@cbobk.fhfr.pm
2019-05-28 12:59:03 -07:00
Waiman Long
5ca584d935 futex: Consolidate duplicated timer setup code
Add a new futex_setup_timer() helper function to consolidate all the
hrtimer_sleeper setup code.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: https://lkml.kernel.org/r/20190528160345.24017-1-longman@redhat.com
2019-05-28 11:12:00 -07:00
Roman Gushchin
4bfc0bb2c6 bpf: decouple the lifetime of cgroup_bpf from cgroup itself
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.

A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.

Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.

The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-28 09:30:02 -07:00
Paul E. McKenney
354ea05d02 rcutorture: Upper case solves the case of the vanishing NULL pointer
Various security techniques can obfuscate pointer printouts on the
console.  Unfortunately, rcutorture relies on either "null" or all zeroes
to identify the last few statistics printouts at the end of the test.
These need to be identified because failing to do so will results in
false-positive complaints about grace-period hangs.

This commit therefore prints the "ver:" in capitals ("VER:") when
the RCU-protected pointer has been set to NULL, which causes rcutorture's
parse-console.sh script to correctly ignore these lines.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
34aa34b818 rcutorture: Dump trace buffer for callback pipe drain failures
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
c682db558e rcutorture: Add trivial RCU implementation
I have been showing off a trivial RCU implementation for non-preemptive
environments for some time now:

	#define rcu_read_lock()
	#define rcu_read_unlock()
	#define rcu_dereference(p) READ_ONCE(p)
	#define rcu_assign_pointer(p, v) smp_store_release(&(p), (v))
	void synchronize_rcu(void)
	{
	int cpu;
		for_each_online_cpu(cpu)
			sched_setaffinity(current->pid, cpumask_of(cpu));
	}

Trivial or not, as the old saying goes, "if it ain't tested, it don't
work!".  This commit therefore adds a "trivial" flavor to rcutorture
and a corresponding TRIVIAL test scenario.  This variant does not handle
CPU hotplug, which is unconditionally enabled on x86 for post-v5.1-rc3
kernels, which is why the TRIVIAL.boot says "rcutorture.onoff_interval=0".
This commit actually does handle CONFIG_PREEMPT=y kernels, but only
because it turns back the Linux-kernel clock in order to provide these
alternative definitions (or the moral equivalent thereof):

	#define rcu_read_lock() preempt_disable()
	#define rcu_read_unlock() preempt_enable()

In CONFIG_PREEMPT=n kernels without debugging, these are equivalent to
empty macros give or take a compiler barrier.  However, the have been
successfully tested with actual empty macros as well.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Fix symbol issue reported by kbuild test robot <lkp@intel.com>. ]
[ paulmck: Work around sched_setaffinity() issue noted by Andrea Parri. ]
[ paulmck: Add rcutorture.shuffle_interval=0 to TRIVIAL.boot to fix
  interaction with shuffler task noted by Peter Zijlstra. ]
Tested-by: Andrea Parri <andrea.parri@amarulasolutions.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
3432d765c5 rcutorture: Halt forward-progress checks at end of run
Once removed, an rcu_torture element can be deferred-freed by a chain
of call_rcu() invocations, with each callback invoking another round of
call_rcu() until either a fixed number of call_rcu() invocations have
been chained or until the test ends.  This means that if the test ends,
some of the rcu_torture elements will be "stranded" partway through the
deferred-free process, which results in false-positive warnings from
rcu_torture_writer() due to lack of forward progress should the test
end just at the end of a stutter interval.

This commit therefore suppresses rcu_torture_writer()'s forward-progress
checks when the test ends in order to avoid these false-positive reports..

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
ab21f6081f rcutorture: Give the scheduler a chance on PREEMPT && NO_HZ_FULL kernels
In !PREEMPT kernels, cond_resched() is a no-op.  In NO_HZ_FULL kernels,
in-kernel execution (such as that of rcutorture's kthreads) might extend
indefinitely without the scheduler gaining the aid of a scheduling-clock
interrupt.  This combination can make the interaction of an rcutorture
forward-progress test and a CPU-hotplug stop_machine operation make less
forward progress than one might like.  Additionally, Sebastian Siewior
notes that NO_HZ_FULL kernels have a scheduler check upon return to
userspace execution, which suggests that in-kernel emulation of tight
userspace loops containing system calls doing call_rcu() might also need
explicit checks in the PREEMPT && NO_HZ_FULL case.

This commit therefore introduces a rcu_torture_fwd_prog_cond_resched()
function that explicitly invokes schedule() in such kernels whenever
need_resched() returns true, while retaining use of cond_resched()
for kernels that are either !PREEMPT or !NO_HZ_FULL.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
5eabea594b rcutorture: Exempt tasks RCU from timely draining of grace periods
After the end of each stutter pause interval, the rcu_torture_writer()
kthread checks to be sure that all prior callbacks have completed so
that all the test structures have been freed.  This works fine except
for tasks RCU, in which grace periods can take one good long time.
This commit therefore exempts tasks RCU from this check.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
ff3bf92d90 torture: Allow inter-stutter interval to be specified
Currently, the inter-stutter interval is the same as the stutter duration,
that is, whatever number of jiffies is passed into torture_stutter_init().
This has worked well for quite some time, but the addition of
forward-progress testing to rcutorture can delay processes for several
seconds, which can triple the time that they are stuttered.

This commit therefore adds a second argument to torture_stutter_init()
that specifies the inter-stutter interval.  While locktorture preserves
the current behavior, rcutorture uses the RCU CPU stall warning interval
to provide a wider inter-stutter interval.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
e8516c64fe rcutorture: Fix stutter_wait() return value and freelist checks
The stutter_wait() function is supposed to return true if it actually
waits and false otherwise, but it instead unconditionally returns false.
Which hides a bug in rcu_torture_writer() that fails to account for
the fact that one of the rcu_tortures[] array elements will normally be
referenced by rcu_torture_current, and thus not be on the freelist.

This commit therefore corrects the stutter_wait() return value and adds a
check for rcu_torture_current to rcu_torture_writer()'s check that things
get freed after everything goes quiescent.  In addition, this commit
causes torture_stutter() to give a bit more than one second (instead of
only one jiffy) warning of the end of the stutter interval.  Finally,
this commit disables long-delay readers and aggressive update-side
forward-progress checks while forward-progress testing is in flight.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Paul E. McKenney
140e53f20b rcutorture: Add cond_resched() to forward-progress free-up loop
The rcu_torture_fwd_prog_cbfree() function frees callbacks used during
rcutorture's call_rcu() forward-progress test, but does so in a tight
loop.  This could cause problems given a very long list of callbacks to be
freed, and actual testing produces lists with as many as 25M callbacks.
This commit therefore adds a cond_resched() to this loop.  While in
the area, this commit also rearranges the lock releases to look a bit
more sane.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:06:09 -07:00
Oleg Nesterov
89da3b94bb rcu/sync: Simplify the state machine
With this patch rcu_sync has a single state variable and the transition rules
become really simple:

	GP_IDLE   - owned by the first rcu_sync_enter() which moves it to

	GP_ENTER  - owned by rcu-callback which moves it to

	GP_PASSED - owned by the last rcu_sync_exit() which moves it to

	GP_EXIT   - and this is the only "nontrivial" state.

		rcu-callback moves it back to GP_IDLE unless another enter()
		comes before a GP pass.

		If rcu-callback is invoked before the next rcu_sync_exit() it
		must see gp_count incremented by that enter() and set GP_PASSED.

		Otherwise, if the next rcu_sync_exit() wins the race, it will
		move it to

	GP_REPLAY - owned by rcu-callback which moves it to GP_EXIT

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: While here, apply READ_ONCE() and WRITE_ONCE() to ->gp_state. ]
[ paulmck: Tweaks to make htmldocs happy. (Reported by kbuild test robot.) ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:05:23 -07:00
Oleg Nesterov
3f2947b781 locking/percpu-rwsem: Add DEFINE_PERCPU_RWSEM(), use it to initialize cgroup_threadgroup_rwsem
Turn DEFINE_STATIC_PERCPU_RWSEM() into __DEFINE_PERCPU_RWSEM() with the
additional "is_static" argument to introduce DEFINE_PERCPU_RWSEM().

Change cgroup.c to use DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:05:23 -07:00
Oleg Nesterov
2bf1acc299 uprobes: Use DEFINE_STATIC_PERCPU_RWSEM() to initialize dup_mmap_sem
Use DEFINE_STATIC_PERCPU_RWSEM() to initialize dup_mmap_sem.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:05:23 -07:00
Oleg Nesterov
95bf33b55f rcu/sync: Kill rcu_sync_type/gp_type
Now that the RCU flavors have been consolidated, rcu_sync_type makes no
sense because none of internal update functions aside from .held() depend
on gp_type.  This commit therefore removes this field and consolidates
the relevant code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ paulmck: Added RCU and RCU-bh checks to rcu_sync_is_idle(). ]
[ paulmck: And applied subsequent feedback from Oleg Nesterov. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:05:23 -07:00
Jiang Biao
11b000457f rcu: Make __call_srcu static
Because __call_srcu() is not used outside kernel/rcu/srcutree.c,
this commit makes it static.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:03:35 -07:00
Paul E. McKenney
fe15b50cde srcu: Allocate per-CPU data for DEFINE_SRCU() in modules
Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires
that the size of the reserved region be increased, which is not something
we want to be doing all that often.  One approach would be to require
that loadable modules define an srcu_struct and invoke init_srcu_struct()
from their module_init function and cleanup_srcu_struct() from their
module_exit function.  However, this is more than a bit user unfriendly.

This commit therefore creates an ___srcu_struct_ptrs linker section,
and pointers to srcu_struct structures created by DEFINE_SRCU() and
DEFINE_STATIC_SRCU() within a module are placed into that module's
___srcu_struct_ptrs section.  The required init_srcu_struct() and
cleanup_srcu_struct() functions are then automatically invoked as needed
when that module is loaded and unloaded, thus allowing modules to continue
to use DEFINE_SRCU() and DEFINE_STATIC_SRCU() while avoiding the need
to increase the size of the reserved region.

Many of the algorithms and some of the code was cheerfully cherry-picked
from other code making use of linker sections, perhaps most notably from
tracepoints.  All bugs are nevertheless the sole property of the author.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ paulmck: Use __section() and use "default" in srcu_module_notify()'s
  "switch" statement as suggested by Joel Fernandes. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-05-28 09:03:35 -07:00
Paul E. McKenney
d5a9a8c3bc rcu: Set a maximum limit for back-to-back callback invocation
Currently, if a CPU has more than 10,000 callbacks pending, it will
increase rdp->blimit to LONG_MAX.  If you are lucky, LONG_MAX is only
about two billion, but this is still a bit too many callbacks to invoke
back-to-back while otherwise ignoring the world.

This commit therefore sets a maximum limit of DEFAULT_MAX_RCU_BLIMIT,
which is set to 10,000, for rdp->blimit.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:02:57 -07:00
Neeraj Upadhyay
3ae976a7e3 rcu: Correctly unlock root node in rcu_check_gp_start_stall()
On systems whose rcu_node tree has only one node, the
rcu_check_gp_start_stall() function's values of rnp and rnp_root will
be identical.  In this case, it clearly does not make sense to release
both rnp->lock and rnp_root->lock, but that is exactly what this function
does in the last early exit.  This commit therefore unlocks only rnp->lock
when rnp and rnp_root are equal.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:02:57 -07:00
Neeraj Upadhyay
cd6d17b4a4 rcu: Dump specified number of blocked tasks
The dump_blkd_tasks() function dumps at most 10 blocked tasks, ignoring
the value of the ncheck parameter.  This commit therefore substitutes
the value of ncheck for the hard-coded value of 10.  Because all callers
currently pass 10 as the number, this patch does not change behavior,
but it is clearly an accident waiting to happen.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 09:02:57 -07:00
Jiang Biao
f0b6356273 rcu: Remove unused rdp local from synchronize_rcu_expedited()
Because rdp is initialized but never used in synchronize_rcu_expedited(),
this commit removes it.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 08:48:19 -07:00
Paul E. McKenney
1bb336443c rcu: Rename rcu_data's ->deferred_qs to ->exp_deferred_qs
The rcu_data structure's ->deferred_qs field is used to indicate that the
current CPU is blocking an expedited grace period (perhaps a future one).
Given that it is used only for expedited grace periods, its current name
is misleading, so this commit renames it to ->exp_deferred_qs.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 08:48:19 -07:00
Joel Fernandes (Google)
eddded8012 rcu: Add checks for dynticks counters in rcu_is_cpu_rrupt_from_idle()
It would be good to combine the dynticks and dynticks_nesting counters
in order to simplify the code.  Unfortunately, there are concerns
about usermode upcalls appearing to RCU as half of an interrupt, as
Byungchul learned [1].  The "half" in "half interrupt" is due to an
unpaired rcu_irq_enter(): Normally, each rcu_irq_enter() has a later
call to rcu_irq_exit().

Out of an abundance of caution, Paul added warnings [2] in the RCU
code which if not fired by 2021 will be interpreted as meaning that
this half-interrupt scenario cannot happen any more, thus permitting
simplification of this code.

In the meantime, this commit makes the following changes:

(1) Combining these two counters requires that rcu_rrupt_from_idle()
    is invoked only from hard-interrupt contexts as discussed here [3].
    This commit therefore adds the required lockdep_assert_in_irq()
    to check this constraint.

(2) Furthermore, rcu_rrupt_from_idle() is not explicit about how it
    is using the counters which can lead to weird future bugs. This
    commit therefore adds comments indicating the meaning and use of
    each counter.

(3) Lastly, this commit checks for counter underflows as another check
    that half interrupts don't occur.  (Previously, the function would
    simply return true upon underflow.)

All these checks checks are NOOPs if PROVE_LOCKING (and thus PROVE_RCU)
are disabled.

[1] https://lore.kernel.org/patchwork/patch/952349/
[2] Commit e11ec65cc8 ("rcu: Add warning to detect half-interrupts")
[3] https://lore.kernel.org/lkml/20190312150514.GB249405@google.com/

Cc: byungchul.park@lge.com
Cc: kernel-team@android.com
Cc: rcu@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-28 08:48:19 -07:00
Steven Rostedt (VMware)
86b3de60a0 ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS
Commit c19fa94a8f ("Add HAVE_64BIT_ALIGNED_ACCESS") added the config for
architectures that required 64bit aligned access for all 64bit words. As
the ftrace ring buffer stores data on 4 byte alignment, this config option
was used to force it to store data on 8 byte alignment to make sure the data
being stored and written directly into the ring buffer was 8 byte aligned as
it would cause issues trying to write an 8 byte word on a 4 not 8 byte
aligned memory location.

But with the removal of the metag architecture, which was the only
architecture to use this, there is no architecture supported by Linux that
requires 8 byte aligne access for all 8 byte words (4 byte alignment is good
enough). Removing this config can simplify the code a bit.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-28 09:36:19 -04:00
Yonghong Song
e1afb70252 bpf: check signal validity in nmi for bpf_send_signal() helper
Commit 8b401f9ed2 ("bpf: implement bpf_send_signal() helper")
introduced bpf_send_signal() helper. If the context is nmi,
the sending signal work needs to be deferred to irq_work.
If the signal is invalid, the error will appear in irq_work
and it won't be propagated to user.

This patch did an early check in the helper itself to notify
user invalid signal, as suggested by Daniel.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-28 10:51:33 +02:00
Eric W. Biederman
f8eac9011b signal: Remove task parameter from force_sig_mceerr
All of the callers pass current into force_sig_mceer so remove the
task parameter to make this obvious.

This also makes it clear that force_sig_mceerr passes current
into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Eric W. Biederman
3cf5d076fb signal: Remove task parameter from force_sig
All of the remaining callers pass current into force_sig so
remove the task parameter to make this obvious and to make
misuse more difficult in the future.

This also makes it clear force_sig passes current into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Eric W. Biederman
cb44c9a0ab signal: Remove task parameter from force_sigsegv
The function force_sigsegv is always called on the current task
so passing in current is redundant and not passing in current
makes this fact obvious.

This also makes it clear force_sigsegv always calls force_sig
on the current task.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Eric W. Biederman
f9070dc945 signal/pid_namespace: Fix reboot_pid_ns to use send_sig not force_sig
The locking in force_sig_info is not prepared to deal with a task that
exits or execs (as sighand may change).  The is not a locking problem
in force_sig as force_sig is only built to handle synchronous
exceptions.

Further the function force_sig_info changes the signal state if the
signal is ignored, or blocked or if SIGNAL_UNKILLABLE will prevent the
delivery of the signal.  The signal SIGKILL can not be ignored and can
not be blocked and SIGNAL_UNKILLABLE won't prevent it from being
delivered.

So using force_sig rather than send_sig for SIGKILL is confusing
and pointless.

Because it won't impact the sending of the signal and and because
using force_sig is wrong, replace force_sig with send_sig.

Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Fixes: cf3f89214e ("pidns: add reboot_pid_ns() to handle the reboot syscall")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
Rafael J. Wysocki
bb1869012d ACPI: PM: Call pm_set_suspend_via_firmware() during hibernation
On systems with ACPI platform firmware the last stage of hibernation
is analogous to system suspend to S3 (suspend-to-RAM), so it should
be handled analogously.  In particular, pm_suspend_via_firmware()
should return 'true' in that stage to let the callers of it know that
control will be passed to the platform firmware going forward, so
pm_set_suspend_via_firmware() needs to be called then in analogy with
acpi_suspend_begin().

However, the platform hibernation ->begin() callback is invoked
during the "freeze" transition (before creating a snapshot image of
system memory) as well as during the "hibernate" transition which is
the last stage of it and pm_set_suspend_via_firmware() should be
invoked by that callback in the latter stage only.

In order to implement that redefine the hibernation ->begin()
callback to take a pm_message_t argument to indicate which stage
of hibernation is taking place and rework acpi_hibernation_begin()
and acpi_hibernation_begin_old() to take it into account as needed.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-05-27 10:51:45 +02:00
Linus Torvalds
c5b440951a Make the GCC 9 warning for sub struct memset go away.
GCC 9 now warns about calling memset() on partial structures when it
 goes across multiple fields. This adds a helper for the place in
 tracing that does this type of clearing of a structure.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXOrlfhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoDhAP4mogBm0JjJ1LWr8RX2/X7qFm0x1zLz
 5Mk0QKfeRP3MYgEAl2mV/HeFp7aMxEY2CKy0LslmaXPhamPx1r0LlfMgIws=
 =drP3
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing warning fix from Steven Rostedt:
 "Make the GCC 9 warning for sub struct memset go away.

  GCC 9 now warns about calling memset() on partial structures when it
  goes across multiple fields. This adds a helper for the place in
  tracing that does this type of clearing of a structure"

* tag 'trace-v5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Silence GCC 9 array bounds warning
2019-05-26 13:49:40 -07:00
Cheng Jian
a124692b69 ftrace: Enable trampoline when rec count returns back to one
Custom trampolines can only be enabled if there is only a single ops
attached to it. If there's only a single callback registered to a function,
and the ops has a trampoline registered for it, then we can call the
trampoline directly. This is very useful for improving the performance of
ftrace and livepatch.

If more than one callback is registered to a function, the general
trampoline is used, and the custom trampoline is not restored back to the
direct call even if all the other callbacks were unregistered and we are
back to one callback for the function.

To fix this, set FTRACE_FL_TRAMP flag if rec count is decremented
to one, and the ops that left has a trampoline.

Testing After this patch :

insmod livepatch_unshare_files.ko
cat /sys/kernel/debug/tracing/enabled_functions

	unshare_files (1) R I	tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0

echo unshare_files > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/enabled_functions

	unshare_files (2) R I ->ftrace_ops_list_func+0x0/0x150

echo nop > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/enabled_functions

	unshare_files (1) R I	tramp: 0xffffffffc0000000(klp_ftrace_handler+0x0/0xa0) ->ftrace_ops_assist_func+0x0/0xf0

Link: http://lkml.kernel.org/r/1556969979-111047-1-git-send-email-cj.chengjian@huawei.com

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:43 -04:00
Steven Rostedt (VMware)
b6399cc789 tracing/kprobe: Do not run kprobe boot tests if kprobe_event is on cmdline
When having kprobe trace event start up tests enabled and adding a
kprobe_event on the kernel command line, it produced the following:

 trace_kprobe: Testing kprobe tracing:
 WARNING: CPU: 5 PID: 1 at kernel/trace/trace_kprobe.c:1724 kprobe_trace_self_tests_init+0x32d/0x36b
 Modules linked in:
 CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc1-test+ #249
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 RIP: 0010:kprobe_trace_self_tests_init+0x32d/0x36b
 Code: b7 e8 4f 8d a2 fe 85 c0 74 10 0f 0b 48 c7 c7 c8 1b 0d b7 ff c3 e8 19 af 99 fe 48 c7 c7 40 93 27 b7 e8 7f 1a a5 fe 85 c0 74 10 <0f> 0b 48 c7 c7 f8 1b 0d b7 ff c3 e8 f9 ae
9 a0 fe 85
 RSP: 0018:ffffb36e40653e08 EFLAGS: 00010286
 RAX: 00000000fffffff0 RBX: 0000000000000000 RCX: ffffb36e40653d5c
 RDX: 0000000000000000 RSI: ffffffffb72776e0 RDI: 0000000000000246
 RBP: ffff98414fe58ff8 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: ffff98415d8aa940 R12: 0000000000000000
 R13: ffffffffb737c1b0 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff98415ea80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f959ce741b8 CR3: 000000011a210002 CR4: 00000000001606e0
 Call Trace:
  ? init_kprobe_trace+0x19e/0x19e
  ? do_early_param+0x8e/0x8e
  do_one_initcall+0x6f/0x2b4
  ? do_early_param+0x8e/0x8e
  kernel_init_freeable+0x21d/0x2c6
  ? rest_init+0x146/0x146
  kernel_init+0xa/0x10a
  ret_from_fork+0x3a/0x50
 ---[ end trace 488430c083a4c956 ]---

As with the trace events, if a trace event is set on the kernel command
line, the trace events start up tests are suspended. The kprobe start up
tests should do the same when a kprobe is enabled on the kernel command
line.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:43 -04:00
Steven Rostedt (VMware)
b3015fe41d tracing: Make a separate config for trace event self tests
The trace event self tests enable loop through *all* events, enables each
one, one at a time, runs some code to trigger various events (not
necessarily the same events), and checks if anything went wrong. The issue
is that trace events are usually the least likely start up test to cause a
problem, but they take the longest to run (because there are so many
events). When one of the other tests trigger a bug, the trace event start up
tests causes the bisect to take much longer, because it takes 10s of seconds
to get through the trace event tests.

By making them a separate config (even though they are enabled by default if
start up tests are set), it is possible to turn them off and still run the
other tracing start up tests much quicker.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:43 -04:00
Masami Hiramatsu
970988e19e tracing/kprobe: Add kprobe_event= boot parameter
Add kprobe_event= boot parameter to define kprobe events
at boot time.
The definition syntax is similar to tracefs/kprobe_events
interface, but use ',' and ';' instead of ' ' and '\n'
respectively. e.g.

  kprobe_event=p,vfs_read,$arg1,$arg2

This puts a probe on vfs_read with argument1 and 2, and
enable the new event.

Link: http://lkml.kernel.org/r/155851395498.15728.830529496248543583.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:43 -04:00
Masami Hiramatsu
b5f8b32c93 kprobes: Initialize kprobes at postcore_initcall
Initialize kprobes at postcore_initcall level instead of module_init
since kprobes is not a module, and it depends on only subsystems
initialized in core_initcall.
This will allow ftrace kprobe event to add new events when it is
initializing because ftrace kprobe event is initialized at
later initcall level.

Link: http://lkml.kernel.org/r/155851394736.15728.13626739508905120098.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:43 -04:00
Masami Hiramatsu
539b75b2b9 tracing/kprobe: Cast user-space address correctly
Cast user-space address correctly to pass to probe_user_read().

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:43 -04:00
Matthias Kaehlcke
f08367b364 tracing: Use correct function name in trace_filter_add_remove_task() comment
The comment of trace_filter_add_remove_task() refers to the function as
'trace_pid_filter_add_remove_task', use the correct name.

Link: http://lkml.kernel.org/r/20190523192628.134406-1-mka@chromium.org

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:43 -04:00
Masami Hiramatsu
e65f7ae7f4 tracing/probe: Support user-space dereference
Support user-space dereference syntax for probe event arguments
to dereference the data-structure or array in user-space.

The syntax is just adding 'u' before an offset value.

 +|-u<OFFSET>(<FETCHARG>)

e.g. +u8(%ax), +u0(+0(%si))

For example, if you probe do_sched_setscheduler(pid, policy,
param) and record param->sched_priority, you can add new
probe as below;

 p do_sched_setscheduler priority=+u0($arg3)

Note that kprobe event provides this and it doesn't change the
dereference method automatically because we do not know whether
the given address is in userspace or kernel on some archs.

So as same as "ustring", this is an option for user, who has to
carefully choose the dereference method.

Link: http://lkml.kernel.org/r/155789872187.26965.4468456816590888687.stgit@devnote2

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:42 -04:00
Masami Hiramatsu
88903c4643 tracing/probe: Add ustring type for user-space string
Add "ustring" type for fetching user-space string from kprobe event.
User can specify ustring type at uprobe event, and it is same as
"string" for uprobe.

Note that probe-event provides this option but it doesn't choose the
correct type automatically since we have not way to decide the address
is in user-space or not on some arch (and on some other arch, you can
fetch the string by "string" type). So user must carefully check the
target code (e.g. if you see __user on the target variable) and
use this new type.

Link: http://lkml.kernel.org/r/155789871009.26965.14167558859557329331.stgit@devnote2

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:42 -04:00
Steven Rostedt (VMware)
7375dca164 ftrace: Make enable and update parameters bool when applicable
The code modification functions have "enable" and "update" variables that
are sometimes "int" but used as "bool". Remove the ambiguity and make them
"bool" when they are only used for true or false values.

Link: http://lkml.kernel.org/r/e1429923d9eda92a3cf5ee9e33c7eacce539781d.1558115654.git.naveen.n.rao@linux.vnet.ibm.com

Reported-by: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:42 -04:00
Miguel Ojeda
0c97bf863e tracing: Silence GCC 9 array bounds warning
Starting with GCC 9, -Warray-bounds detects cases when memset is called
starting on a member of a struct but the size to be cleared ends up
writing over further members.

Such a call happens in the trace code to clear, at once, all members
after and including `seq` on struct trace_iterator:

    In function 'memset',
        inlined from 'ftrace_dump' at kernel/trace/trace.c:8914:3:
    ./include/linux/string.h:344:9: warning: '__builtin_memset' offset
    [8505, 8560] from the object at 'iter' is out of the bounds of
    referenced subobject 'seq' with type 'struct trace_seq' at offset
    4368 [-Warray-bounds]
      344 |  return __builtin_memset(p, c, size);
          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to avoid GCC complaining about it, we compute the address
ourselves by adding the offsetof distance instead of referring
directly to the member.

Since there are two places doing this clear (trace.c and trace_kdb.c),
take the chance to move the workaround into a single place in
the internal header.

Link: http://lkml.kernel.org/r/20190523124535.GA12931@gmail.com

Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
[ Removed unnecessary parenthesis around "iter" ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:30 -04:00
Al Viro
d5f68d330c cpuset: move mount -t cpuset logics into cgroup.c
... and get rid of the weird dances in ->get_tree() - that logics
can be easily handled in ->init_fs_context().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 18:00:01 -04:00
Al Viro
f7a9945184 no need to protect against put_user_ns(NULL)
it's a no-op

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:54 -04:00
Paul E. McKenney
e015a34112 rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()
The sync_sched_exp_online_cleanup() is invoked at online time to handle
the case where the start of an expedited grace period ran concurrently
with a CPU being taken offline and then immediately being placed online.
It checks to see if RCU needs an expedited quiescent state from the
incoming CPU, sending it an IPI if so.  However, it is quite possible
that sync_sched_exp_online_cleanup() is running on that CPU, in which
case it is considerably less overhead to simply request the quiescent
state locally instead of simulating a self-IPI.

This commit therefore places the last few lines of rcu_exp_handler()
into a new rcu_exp_need_qs() function, which is invoked both by
rcu_exp_handler() and by sync_sched_exp_online_cleanup() in the self-IPI
case.

This also reduces the rcu_exp_handler() function's state space by
removing the direct call that this smp_call_function_single() uses to
emulate the requested self-IPI.  This in turn will allow tighter error
checking in rcu_is_cpu_rrupt_from_idle().

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2019-05-25 14:50:50 -07:00
Paul E. McKenney
b9ad4d6ed1 rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()
Although sync_rcu_exp_select_node_cpus() treats the current CPU as being
in a quiescent state, it might well migrate to some other CPU before
reaching the smp_call_function_single(), which could then result in an
unnecessary simulated self-IPI.  This commit therefore instead simply
refuses to invoke smp_call_function_single() on the current CPU, which
causes the later rcu_report_exp_cpu_mult() to report this CPU's quiescent
state with less overhead.

This also reduces the rcu_exp_handler() function's state space by removing
the direct call that this smp_call_function_single() uses to emulate the
requested self-IPI.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
[ paulmck: Use get_cpu() instead of preempt_disable() per Joel Fernandes. ]
2019-05-25 14:50:50 -07:00
Paul E. McKenney
43e903ad3e rcu: Inline invoke_rcu_callbacks() into its sole remaining caller
This commit saves a few lines of code by inlining invoke_rcu_callbacks()
into its sole remaining caller.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:49 -07:00
Paul E. McKenney
0864f057b0 rcu: Use irq_work to get scheduler's attention in clean context
When rcu_read_unlock_special() is invoked with interrupts disabled, is
either not in an interrupt handler or is not using RCU_SOFTIRQ, is not
the first RCU read-side critical section in the chain, and either there
is an expedited grace period in flight or this is a NO_HZ_FULL kernel,
the end of the grace period can be unduly delayed.  The reason for this
is that it is not safe to do wakeups in this situation.

This commit fixes this problem by using the irq_work subsystem to
force a later interrupt handler in a clean environment.  Because
set_tsk_need_resched(current) and set_preempt_need_resched() are
invoked prior to this, the scheduler will force a context switch
upon return from this interrupt (though perhaps at the end of any
interrupted preempt-disable or BH-disable region of code), which will
invoke rcu_note_context_switch() (again in a clean environment), which
will in turn give RCU the chance to report the deferred quiescent state.

Of course, by then this task might be within another RCU read-side
critical section.  But that will be detected at that time and reporting
will be further deferred to the outermost rcu_read_unlock().  See
rcu_preempt_need_deferred_qs() and rcu_preempt_deferred_qs() for more
details on the checking.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:49 -07:00
Paul E. McKenney
385b599e8c rcu: Allow rcu_read_unlock_special() to raise_softirq() if in_irq()
When running in an interrupt handler, raise_softirq() and
raise_softirq_irqoff() have extremely low overhead: They simply set a
bit in a per-CPU mask, which is checked upon exit from that interrupt
handler.  Therefore, if rcu_read_unlock_special() is invoked within an
interrupt handler and RCU_SOFTIRQ is in use, this commit make use of
raise_softirq_irqoff() even if there is no expedited grace period in
flight and even if this is not a nohz_full CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:48 -07:00
Paul E. McKenney
25102de65f rcu: Only do rcu_read_unlock_special() wakeups if expedited
Currently, rcu_read_unlock_special() will do wakeups whenever it is safe
to do so.  However, wakeups are expensive, and they are only really
needed when the just-ended RCU read-side critical section is blocking
an expedited grace period (in which case speed is of the essence)
or on a nohz_full CPU (where it might be a good long time before an
interrupt arrives).  This commit therefore checks for these conditions,
and does the expensive wakeups only if doing so would be useful.

Note it can be rather expensive to determine whether or not the current
task (as opposed to the current CPU) is blocking the current expedited
grace period.  Doing so requires traversing the ->blkd_tasks list, which
can be quite long.  This commit therefore cheats:  If the current task
is on a given ->blkd_tasks list, and some task on that list is blocking
the current expedited grace period, the code assumes that the current
task is blocking that expedited grace period.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:48 -07:00
Paul E. McKenney
23634ebc1d rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing.  In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().

The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler.  Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.

Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups.  However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention.  For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure.  If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held.  However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.

Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.

The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing.  Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.

This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure.  This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported.  Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention.  Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:

	rcu_read_lock();
	/* preempted */
	local_irq_disable();
	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
	rcu_read_lock();
	local_irq_enable();
	preempt_disable()
	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
	rcu_read_lock();
	preempt_enable();
	/* preempted, >deferred_qs reset. */
	local_irq_disable();
	rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */

Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't.  However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup.  This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs set flag is set.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2019-05-25 14:50:47 -07:00
Sebastian Andrzej Siewior
48d07c04b4 rcu: Enable elimination of Tree-RCU softirq processing
Some workloads need to change kthread priority for RCU core processing
without affecting other softirq work.  This commit therefore introduces
the rcutree.use_softirq kernel boot parameter, which moves the RCU core
work from softirq to a per-CPU SCHED_OTHER kthread named rcuc.  Use of
SCHED_OTHER approach avoids the scalability problems that appeared
with the earlier attempt to move RCU core processing to from softirq
to kthreads.  That said, kernels built with RCU_BOOST=y will run the
rcuc kthreads at the RCU-boosting priority.

Note that rcutree.use_softirq=0 must be specified to move RCU core
processing to the rcuc kthreads: rcutree.use_softirq=1 is the default.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Adjust for invoke_rcu_callbacks() only ever being invoked
  from RCU core processing, in contrast to softirq->rcuc transition
  in old mainline RCU priority boosting. ]
[ paulmck: Avoid wakeups when scheduler might have invoked rcu_read_unlock()
  while holding rq or pi locks, also possibly fixing a pre-existing latent
  bug involving raise_softirq()-induced wakeups. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-05-25 14:50:46 -07:00
Linus Torvalds
a2c48d98fc Tom Zanussi sent me some small fixes and cleanups to the histogram
code and I forgot to incorporate them.
 
 I also added a small clean up patch that was sent to me a while ago
 and I just noticed it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXOixqRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlPsAQCmNzno+SJMXLLIojZ80pqs9PqLqIrW
 iBopFNihRBAkxgEAle8Pvr53S4R/Hjn6j++kxBm/uWaEfICOzyqcSyqxlQA=
 =Sin+
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Tom Zanussi sent me some small fixes and cleanups to the histogram
  code and I forgot to incorporate them.

  I also added a small clean up patch that was sent to me a while ago
  and I just noticed it"

* tag 'trace-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  kernel/trace/trace.h: Remove duplicate header of trace_seq.h
  tracing: Add a check_val() check before updating cond_snapshot() track_val
  tracing: Check keys for variable references in expressions too
  tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts
2019-05-25 10:08:14 -07:00
Jiong Wang
d6c2308c74 bpf: verifier: randomize high 32-bit when BPF_F_TEST_RND_HI32 is set
This patch randomizes high 32-bit of a definition when BPF_F_TEST_RND_HI32
is set.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-24 18:58:37 -07:00
Jiong Wang
c240eff63a bpf: introduce new bpf prog load flags "BPF_F_TEST_RND_HI32"
x86_64 and AArch64 perhaps are two arches that running bpf testsuite
frequently, however the zero extension insertion pass is not enabled for
them because of their hardware support.

It is critical to guarantee the pass correction as it is supposed to be
enabled at default for a couple of other arches, for example PowerPC,
SPARC, arm, NFP etc. Therefore, it would be very useful if there is a way
to test this pass on for example x86_64.

The test methodology employed by this set is "poisoning" useless bits. High
32-bit of a definition is randomized if it is identified as not used by any
later insn. Such randomization is only enabled under testing mode which is
gated by the new bpf prog load flags "BPF_F_TEST_RND_HI32".

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-24 18:58:37 -07:00
Jiong Wang
a4b1d3c1dd bpf: verifier: insert zero extension according to analysis result
After previous patches, verifier will mark a insn if it really needs zero
extension on dst_reg.

It is then for back-ends to decide how to use such information to eliminate
unnecessary zero extension code-gen during JIT compilation.

One approach is verifier insert explicit zero extension for those insns
that need zero extension in a generic way, JIT back-ends then do not
generate zero extension for sub-register write at default.

However, only those back-ends which do not have hardware zero extension
want this optimization. Back-ends like x86_64 and AArch64 have hardware
zero extension support that the insertion should be disabled.

This patch introduces new target hook "bpf_jit_needs_zext" which returns
false at default, meaning verifier zero extension insertion is disabled at
default. A back-end could override this hook to return true if it doesn't
have hardware support and want verifier insert zero extension explicitly.

Offload targets do not use this native target hook, instead, they could
get the optimization results using bpf_prog_offload_ops.finalize.

NOTE: arches could have diversified features, it is possible for one arch
to have hardware zero extension support for some sub-register write insns
but not for all. For example, PowerPC, SPARC have zero extended loads, but
not for alu32. So when verifier zero extension insertion enabled, these JIT
back-ends need to peephole insns to remove those zero extension inserted
for insn that actually has hardware zero extension support. The peephole
could be as simple as looking the next insn, if it is a special zero
extension insn then it is safe to eliminate it if the current insn has
hardware zero extension support.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-24 18:58:37 -07:00
Jiong Wang
b325fbca4b bpf: verifier: mark patched-insn with sub-register zext flag
Patched insns do not go through generic verification, therefore doesn't has
zero extension information collected during insn walking.

We don't bother analyze them at the moment, for any sub-register def comes
from them, just conservatively mark it as needing zero extension.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-24 18:58:37 -07:00
Jiong Wang
5327ed3d44 bpf: verifier: mark verified-insn with sub-register zext flag
eBPF ISA specification requires high 32-bit cleared when low 32-bit
sub-register is written. This applies to destination register of ALU32 etc.
JIT back-ends must guarantee this semantic when doing code-gen. x86_64 and
AArch64 ISA has the same semantics, so the corresponding JIT back-end
doesn't need to do extra work.

However, 32-bit arches (arm, x86, nfp etc.) and some other 64-bit arches
(PowerPC, SPARC etc) need to do explicit zero extension to meet this
requirement, otherwise code like the following will fail.

  u64_value = (u64) u32_value
  ... other uses of u64_value

This is because compiler could exploit the semantic described above and
save those zero extensions for extending u32_value to u64_value, these JIT
back-ends are expected to guarantee this through inserting extra zero
extensions which however could be a significant increase on the code size.
Some benchmarks show there could be ~40% sub-register writes out of total
insns, meaning at least ~40% extra code-gen.

One observation is these extra zero extensions are not always necessary.
Take above code snippet for example, it is possible u32_value will never be
casted into a u64, the value of high 32-bit of u32_value then could be
ignored and extra zero extension could be eliminated.

This patch implements this idea, insns defining sub-registers will be
marked when the high 32-bit of the defined sub-register matters. For
those unmarked insns, it is safe to eliminate high 32-bit clearnace for
them.

Algo:
 - Split read flags into READ32 and READ64.

 - Record index of insn that does sub-register write. Keep the index inside
   reg state and update it during verifier insn walking.

 - A full register read on a sub-register marks its definition insn as
   needing zero extension on dst register.

   A new sub-register write overrides the old one.

 - When propagating read64 during path pruning, also mark any insn defining
   a sub-register that is read in the pruned path as full-register.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-24 18:58:37 -07:00
Linus Torvalds
86c2f5d653 SPDX update for 5.2-rc2, round 2
Here is another set of reviewed patches that adds SPDX tags to different
 kernel files, based on a set of rules that are being used to parse the
 comments to try to determine that the license of the file is
 "GPL-2.0-or-later".  Only the "obvious" versions of these matches are
 included here, a number of "non-obvious" variants of text have been
 found but those have been postponed for later review and analysis.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOgmlw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yk4rACfRqxGOGVLR/t6E9dDzOZRAdEz/mYAoJLZmziY
 0YlSSSPtP5HI6JDh65Ng
 =HXQb
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pule more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later".

  Only the "obvious" versions of these matches are included here, a
  number of "non-obvious" variants of text have been found but those
  have been postponed for later review and analysis.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (85 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 125
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 123
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 122
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 121
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 120
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 119
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 118
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 116
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 113
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 112
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 111
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 110
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 106
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 105
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 103
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 102
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
  ...
2019-05-24 14:31:58 -07:00
Yonghong Song
8b401f9ed2 bpf: implement bpf_send_signal() helper
This patch tries to solve the following specific use case.

Currently, bpf program can already collect stack traces
through kernel function get_perf_callchain()
when certain events happens (e.g., cache miss counter or
cpu clock counter overflows). But such stack traces are
not enough for jitted programs, e.g., hhvm (jited php).
To get real stack trace, jit engine internal data structures
need to be traversed in order to get the real user functions.

bpf program itself may not be the best place to traverse
the jit engine as the traversing logic could be complex and
it is not a stable interface either.

Instead, hhvm implements a signal handler,
e.g. for SIGALARM, and a set of program locations which
it can dump stack traces. When it receives a signal, it will
dump the stack in next such program location.

Such a mechanism can be implemented in the following way:
  . a perf ring buffer is created between bpf program
    and tracing app.
  . once a particular event happens, bpf program writes
    to the ring buffer and the tracing app gets notified.
  . the tracing app sends a signal SIGALARM to the hhvm.

But this method could have large delays and causing profiling
results skewed.

This patch implements bpf_send_signal() helper to send
a signal to hhvm in real time, resulting in intended stack traces.

Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-24 23:26:47 +02:00
Waiman Long
51816e9e11 locking/lock_events: Use this_cpu_add() when necessary
The kernel test robot has reported that the use of __this_cpu_add()
causes bug messages like:

  BUG: using __this_cpu_add() in preemptible [00000000] code: ...

Given the imprecise nature of the count and the possibility of resetting
the count and doing the measurement again, this is not really a big
problem to use the unprotected __this_cpu_*() functions.

To make the preemption checking code happy, the this_cpu_*() functions
will be used if CONFIG_DEBUG_PREEMPT is defined.

The imprecise nature of the locking counts are also documented with
the suggestion that we should run the measurement a few times with the
counts reset in between to get a better picture of what is going on
under the hood.

Fixes: a8654596f0 ("locking/rwsem: Enable lock event counting")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-24 14:17:18 -07:00
Joel Fernandes (Google)
1457dc9ed8 kheaders: Do not regenerate archive if config is not changed
Linus reported an issue that doing an allmodconfig was causing the
kheaders archive to be regenerated even though the config is the same.
This patch fixes the issue by ignoring the config-related header files
for "knowing when to regenerate based on timestamps".  Instead, if the
CONFIG_X_Y option really changes, then we there are the
include/config/X/Y.h which will already tells us "if a config really
changed". So we don't really need these files for regeneration detection
anyway, and ignoring them fixes Linus's issue.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 20:16:01 +02:00
Joel Fernandes (Google)
f7b101d330 kheaders: Move from proc to sysfs
The kheaders archive consisting of the kernel headers used for compiling
bpf programs is in /proc. However there is concern that moving it here
will make it permanent. Let us move it to /sys/kernel as discussed [1].

[1] https://lore.kernel.org/patchwork/patch/1067310/#1265969

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 20:16:01 +02:00
Thomas Gleixner
6ff3f917e0 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 38
Based on 1 normalized pattern(s):

  this file is released under the gplv2 and any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170857.732920462@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:11 +02:00
Thomas Gleixner
b4d0d230cc treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public licence as published by
  the free software foundation either version 2 of the licence or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 114 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170857.552531963@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:27:11 +02:00
Anders Roxell
c0090c4c85 locking/lockdep: Remove the unused print_lock_trace() function
gcc warns that function print_lock_trace() is unused if
CONFIG_PROVE_LOCKING isn't set:

../kernel/locking/lockdep.c:2820:13: warning: ‘print_lock_trace’ defined but not used [-Wunused-function]

Rework so we remove the function if CONFIG_PROVE_LOCKING isn't set.

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: will.deacon@arm.com
Fixes: c120bce780 ("lockdep: Simplify stack trace handling")
Link: http://lkml.kernel.org/r/20190516191326.27003-1-anders.roxell@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:05:46 +02:00
Peter Zijlstra
5322ea58a0 perf/ring-buffer: Use regular variables for nesting
While the IRQ/NMI will nest, the nest-count will be invariant over the
actual exception, since it will decrement equal to increment.

This means we can -- carefully -- use a regular variable since the
typical LOAD-STORE race doesn't exist (similar to preempt_count).

This optimizes the ring-buffer for all LOAD-STORE architectures, since
they need to use atomic ops to implement local_t.

Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Cc: yabinc@google.com
Link: http://lkml.kernel.org/r/20190517115418.481392777@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:11 +02:00
Peter Zijlstra
4d839dd9e4 perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
We must use {READ,WRITE}_ONCE() on rb->user_page data such that
concurrent usage will see whole values. A few key sites were missing
this.

Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Fixes: 7b732a7504 ("perf_counter: new output ABI - part 1")
Link: http://lkml.kernel.org/r/20190517115418.394192145@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:11 +02:00
Peter Zijlstra
3f9fbe9bd8 perf/ring_buffer: Add ordering to rb->nest increment
Similar to how decrementing rb->next too early can cause data_head to
(temporarily) be observed to go backward, so too can this happen when
we increment too late.

This barrier() ensures the rb->head load happens after the increment,
both the one in the 'goto again' path, as the one from
perf_output_get_handle() -- albeit very unlikely to matter for the
latter.

Suggested-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: mark.rutland@arm.com
Cc: namhyung@kernel.org
Fixes: ef60777c9a ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.309516009@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:10 +02:00
Yabin Cui
1b038c6e05 perf/ring_buffer: Fix exposing a temporarily decreased data_head
In perf_output_put_handle(), an IRQ/NMI can happen in below location and
write records to the same ring buffer:

	...
	local_dec_and_test(&rb->nest)
	...                          <-- an IRQ/NMI can happen here
	rb->user_page->data_head = head;
	...

In this case, a value A is written to data_head in the IRQ, then a value
B is written to data_head after the IRQ. And A > B. As a result,
data_head is temporarily decreased from A to B. And a reader may see
data_head < data_tail if it read the buffer frequently enough, which
creates unexpected behaviors.

This can be fixed by moving dec(&rb->nest) to after updating data_head,
which prevents the IRQ/NMI above from updating data_head.

[ Split up by peterz. ]

Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mark.rutland@arm.com
Fixes: ef60777c9a ("perf: Optimize the perf_output() path by removing IRQ-disables")
Link: http://lkml.kernel.org/r/20190517115418.224478157@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 09:00:10 +02:00
Sebastian Andrzej Siewior
978315462d locking/lockdep: Don't complain about incorrect name for no validate class
It is possible to ignore the validation for a certain lock by using:

	lockdep_set_novalidate_class()

on it. Each invocation will assign a new name to the class it created
for created __lockdep_no_validate__. That means that once
lockdep_set_novalidate_class() has been used on two locks then
class->name won't match lock->name for the first lock triggering the
warning.

So ignore changed non-matching ->name pointer for the special
__lockdep_no_validate__ class.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20190517212234.32611-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-24 08:41:59 +02:00
Richard Guy Briggs
bf361231c2 audit: add saddr_fam filter field
Provide a method to filter out sockaddr and bind calls by network
address family.

Existing SOCKADDR records are listed for any network activity.
Implement the AUDIT_SADDR_FAM field selector to be able to classify or
limit records to specific network address families, such as AF_INET or
AF_INET6.

An example of a network record that is unlikely to be useful and flood
the logs:

type=SOCKADDR msg=audit(07/27/2017 12:18:27.019:845) : saddr={ fam=local
path=/var/run/nscd/socket }
type=SYSCALL msg=audit(07/27/2017 12:18:27.019:845) : arch=x86_64
syscall=connect success=no exit=ENOENT(No such file or directory) a0=0x3
a1=0x7fff229c4980 a2=0x6e a3=0x6 items=1 ppid=3301 pid=6145 auid=sgrubb
uid=sgrubb gid=sgrubb euid=sgrubb suid=sgrubb fsuid=sgrubb egid=sgrubb
sgid=sgrubb fsgid=sgrubb tty=pts3 ses=4 comm=bash exe=/usr/bin/bash
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
key=network-test

Please see the audit-testsuite PR at
https://github.com/linux-audit/audit-testsuite/pull/87
Please see the github issue
https://github.com/linux-audit/audit-kernel/issues/64
Please see the github issue for the accompanying userspace support
https://github.com/linux-audit/audit-userspace/issues/93

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: merge fuzz in auditfilter.c]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-05-23 21:07:30 -04:00
Richard Guy Briggs
ecc68904a3 audit: re-structure audit field valid checks
Multiple checks were being done in one switch case statement that
started to cause some redundancies and awkward exceptions.  Separate the
valid field and op check from the select valid values checks.

Enforce the elimination of meaningless bitwise and greater/lessthan
checks on string fields and other fields with unrelated scalar values.

Please see the github issue
https://github.com/linux-audit/audit-kernel/issues/73

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-05-23 21:05:08 -04:00
Alexei Starovoitov
dc2a4ebc0b bpf: convert explored_states to hash table
All prune points inside a callee bpf function most likely will have
different callsites. For example, if function foo() is called from
two callsites the half of explored states in all prune points in foo()
will be useless for subsequent walking of one of those callsites.
Fortunately explored_states pruning heuristics keeps the number of states
per prune point small, but walking these states is still a waste of cpu
time when the callsite of the current state is different from the callsite
of the explored state.

To improve pruning logic convert explored_states into hash table and
use simple insn_idx ^ callsite hash to select hash bucket.
This optimization has no effect on programs without bpf2bpf calls
and drastically improves programs with calls.
In the later case it reduces total memory consumption in 1M scale tests
by almost 3 times (peak_states drops from 5752 to 2016).

Care should be taken when comparing the states for equivalency.
Since the same hash bucket can now contain states with different indices
the insn_idx has to be part of verifier_state and compared.

Different hash table sizes and different hash functions were explored,
but the results were not significantly better vs this patch.
They can be improved in the future.

Hit/miss heuristic is not counting index miscompare as a miss.
Otherwise verifier stats become unstable when experimenting
with different hash functions.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-24 01:46:22 +02:00
Alexei Starovoitov
a8f500af0c bpf: split explored_states
split explored_states into prune_point boolean mark
and link list of explored states.
This removes STATE_LIST_MARK hack and allows marks to be separate from states.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-24 01:46:22 +02:00
Alexei Starovoitov
5d83902167 bpf: cleanup explored_states
clean up explored_states to prep for introduction of hashtable
No functional changes.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-24 01:46:21 +02:00
Alexei Starovoitov
b285fcb760 bpf: bump jmp sequence limit
The limit of 1024 subsequent jumps was causing otherwise valid
programs to be rejected. Bump it to 8192 and make the error more verbose.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-23 16:20:57 +02:00
Eric W. Biederman
7a0cf09494 signal: Correct namespace fixups of si_pid and si_uid
The function send_signal was split from __send_signal so that it would
be possible to bypass the namespace logic based upon current[1].  As it
turns out the si_pid and the si_uid fixup are both inappropriate in
the case of kill_pid_usb_asyncio so move that logic into send_signal.

It is difficult to arrange but possible for a signal with an si_code
of SI_TIMER or SI_SIGIO to be sent across namespace boundaries.  In
which case tests for when it is ok to change si_pid and si_uid based
on SI_FROMUSER are incorrect.  Replace the use of SI_FROMUSER with a
new test has_si_pid_and_used based on siginfo_layout.

Now that the uid fixup is no longer present after expanding
SEND_SIG_NOINFO properly calculate the si_uid that the target
task needs to read.

[1] 7978b567d3 ("signals: add from_ancestor_ns parameter to send_signal()")
Cc: stable@vger.kernel.org
Fixes: 6588c1e3ff ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
Fixes: 6b550f9495 ("user namespace: make signal.c respect user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-22 17:02:16 -05:00
Eric W. Biederman
70f1b0d34b signal/usb: Replace kill_pid_info_as_cred with kill_pid_usb_asyncio
The usb support for asyncio encoded one of it's values in the wrong
field.  It should have used si_value but instead used si_addr which is
not present in the _rt union member of struct siginfo.

The practical result of this is that on a 64bit big endian kernel
when delivering a signal to a 32bit process the si_addr field
is set to NULL, instead of the expected pointer value.

This issue can not be fixed in copy_siginfo_to_user32 as the usb
usage of the the _sigfault (aka si_addr) member of the siginfo
union when SI_ASYNCIO is set is incompatible with the POSIX and
glibc usage of the _rt member of the siginfo union.

Therefore replace kill_pid_info_as_cred with kill_pid_usb_asyncio a
dedicated function for this one specific case.  There are no other
users of kill_pid_info_as_cred so this specialization should have no
impact on the amount of code in the kernel.  Have kill_pid_usb_asyncio
take instead of a siginfo_t which is difficult and error prone, 3
arguments, a signal number, an errno value, and an address enconded as
a sigval_t.  The encoding of the address as a sigval_t allows the
code that reads the userspace request for a signal to handle this
compat issue along with all of the other compat issues.

Add BUILD_BUG_ONs in kernel/signal.c to ensure that we can now place
the pointer value at the in si_pid (instead of si_addr).  That is the
code now verifies that si_pid and si_addr always occur at the same
location.  Further the code veries that for native structures a value
placed in si_pid and spilling into si_uid will appear in userspace in
si_addr (on a byte by byte copy of siginfo or a field by field copy of
siginfo).  The code also verifies that for a 64bit kernel and a 32bit
userspace the 32bit pointer will fit in si_pid.

I have used the usbsig.c program below written by Alan Stern and
slightly tweaked by me to run on a big endian machine to verify the
issue exists (on sparc64) and to confirm the patch below fixes the issue.

 /* usbsig.c -- test USB async signal delivery */

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <string.h>
 #include <sys/ioctl.h>
 #include <unistd.h>
 #include <endian.h>
 #include <linux/usb/ch9.h>
 #include <linux/usbdevice_fs.h>

 static struct usbdevfs_urb urb;
 static struct usbdevfs_disconnectsignal ds;
 static volatile sig_atomic_t done = 0;

 void urb_handler(int sig, siginfo_t *info , void *ucontext)
 {
 	printf("Got signal %d, signo %d errno %d code %d addr: %p urb: %p\n",
 	       sig, info->si_signo, info->si_errno, info->si_code,
 	       info->si_addr, &urb);

 	printf("%s\n", (info->si_addr == &urb) ? "Good" : "Bad");
 }

 void ds_handler(int sig, siginfo_t *info , void *ucontext)
 {
 	printf("Got signal %d, signo %d errno %d code %d addr: %p ds: %p\n",
 	       sig, info->si_signo, info->si_errno, info->si_code,
 	       info->si_addr, &ds);

 	printf("%s\n", (info->si_addr == &ds) ? "Good" : "Bad");
 	done = 1;
 }

 int main(int argc, char **argv)
 {
 	char *devfilename;
 	int fd;
 	int rc;
 	struct sigaction act;
 	struct usb_ctrlrequest *req;
 	void *ptr;
 	char buf[80];

 	if (argc != 2) {
 		fprintf(stderr, "Usage: usbsig device-file-name\n");
 		return 1;
 	}

 	devfilename = argv[1];
 	fd = open(devfilename, O_RDWR);
 	if (fd == -1) {
 		perror("Error opening device file");
 		return 1;
 	}

 	act.sa_sigaction = urb_handler;
 	sigemptyset(&act.sa_mask);
 	act.sa_flags = SA_SIGINFO;

 	rc = sigaction(SIGUSR1, &act, NULL);
 	if (rc == -1) {
 		perror("Error in sigaction");
 		return 1;
 	}

 	act.sa_sigaction = ds_handler;
 	sigemptyset(&act.sa_mask);
 	act.sa_flags = SA_SIGINFO;

 	rc = sigaction(SIGUSR2, &act, NULL);
 	if (rc == -1) {
 		perror("Error in sigaction");
 		return 1;
 	}

 	memset(&urb, 0, sizeof(urb));
 	urb.type = USBDEVFS_URB_TYPE_CONTROL;
 	urb.endpoint = USB_DIR_IN | 0;
 	urb.buffer = buf;
 	urb.buffer_length = sizeof(buf);
 	urb.signr = SIGUSR1;

 	req = (struct usb_ctrlrequest *) buf;
 	req->bRequestType = USB_DIR_IN | USB_TYPE_STANDARD | USB_RECIP_DEVICE;
 	req->bRequest = USB_REQ_GET_DESCRIPTOR;
 	req->wValue = htole16(USB_DT_DEVICE << 8);
 	req->wIndex = htole16(0);
 	req->wLength = htole16(sizeof(buf) - sizeof(*req));

 	rc = ioctl(fd, USBDEVFS_SUBMITURB, &urb);
 	if (rc == -1) {
 		perror("Error in SUBMITURB ioctl");
 		return 1;
 	}

 	rc = ioctl(fd, USBDEVFS_REAPURB, &ptr);
 	if (rc == -1) {
 		perror("Error in REAPURB ioctl");
 		return 1;
 	}

 	memset(&ds, 0, sizeof(ds));
 	ds.signr = SIGUSR2;
 	ds.context = &ds;
 	rc = ioctl(fd, USBDEVFS_DISCSIGNAL, &ds);
 	if (rc == -1) {
 		perror("Error in DISCSIGNAL ioctl");
 		return 1;
 	}

 	printf("Waiting for usb disconnect\n");
 	while (!done) {
 		sleep(1);
 	}

 	close(fd);
 	return 0;
 }

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-usb@vger.kernel.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.com>
Fixes: v2.3.39
Cc: stable@vger.kernel.org
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-22 16:53:48 -05:00
Jagadeesh Pagadala
4eebe38a37 kernel/trace/trace.h: Remove duplicate header of trace_seq.h
Remove duplicate header which is included twice.

Link: http://lkml.kernel.org/r/1553725186-41442-1-git-send-email-jagdsh.linux@gmail.com

Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Jagadeesh Pagadala <jagdsh.linux@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-22 15:37:41 -04:00
David Howells
2e21865faf keys: sparse: Fix key_fs[ug]id_changed()
Sparse warnings are incurred by key_fs[ug]id_changed() due to unprotected
accesses of tsk->cred, which is marked __rcu.

Fix this by passing the new cred struct to these functions from
commit_creds() rather than the task pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
2019-05-22 14:06:51 +01:00
Richard Guy Briggs
b48345aafb audit: deliver signal_info regarless of syscall
When a process signals the audit daemon (shutdown, rotate, resume,
reconfig) but syscall auditing is not enabled, we still want to know the
identity of the process sending the signal to the audit daemon.

Move audit_signal_info() out of syscall auditing to general auditing but
create a new function audit_signal_info_syscall() to take care of the
syscall dependent parts for when syscall auditing is enabled.

Please see the github kernel audit issue
https://github.com/linux-audit/audit-kernel/issues/111

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-05-21 22:18:25 -04:00
Linus Torvalds
2c1212de6f SPDX update for 5.2-rc2, round 1
Here are series of patches that add SPDX tags to different kernel files,
 based on two different things:
   - SPDX entries are added to a bunch of files that we missed a year ago
     that do not have any license information at all.
 
     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the last
     big sweep, or they were Makefile/Kconfig files, which we didn't
     touch last time.
 
   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself.  Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 The reason for these "large" patches is if we were to continue to
 progress at the current rate of change in the kernel, adding license
 tags to individual files in different subsystems, we would be finished
 in about 10 years at the earliest.
 
 There will be more series of these types of patches coming over the next
 few weeks as the tools and reviewers crunch through the more "odd"
 variants of how to say "GPLv2" that developers have come up with over
 the years, combined with other fun oddities (GPL + a BSD disclaimer?)
 that are being unearthed, with the goal for the whole kernel to be
 cleaned up.
 
 These diffstats are not small, 3840 files are touched, over 10k lines
 removed in just 24 patches.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOP8uw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynmGQCgy3evqzleuOITDpuWaxewFdHqiJYAnA7KRw4H
 1KwtfRnMtG6dk/XaS7H7
 =O9lH
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull SPDX update from Greg KH:
 "Here is a series of patches that add SPDX tags to different kernel
  files, based on two different things:

   - SPDX entries are added to a bunch of files that we missed a year
     ago that do not have any license information at all.

     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the
     last big sweep, or they were Makefile/Kconfig files, which we
     didn't touch last time.

   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself. Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers.

  The reason for these "large" patches is if we were to continue to
  progress at the current rate of change in the kernel, adding license
  tags to individual files in different subsystems, we would be finished
  in about 10 years at the earliest.

  There will be more series of these types of patches coming over the
  next few weeks as the tools and reviewers crunch through the more
  "odd" variants of how to say "GPLv2" that developers have come up with
  over the years, combined with other fun oddities (GPL + a BSD
  disclaimer?) that are being unearthed, with the goal for the whole
  kernel to be cleaned up.

  These diffstats are not small, 3840 files are touched, over 10k lines
  removed in just 24 patches"

* tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
  ...
2019-05-21 12:33:38 -07:00
Tom Zanussi
9b2ca371b1 tracing: Add a check_val() check before updating cond_snapshot() track_val
Without this check a snapshot is taken whenever a bucket's max is hit,
rather than only when the global max is hit, as it should be.

Before:

  In this example, we do a first run of the workload (cyclictest),
  examine the output, note the max ('triggering value') (347), then do
  a second run and note the max again.

  In this case, the max in the second run (39) is below the max in the
  first run, but since we haven't cleared the histogram, the first max
  is still in the histogram and is higher than any other max, so it
  should still be the max for the snapshot.  It isn't however - the
  value should still be 347 after the second run.

  # echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
  # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_prio,next_comm,prev_pid,prev_prio,prev_comm):onmax($wakeup_lat).snapshot() if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger

  # cyclictest -p 80 -n -s -t 2 -D 2

  # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist

  { next_pid:       2143 } hitcount:        199
    max:         44  next_prio:        120  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/4

  { next_pid:       2145 } hitcount:       1325
    max:         38  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/2

  { next_pid:       2144 } hitcount:       1982
    max:        347  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/6

  Snapshot taken (see tracing/snapshot).  Details:
      triggering value { onmax($wakeup_lat) }:        347
      triggered by event with key: { next_pid:       2144 }

  # cyclictest -p 80 -n -s -t 2 -D 2

  # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist

  { next_pid:       2143 } hitcount:        199
    max:         44  next_prio:        120  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/4

  { next_pid:       2148 } hitcount:        199
    max:         16  next_prio:        120  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/1

  { next_pid:       2145 } hitcount:       1325
    max:         38  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/2

  { next_pid:       2150 } hitcount:       1326
    max:         39  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/4

  { next_pid:       2144 } hitcount:       1982
    max:        347  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/6

  { next_pid:       2149 } hitcount:       1983
    max:        130  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/0

  Snapshot taken (see tracing/snapshot).  Details:
    triggering value { onmax($wakeup_lat) }:    39
    triggered by event with key: { next_pid:       2150 }

After:

  In this example, we do a first run of the workload (cyclictest),
  examine the output, note the max ('triggering value') (375), then do
  a second run and note the max again.

  In this case, the max in the second run is still 375, the highest in
  any bucket, as it should be.

  # cyclictest -p 80 -n -s -t 2 -D 2

  # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist

  { next_pid:       2072 } hitcount:        200
    max:         28  next_prio:        120  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/5

  { next_pid:       2074 } hitcount:       1323
    max:        375  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/2

  { next_pid:       2073 } hitcount:       1980
    max:        153  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/6

  Snapshot taken (see tracing/snapshot).  Details:
    triggering value { onmax($wakeup_lat) }:        375
    triggered by event with key: { next_pid:       2074 }

  # cyclictest -p 80 -n -s -t 2 -D 2

  # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist

  { next_pid:       2101 } hitcount:        199
    max:         49  next_prio:        120  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/6

  { next_pid:       2072 } hitcount:        200
    max:         28  next_prio:        120  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/5

  { next_pid:       2074 } hitcount:       1323
    max:        375  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/2

  { next_pid:       2103 } hitcount:       1325
    max:         74  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/4

  { next_pid:       2073 } hitcount:       1980
    max:        153  next_prio:         19  next_comm: cyclictest
    prev_pid:          0  prev_prio:        120  prev_comm: swapper/6

  { next_pid:       2102 } hitcount:       1981
    max:         84  next_prio:         19  next_comm: cyclictest
    prev_pid:         12  prev_prio:        120  prev_comm: kworker/0:1

  Snapshot taken (see tracing/snapshot).  Details:
    triggering value { onmax($wakeup_lat) }:        375
    triggered by event with key: { next_pid:       2074 }

Link: http://lkml.kernel.org/r/95958351329f129c07504b4d1769c47a97b70d65.1555597045.git.tom.zanussi@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: a3785b7eca ("tracing: Add hist trigger snapshot() action")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-21 12:48:07 -04:00
Tom Zanussi
c8d94a1878 tracing: Check keys for variable references in expressions too
There's an existing check for variable references in keys, but it
doesn't go far enough.  It checks whether a key field is a variable
reference but doesn't check whether it's an expression containing
variable references, which can cause the same problems for callers.

Use the existing field_has_hist_vars() function rather than a direct
top-level flag check to catch all possible variable references.

Link: http://lkml.kernel.org/r/e8c3d3d53db5ca90ceea5a46e5413103a6902fc7.1555597045.git.tom.zanussi@linux.intel.com

Cc: stable@vger.kernel.org
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-21 12:46:32 -04:00
Tom Zanussi
55267c88c0 tracing: Prevent hist_field_var_ref() from accessing NULL tracing_map_elts
hist_field_var_ref() is an implementation of hist_field_fn_t(), which
can be called with a null tracing_map_elt elt param when assembling a
key in event_hist_trigger().

In the case of hist_field_var_ref() this doesn't make sense, because a
variable can only be resolved by looking it up using an already
assembled key i.e. a variable can't be used to assemble a key since
the key is required in order to access the variable.

Upper layers should prevent the user from constructing a key using a
variable in the first place, but in case one slips through, it
shouldn't cause a NULL pointer dereference.  Also if one does slip
through, we want to know about it, so emit a one-time warning in that
case.

Link: http://lkml.kernel.org/r/64ec8dc15c14d305295b64cdfcc6b2b9dd14753f.1555597045.git.tom.zanussi@linux.intel.com

Reported-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-21 12:43:49 -04:00
Thomas Gleixner
7170066ecd treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it would be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 6 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154043.007767574@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:52:39 +02:00
Thomas Gleixner
1ccea77e2a treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not see http www gnu org licenses

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details [based]
  [from] [clk] [highbank] [c] you should have received a copy of the
  gnu general public license along with this program if not see http
  www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 355 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:45 +02:00
Thomas Gleixner
d6cd1e9b9f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not you can access it online at http www gnu
  org licenses gpl 2 0 html

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.430943677@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:40 +02:00
Thomas Gleixner
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Linus Torvalds
78e0365184 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung
    Cheng.

 3) Orphan NFC, we'll take the patches directly into my tree. From
    Johannes Berg.

 4) We can't recycle cloned TCP skbs, from Eric Dumazet.

 5) Some flow dissector bpf test fixes, from Stanislav Fomichev.

 6) Fix RCU marking and warnings in rhashtable, from Herbert Xu.

 7) Fix some potential fib6 leaks, from Eric Dumazet.

 8) Fix a _decode_session4 uninitialized memory read bug fix that got
    lost in a merge. From Florian Westphal.

 9) Fix ipv6 source address routing wrt. exception route entries, from
    Wei Wang.

10) The netdev_xmit_more() conversion was not done %100 properly in mlx5
    driver, fix from Tariq Toukan.

11) Clean up botched merge on netfilter kselftest, from Florian
    Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits)
  of_net: fix of_get_mac_address retval if compiled without CONFIG_OF
  net: fix kernel-doc warnings for socket.c
  net: Treat sock->sk_drops as an unsigned int when printing
  kselftests: netfilter: fix leftover net/net-next merge conflict
  mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
  mlxsw: core: Prevent QSFP module initialization for old hardware
  vsock/virtio: Initialize core virtio vsock before registering the driver
  net/mlx5e: Fix possible modify header actions memory leak
  net/mlx5e: Fix no rewrite fields with the same match
  net/mlx5e: Additional check for flow destination comparison
  net/mlx5e: Add missing ethtool driver info for representors
  net/mlx5e: Fix number of vports for ingress ACL configuration
  net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
  net/mlx5e: Fix wrong xmit_more application
  net/mlx5: Fix peer pf disable hca command
  net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
  net/mlx5: Add meaningful return codes to status_to_err function
  net/mlx5: Imply MLXFW in mlx5_core
  Revert "tipc: fix modprobe tipc failed after switch order of device registration"
  vsock/virtio: free packets during the socket release
  ...
2019-05-20 08:21:07 -07:00
Linus Torvalds
cb6f8739fb Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
 "A few final bits:

   - large changes to vmalloc, yielding large performance benefits

   - tweak the console-flush-on-panic code

   - a few fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  panic: add an option to replay all the printk message in buffer
  initramfs: don't free a non-existent initrd
  fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
  mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
  mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
  mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
  mm/vmalloc.c: keep track of free blocks for vmap allocation
2019-05-19 12:15:32 -07:00
Linus Torvalds
d9351ea14d Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull IRQ chip updates from Ingo Molnar:
 "A late irqchips update:

   - New TI INTR/INTA set of drivers

   - Rewrite of the stm32mp1-exti driver as a platform driver

   - Update the IOMMU MSI mapping API to be RT friendly

   - A number of cleanups and other low impact fixes"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  iommu/dma-iommu: Remove iommu_dma_map_msi_msg()
  irqchip/gic-v3-mbi: Don't map the MSI page in mbi_compose_m{b, s}i_msg()
  irqchip/ls-scfg-msi: Don't map the MSI page in ls_scfg_msi_compose_msg()
  irqchip/gic-v3-its: Don't map the MSI page in its_irq_compose_msi_msg()
  irqchip/gicv2m: Don't map the MSI page in gicv2m_compose_msi_msg()
  iommu/dma-iommu: Split iommu_dma_map_msi_msg() in two parts
  genirq/msi: Add a new field in msi_desc to store an IOMMU cookie
  arm64: arch_k3: Enable interrupt controller drivers
  irqchip/ti-sci-inta: Add msi domain support
  soc: ti: Add MSI domain bus support for Interrupt Aggregator
  irqchip/ti-sci-inta: Add support for Interrupt Aggregator driver
  dt-bindings: irqchip: Introduce TISCI Interrupt Aggregator bindings
  irqchip/ti-sci-intr: Add support for Interrupt Router driver
  dt-bindings: irqchip: Introduce TISCI Interrupt router bindings
  gpio: thunderx: Use the default parent apis for {request,release}_resources
  genirq: Introduce irq_chip_{request,release}_resource_parent() apis
  firmware: ti_sci: Add helper apis to manage resources
  firmware: ti_sci: Add RM mapping table for am654
  firmware: ti_sci: Add support for IRQ management
  firmware: ti_sci: Add support for RM core ops
  ...
2019-05-19 10:58:45 -07:00
Joe Lawrence
7eaf51a2e0 stacktrace: Unbreak stack_trace_save_tsk_reliable()
Miroslav reported that the livepatch self-tests were failing, specifically
a case in which the consistency model ensures that a current executing
function is not allowed to be patched, "TEST: busy target module".

Recent renovations of stack_trace_save_tsk_reliable() left it returning
only an -ERRNO success indication in some configuration combinations:

  klp_check_stack()
    ret = stack_trace_save_tsk_reliable()
      #ifdef CONFIG_ARCH_STACKWALK && CONFIG_HAVE_RELIABLE_STACKTRACE
        stack_trace_save_tsk_reliable()
          ret = arch_stack_walk_reliable()
            return 0
            return -EINVAL
          ...
          return ret;
    ...
    if (ret < 0)
      /* stack_trace_save_tsk_reliable error */
    nr_entries = ret;                               << 0

Previously (and currently for !CONFIG_ARCH_STACKWALK &&
CONFIG_HAVE_RELIABLE_STACKTRACE) stack_trace_save_tsk_reliable() returned
the number of entries that it consumed in the passed storage array.

In the case of the above config and trace, be sure to return the
stacktrace_cookie.len on stack_trace_save_tsk_reliable() success.

Fixes: 25e39e32b0 ("livepatch: Simplify stack trace retrieval")
Reported-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: live-patching@vger.kernel.org
Cc: jikos@kernel.org
Cc: pmladek@suse.com
Link: https://lkml.kernel.org/r/20190517185117.24642-1-joe.lawrence@redhat.com
2019-05-19 11:43:22 +02:00
Feng Tang
de6da1e8bc panic: add an option to replay all the printk message in buffer
Currently on panic, kernel will lower the loglevel and print out pending
printk msg only with console_flush_on_panic().

Add an option for users to configure the "panic_print" to replay all
dmesg in buffer, some of which they may have never seen due to the
loglevel setting, which will help panic debugging .

[feng.tang@intel.com: keep the original console_flush_on_panic() inside panic()]
  Link: http://lkml.kernel.org/r/1556199137-14163-1-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: use logbuf lock to protect the console log index]
  Link: http://lkml.kernel.org/r/1556269868-22654-1-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1556095872-36838-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Aaro Koskinen <aaro.koskinen@nokia.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-18 15:52:26 -07:00
Linus Torvalds
5f3ab27b9e Merge branch 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
 "The cgroup2 freezer pulled in this cycle broke strace. This pull
  request includes a workaround for the problem.

  It's not a complete fix in that it may cause spurious frozen state
  flip-flops which is fairly minor. Will push a full fix once it's
  ready"

* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  signal: unconditionally leave the frozen state in ptrace_stop()
2019-05-16 19:01:23 -07:00
Chenbo Feng
e547ff3f80 bpf: relax inode permission check for retrieving bpf program
For iptable module to load a bpf program from a pinned location, it
only retrieve a loaded program and cannot change the program content so
requiring a write permission for it might not be necessary.
Also when adding or removing an unrelated iptable rule, it might need to
flush and reload the xt_bpf related rules as well and triggers the inode
permission check. It might be better to remove the write premission
check for the inode so we won't need to grant write access to all the
processes that flush and restore iptables rules.

Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-16 11:31:49 -07:00
Linus Torvalds
b2c3dda6f8 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull time fixes from Ingo Molnar:
 "A TIA adjtimex interface extension, and a POSIX compliance ABI fix for
  timespec64 users"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ntp: Allow TAI-UTC offset to be set to zero
  y2038: Make CONFIG_64BIT_TIME unconditional
2019-05-16 11:00:20 -07:00
Linus Torvalds
f57d7715d7 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
 "A single rwsem fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Prevent decrement of reader count before increment
2019-05-16 10:54:19 -07:00
Roman Gushchin
05b2892637 signal: unconditionally leave the frozen state in ptrace_stop()
Alex Xu reported a regression in strace, caused by the introduction of
the cgroup v2 freezer. The regression can be reproduced by stracing
the following simple program:

  #include <unistd.h>

  int main() {
      write(1, "a", 1);
      return 0;
  }

An attempt to run strace ./a.out leads to the infinite loop:
  [ pre-main omitted ]
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  write(1, "a", 1)                        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
  [ repeats forever ]

The problem occurs because the traced task leaves ptrace_stop()
(and the signal handling loop) with the frozen bit set. So let's
call cgroup_leave_frozen(true) unconditionally after sleeping
in ptrace_stop().

With this patch applied, strace works as expected:
  [ pre-main omitted ]
  write(1, "a", 1)                        = 1
  exit_group(0)                           = ?
  +++ exited with 0 +++

Reported-by: Alex Xu <alex_y_xu@yahoo.ca>
Fixes: 76f969e894 ("cgroup: cgroup v2 freezer")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-16 10:43:58 -07:00
David S. Miller
c7d5ec26ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-05-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a use after free in __dev_map_entry_free(), from Eric.

2) Several sockmap related bug fixes: a splat in strparser if
   it was never initialized, remove duplicate ingress msg list
   purging which can race, fix msg->sg.size accounting upon
   skb to msg conversion, and last but not least fix a timeout
   bug in tcp_bpf_wait_data(), from John.

3) Fix LRU map to avoid messing with eviction heuristics upon
   syscall lookup, e.g. map walks from user space side will
   then lead to eviction of just recently created entries on
   updates as it would mark all map entries, from Daniel.

4) Don't bail out when libbpf feature probing fails. Also
   various smaller fixes to flow_dissector test, from Stanislav.

5) Fix missing brackets for BTF_INT_OFFSET() in UAPI, from Gary.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-15 18:28:44 -07:00
Linus Torvalds
d2d8b14604 The major changes in this tracing update includes:
- Removing of non-DYNAMIC_FTRACE from 32bit x86
 
  - Removing of mcount support from x86
 
  - Emulating a call from int3 on x86_64, fixes live kernel patching
 
  - Consolidated Tracing Error logs file
 
 Minor updates:
 
  - Removal of klp_check_compiler_support()
 
  - kdb ftrace dumping output changes
 
  - Accessing and creating ftrace instances from inside the kernel
 
  - Clean up of #define if macro
 
  - Introduction of TRACE_EVENT_NOP() to disable trace events based on config
    options
 
 And other minor fixes and clean ups
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXNxMZxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qq4PAP44kP6VbwL8CHyI2A3xuJ6Hwxd+2Z2r
 ip66RtzyJ+2iCgEA2QCuWUlEt2bLpF9a8IQ4N9tWenSeW2i7gunPb+tioQw=
 =RVQo
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The major changes in this tracing update includes:

   - Removal of non-DYNAMIC_FTRACE from 32bit x86

   - Removal of mcount support from x86

   - Emulating a call from int3 on x86_64, fixes live kernel patching

   - Consolidated Tracing Error logs file

  Minor updates:

   - Removal of klp_check_compiler_support()

   - kdb ftrace dumping output changes

   - Accessing and creating ftrace instances from inside the kernel

   - Clean up of #define if macro

   - Introduction of TRACE_EVENT_NOP() to disable trace events based on
     config options

  And other minor fixes and clean ups"

* tag 'trace-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
  x86: Hide the int3_emulate_call/jmp functions from UML
  livepatch: Remove klp_check_compiler_support()
  ftrace/x86: Remove mcount support
  ftrace/x86_32: Remove support for non DYNAMIC_FTRACE
  tracing: Simplify "if" macro code
  tracing: Fix documentation about disabling options using trace_options
  tracing: Replace kzalloc with kcalloc
  tracing: Fix partial reading of trace event's id file
  tracing: Allow RCU to run between postponed startup tests
  tracing: Fix white space issues in parse_pred() function
  tracing: Eliminate const char[] auto variables
  ring-buffer: Fix mispelling of Calculate
  tracing: probeevent: Fix to make the type of $comm string
  tracing: probeevent: Do not accumulate on ret variable
  tracing: uprobes: Re-enable $comm support for uprobe events
  ftrace/x86_64: Emulate call function while updating in breakpoint handler
  x86_64: Allow breakpoints to emulate call instructions
  x86_64: Add gap to int3 to allow for call emulation
  tracing: kdb: Allow ftdump to skip all but the last few entries
  tracing: Add trace_total_entries() / trace_total_entries_cpu()
  ...
2019-05-15 16:05:47 -07:00
Stephen Rothwell
89963adcdb kernel/compat.c: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

This patch aims to suppress 3 missing-break-in-switch false positives
on some architectures.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Gustavo A. R. Silva <gustavo@embeddedor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-15 08:16:14 -07:00
Linus Torvalds
1064d85773 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a couple of hotfixes

 - almost all of the rest of MM

 - lib/ updates

 - binfmt_elf updates

 - autofs updates

 - quite a lot of misc fixes and updates
    - reiserfs, fatfs
    - signals
    - exec
    - cpumask
    - rapidio
    - sysctl
    - pids
    - eventfd
    - gcov
    - panic
    - pps

 - gdb script updates

 - ipc updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (126 commits)
  mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
  mm: memcontrol: fix recursive statistics correctness & scalabilty
  mm: memcontrol: move stat/event counting functions out-of-line
  mm: memcontrol: make cgroup stats and events query API explicitly local
  drivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl
  drivers/virt/fsl_hypervisor.c: dereferencing error pointers in ioctl
  mm, memcg: rename ambiguously named memory.stat counters and functions
  arch: remove <asm/sizes.h> and <asm-generic/sizes.h>
  treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h>
  fs/block_dev.c: Remove duplicate header
  fs/cachefiles/namei.c: remove duplicate header
  include/linux/sched/signal.h: replace `tsk' with `task'
  fs/coda/psdev.c: remove duplicate header
  ipc: do cyclic id allocation for the ipc object.
  ipc: conserve sequence numbers in ipcmni_extend mode
  ipc: allow boot time extension of IPCMNI from 32k to 16M
  ipc/mqueue: optimize msg_get()
  ipc/mqueue: remove redundant wq task assignment
  ipc: prevent lockup on alloc_msg and free_msg
  scripts/gdb: print cached rate in lx-clk-summary
  ...
2019-05-14 20:08:51 -07:00
Aaro Koskinen
b287a25a71 panic/reboot: allow specifying reboot_mode for panic only
Allow specifying reboot_mode for panic only.  This is needed on systems
where ramoops is used to store panic logs, and user wants to use warm
reset to preserve those, while still having cold reset on normal
reboots.

Link: http://lkml.kernel.org/r/20190322004735.27702-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Feng Tang
c39ea0b9dd panic: avoid the extra noise dmesg
When kernel panic happens, it will first print the panic call stack,
then the ending msg like:

[   35.743249] ---[ end Kernel panic - not syncing: Fatal exception
[   35.749975] ------------[ cut here ]------------

The above message are very useful for debugging.

But if system is configured to not reboot on panic, say the
"panic_timeout" parameter equals 0, it will likely print out many noisy
message like WARN() call stack for each and every CPU except the panic
one, messages like below:

	WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 set_task_cpu+0x183/0x190
	Call Trace:
	<IRQ>
	try_to_wake_up
	default_wake_function
	autoremove_wake_function
	__wake_up_common
	__wake_up_common_lock
	__wake_up
	wake_up_klogd_work_func
	irq_work_run_list
	irq_work_tick
	update_process_times
	tick_sched_timer
	__hrtimer_run_queues
	hrtimer_interrupt
	smp_apic_timer_interrupt
	apic_timer_interrupt

For people working in console mode, the screen will first show the panic
call stack, but immediately overridden by these noisy extra messages,
which makes debugging much more difficult, as the original context gets
lost on screen.

Also these noisy messages will confuse some users, as I have seen many bug
reporters posted the noisy message into bugzilla, instead of the real
panic call stack and context.

Adding a flag "suppress_printk" which gets set in panic() to avoid those
noisy messages, without changing current kernel behavior that both panic
blinking and sysrq magic key can work as is, suggested by Petr Mladek.

To verify this, make sure kernel is not configured to reboot on panic and
in console
 # echo c > /proc/sysrq-trigger
to see if console only prints out the panic call stack.

Link: http://lkml.kernel.org/r/1551430186-24169-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Greg Hackmann
e178a5beb3 gcov: clang support
LLVM uses profiling data that's deliberately similar to GCC, but has a
very different way of exporting that data.  LLVM calls llvm_gcov_init()
once per module, and provides a couple of callbacks that we can use to
ask for more data.

We care about the "writeout" callback, which in turn calls back into
compiler-rt/this module to dump all the gathered coverage data to disk:

   llvm_gcda_start_file()
     llvm_gcda_emit_function()
     llvm_gcda_emit_arcs()
     llvm_gcda_emit_function()
     llvm_gcda_emit_arcs()
     [... repeats for each function ...]
   llvm_gcda_summary_info()
   llvm_gcda_end_file()

This design is much more stateless and unstructured than gcc's, and is
intended to run at process exit.  This forces us to keep some local
state about which module we're dealing with at the moment.  On the other
hand, it also means we don't depend as much on how LLVM represents
profiling data internally.

See LLVM's lib/Transforms/Instrumentation/GCOVProfiling.cpp for more
details on how this works, particularly GCOVProfiler::emitProfileArcs(),
GCOVProfiler::insertCounterWriteout(), and GCOVProfiler::insertFlush().

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190417225328.208129-1-trong@android.com
Signed-off-by: Greg Hackmann <ghackmann@android.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Tri Vo <trong@android.com>
Co-developed-by: Nick Desaulniers <ndesaulniers@google.com>
Co-developed-by: Tri Vo <trong@android.com>
Tested-by: Trilok Soni <tsoni@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Tri Vo <trong@android.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Tested-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Greg Hackmann
826eba0d77 gcov: clang: move common GCC code into gcc_base.c
Patch series "gcov: add Clang support", v4.

This patch (of 3):

base.c contains a few callbacks specific to GCC's gcov implementation.
Move these into their own module in preparation for Clang support.

Link: http://lkml.kernel.org/r/20190318025411.98014-2-trong@android.com
Signed-off-by: Greg Hackmann <ghackmann@android.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Trilok Soni <tsoni@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Tri Vo <trong@android.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: Petri Gynther <pgynther@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Timmy Li
1fd402df45 kernel/pid.c: remove unneeded hash header file
Hash functions are not needed since idr is used now.  Let's remove hash
header file for cleanup.

Link: http://lkml.kernel.org/r/20190430053319.95913-1-scuttimmy@gmail.com
Signed-off-by: Timmy Li <scuttimmy@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Eric Sandeen
3116ad38f5 kernel/sysctl.c: fix proc_do_large_bitmap for large input buffers
Today, proc_do_large_bitmap() truncates a large write input buffer to
PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
end of the buffer.  Further, it fails to notify the caller that the
buffer was truncated, so it doesn't get called iteratively to finish the
entire input buffer.

Tell the caller if there's more work to do by adding the skipped amount
back to left/*lenp before returning.

To fix the misparsing, reset the position if we have completely consumed
a truncated buffer (or if just one char is left, which may be a "-" in a
range), and ask the caller to come back for more.

Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Christian Brauner
e260ad01f0 sysctl: return -EINVAL if val violates minmax
Currently when userspace gives us a values that overflow e.g.  file-max
and other callers of __do_proc_doulongvec_minmax() we simply ignore the
new value and leave the current value untouched.

This can be problematic as it gives the illusion that the limit has
indeed be bumped when in fact it failed.  This commit makes sure to
return EINVAL when an overflow is detected.  Please note that this is a
userspace facing change.

Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Andy Shevchenko
475dae3854 kernel/sysctl.c: switch to bitmap_zalloc()
Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.

Link: http://lkml.kernel.org/r/20190304094037.57756-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:51 -07:00
Mathieu Malaterre
b028fb6128 kernel/signal.c: annotate implicit fall through
There is a plan to build the kernel with -Wimplicit-fallthrough and this
place in the code produced a warning (W=1).

This commit remove the following warning:

  kernel/signal.c:795:13: warning: this statement may fall through [-Wimplicit-fallthrough=]

Link: http://lkml.kernel.org/r/20190114203505.17875-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:50 -07:00
Rasmus Villemoes
6c4e121fda kernel/user.c: clean up some leftover code
The out_unlock label is misleading; no unlocking happens after it, so
just return NULL directly.

Also, nothing between the kmem_cache_zalloc() that creates new and the
two key_put() can initialize new->uid_keyring or new->session_keyring,
so those calls are no-ops.

Link: http://lkml.kernel.org/r/20190424200404.9114-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00
Lin Feng
e02c9b0d65 kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing
The name clear_all_latency_tracing is misleading, in fact which only
clear per task's latency_record[], and we do have another function named
clear_global_latency_tracing which clear the global latency_record[]
buffer.

Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00
Lin Feng
0cc75888da kernel/latencytop.c: remove unnecessary checks for latencytop_enabled
1. In latencytop source codes, we only have such calling chain:

account_scheduler_latency(struct task_struct *task, int usecs, int inter)
{
        if (unlikely(latencytop_enabled)) /* the outtermost check */
                __account_scheduler_latency(task, usecs, inter);
}
__account_scheduler_latency
    account_global_scheduler_latency
        if (!latencytop_enabled)

So, the inner check for latencytop_enabled is not necessary at all.

2. In clear_all_latency_tracing and now is called
   clear_tsk_latency_tracing the check for latencytop_enabled is redundant
   and buggy to some extent.

   We have no reason to refuse clearing the /proc/$pid/latency if
   latencytop_enabled is set to 0, considering that if we use latencytop
   manually by echo 0 > /proc/sys/kernel/latencytop, then we want to clear
   /proc/$pid/latency and failed.

   Also we don't have such check in brother function
   clear_global_latency_tracing.

Notes: These changes are only visible to users who set
   CONFIG_LATENCYTOP and won't change user tool latencytop's behavior.

Link: http://lkml.kernel.org/r/20190226114602.16902-2-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00
Vasily Averin
831246570d kernel/notifier.c: double register detection
By design notifiers can be registerd once only, 2nd register attempt
called by mistake silently corrupts notifiers list.

A few years ago I investigated described problem, the host was power
cycled because of notifier list corruption.  I've prepared this patch
and applied it to the OpenVZ kernel and sent this patch but nobody
commented on it.  Later it helped us to detect a similar problem in the
OpenVz kernel.

Mistakes with notifier registration can happen for example during
subsystem initialization from different namespaces, or because of a lost
unregister in the roll-back path on initialization failures.

The proposed check cannot prevent the described problem, however it
allows us to detect its reason quickly without coredump analysis.

Link: http://lkml.kernel.org/r/04127e71-4782-9bbb-fe5a-7c01e93a99b0@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:49 -07:00
Dan Schatzberg
df5ba5be74 kernel/sched/psi.c: expose pressure metrics on root cgroup
Pressure metrics are already recorded and exposed in procfs for the
entire system, but any tool which monitors cgroup pressure has to
special case the root cgroup to read from procfs.  This patch exposes
the already recorded pressure metrics on the root cgroup.

Link: http://lkml.kernel.org/r/20190510174938.3361741-1-dschatzberg@fb.com
Signed-off-by: Dan Schatzberg <dschatzberg@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Suren Baghdasaryan
0e94682b73 psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure detection
mechanism configurable by users.  It allows users to monitor psi metrics
growth and trigger events whenever a metric raises above user-defined
threshold within user-defined time window.

Time window and threshold are both expressed in usecs.  Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state.  While system
is in the stall state psi signal growth is monitored at a rate of 10
times per tracking window.  Min window size is 500ms, therefore the min
monitoring interval is 50ms.  Max window size is 10s with monitoring
interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Suren Baghdasaryan
8af0c18af1 include/: refactor headers to allow kthread.h inclusion in psi_types.h
kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and
complaining on kthread structures not being defined because they are
defined further in the kthread.h.  Resolve this by removing psi_types.h
inclusion from the headers included from kthread.h.

Link: http://lkml.kernel.org/r/20190319235619.260832-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Suren Baghdasaryan
333f3017c5 psi: track changed states
Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.

This will be needed to detect whether polled states activated in the
monitor patch.

Link: http://lkml.kernel.org/r/20190319235619.260832-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:48 -07:00
Suren Baghdasaryan
7fc70a3999 psi: split update_stats into parts
Split update_stats into collect_percpu_times and update_averages for
collect_percpu_times to be reused later inside psi monitor.

Link: http://lkml.kernel.org/r/20190319235619.260832-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Suren Baghdasaryan
bcc78db641 psi: rename psi fields in preparation for psi trigger addition
Rename psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and for trigger-related
fields that will be added by "psi: introduce psi monitor".

[surenb@google.com: v6]
  Link: http://lkml.kernel.org/r/20190319235619.260832-4-surenb@google.com
Link: http://lkml.kernel.org/r/20190124211518.244221-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Suren Baghdasaryan
9289c5e6a7 psi: make psi_enable static
psi_enable is not used outside of psi.c, make it static.

Link: http://lkml.kernel.org/r/20190319235619.260832-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Suren Baghdasaryan
33b2d6302a psi: introduce state_mask to represent stalled psi states
Patch series "psi: pressure stall monitors", v6.

This is a respin of:
  https://lwn.net/ml/linux-kernel/20190308184311.144521-1-surenb%40google.com/

Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.

Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible.  Psi also
doesn't aggregate its averages at a high-enough frequency right now.

This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to be
notified when these are breached.

As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.

With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user.  For
example, using memory stall monitors in userspace low memory killer
daemon (lmkd) we can detect mounting pressure and kill less important
processes before device becomes visibly sluggish.  In our memory stress
testing psi memory monitors produce roughly 10x less false positives
compared to vmpressure signals.  Having ability to specify multiple
triggers for the same psi metric allows other parts of Android framework
to monitor memory state of the device and act accordingly.

The new interface is straight-forward.  The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor that defines the stall state - some or full, and the
maximum stall time over a given window of time.  E.g.:

        /* Signal when stall time exceeds 100ms of a 1s window */
        char trigger[] = "full 100000 1000000"
        fd = open("/proc/pressure/memory")
        write(fd, trigger, sizeof(trigger))
        while (poll() >= 0) {
                ...
        };
        close(fd);

When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in order
to emit event signals in a timely fashion.  Once the stalling subsides,
aggregation reverts back to normal.

The trigger is associated with the open file descriptor.  To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.

Patches 1-6 prepare the psi code for polling support.  Patch 7
implements the adaptive polling logic, the pressure growth detection
optimized for short intervals, and hooks up write() and poll() on the
pressure files.

The patches were developed in collaboration with Johannes Weiner.

This patch (of 7):

The psi monitoring patches will need to determine the same states as
record_times().  To avoid calculating them twice, maintain a state mask
that can be consulted cheaply.  Do this in a separate patch to keep the
churn in the main feature patch at a minimum.

This adds 4-byte state_mask member into psi_group_cpu struct which
results in its first cacheline-aligned part becoming 52 bytes long.  Add
explicit values to enumeration element counters that affect
psi_group_cpu struct size.

Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Andrea Arcangeli
987717e5e0 mm: change mm_update_next_owner() to update mm->owner with WRITE_ONCE
The RCU reader uses rcu_dereference() inside rcu_read_lock critical
sections, so the writer shall use WRITE_ONCE.  Just a cleanup, we still
rely on gcc to emit atomic writes in other places.

Link: http://lkml.kernel.org/r/20190325225636.11635-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Andrea Arcangeli
c3f3ce049f userfaultfd: use RCU to free the task struct when fork fails
The task structure is freed while get_mem_cgroup_from_mm() holds
rcu_read_lock() and dereferences mm->owner.

  get_mem_cgroup_from_mm()                failing fork()
  ----                                    ---
  task = mm->owner
                                          mm->owner = NULL;
                                          free(task)
  if (task) *task; /* use after free */

The fix consists in freeing the task with RCU also in the fork failure
case, exactly like it always happens for the regular exit(2) path.  That
is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm()
(left side above) effective to avoid a use after free when dereferencing
the task structure.

An alternate possible fix would be to defer the delivery of the
userfaultfd contexts to the monitor until after fork() is guaranteed to
succeed.  Such a change would require more changes because it would
create a strict ordering dependency where the uffd methods would need to
be called beyond the last potentially failing branch in order to be
safe.  This solution as opposed only adds the dependency to common code
to set mm->owner to NULL and to free the task struct that was pointed by
mm->owner with RCU, if fork ends up failing.  The userfaultfd methods
can still be called anywhere during the fork runtime and the monitor
will keep discarding orphaned "mm" coming from failed forks in userland.

This race condition couldn't trigger if CONFIG_MEMCG was set =n at build
time.

[aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal]
  Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com
Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
Fixes: 893e26e61d ("userfaultfd: non-cooperative: Add fork() event")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: zhong jiang <zhongjiang@huawei.com>
Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Andrew Morton
acb2ec3dd0 kernel/Makefile: don't assume that kernel/gen_ikh_data.sh is executable
If the user downloads and applies patch-5.1.gz using patch(1), the x bit
on kernel/gen_ikh_data.sh is not set.

  /bin/sh: 1: ./kernel/gen_ikh_data.sh: Permission denied

Fix this by using CONFIG_SHELL.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 19:52:47 -07:00
Linus Torvalds
ca4b40629f kgdb patches for 5.2-rc1
Mostly clean ups but there are also a couple of out-of-bounds accesses
 (including a potential write to the byte before a static buffer).
 
 The main changes are:
 
  * Fixes those out-of-bounds access (empty string to configure
    test module could write the byte before a buffer, high cpu counts
    could read outside of per-cpu structures).
 
  * Improvements to string handling problems picked up by new compiler
    warnings and other static checks. Most are fixing benign issues that
    can't be tickled without code changes but still reduce the wtf factor
    a little.
 
  * Tidy up the terminal output.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAlza6CEACgkQfOMlXTn3
 iKHkwA/8CmCq7UA3ZZsItHYn48Xa4YzxySErhpocsqa7U7tPh7f7ETv6NkfmjxEQ
 MGmQflKw/yU3avR6eHUfJcd4AO3+x0Vi5wR9fhevAVu3sfnjTjTjP0F2tDjWe+Y/
 TJob4/gSssmDp1+DtFvOQdOmGVz9C7xMQ2wZFjQcpAwesLLbwu+KPZDcBPpyRnl1
 zSOIUtTxHH6ay2PX57CNAWEmE4SFoUha7712GTHVHe3rykrC+CLYNeCRzUuyTSAz
 OaPTNzd/OFz+uLcsvTPothpc3wUfM4MUkuCkAVKpuMcB2D+/7WqqszUfHYWJ27bH
 oeYqRyeQaRa6COFkxZ4XZQRMBOYzzidVboTDjlTj391qq5L32tje75TtfHgCJ/p7
 vlg0QdbrHOFaDF9aXcmqLr7kNRi83NxNPhg4XHA75MRHFBGvRkMd3jmuZdShFKc7
 Yegr+pR0FJwivZz8+UcRPAdI5gdmSWLdpeB2GhtZqDk3975Qjvsy90ieG7GQnjR8
 /ewoYsFQ5/qXwyZzJ+kHxAmTMWpGYx8Kge77j5UMljPsuSuHU56vUt4ovzSiSzuX
 dTAxLRmrYi6Dlri76EHEoE+1mx301ymK4MXMz8WnNVQhnPnkCcEXx7P3neCB/wuX
 w1O2VvMQ8b4si1/M71QFcAgQjMcJz7z8wGwhfnnqFPgZmVT5n/8=
 =SsRX
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "Mostly cleanups but there are also a couple of fixes for out-of-bounds
  accesses (including a potential write to the byte before a static
  buffer).

  The main changes are:

   - Fixes to those out-of-bounds access (empty string to configure test
     module could write the byte before a buffer, high cpu counts could
     read outside of per-cpu structures).

   - Improvements to string handling problems picked up by new compiler
     warnings and other static checks. Most are fixing benign issues
     that can't be tickled without code changes but still reduce the wtf
     factor a little.

   - Tidy up the terminal output"

* tag 'kgdb-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Fix bound check compiler warning
  kdb: do a sanity check on the cpu in kdb_per_cpu()
  kdb: Get rid of broken attempt to print CCVERSION in kdb summary
  misc: kgdbts: fix out-of-bounds access in function param_set_kgdbts_var
  kdb: kdb_support: replace strcpy() by strscpy()
  gdbstub: Replace strcpy() by strscpy()
  gdbstub: mark expected switch fall-throughs
2019-05-14 13:14:10 -07:00
Linus Torvalds
280664f558 Modules updates for v5.2
Summary of modules changes for the 5.2 merge window:
 
 - Use a separate table to store symbol types instead of
   hijacking fields in struct Elf_Sym
 
 - Trivial code cleanups
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJc2W9gAAoJEMBFfjjOO8FyAqkP+wbCB42LJ8kD2H+pbxpsf7vU
 T1+5fZ3rfyLqHDc9Pzt1K6SBR04aLq8OXW3nOIQ5kmirSstiznokAExQBjeMegVJ
 14Q5IxOSQ44rBPg5ggCA6s9qNyw2fqgwKWQKXLW+VmYAHiBfIBV0mO576KWIEgE1
 vOCQPqW0ccfmjEssumEYMSePg01LATjso3PMcdJm6yI74xSGd8n872y36BleAP8M
 232L8B4fuZEuHJ1QaBnXdapKibyMX3k3WbfK9U2RzSWaGjmuBXwihCB+ZA6HCiRX
 MVwXQwh63UGCZLYkXnGMwPNgX5TTZRiPT4oIhB0tGIXkEFX/DlA+NGfuinEAYuKS
 M0zhjgjSOKYEhNy1GtirbHjNCC4ULF0OZvD/dB8InQON0/t5Duq3sOHSp9+rWMg7
 yuHy3NJGHHd65nGd7u7vXmdZXct/NVDdahKhqRA8i+HOxUBdF9vvMMnSASXufWFt
 j5GSFA5/IZKpYofho/jvJK2rmQQVkVpD9zWCZcv8BNT124W4cZ5swOOSVYOnSAJb
 Wvy1oknjF/HrKIft3UIFR0FU8uIueP0mtnna2B9SBVfZm7rM+/3+a1D/UOHaLtuO
 ++b1sIOpyApevRZcXpTnH63ZaO2tRC7KnDzaBT8FTZimEyV5KdD6Ec2MKJdi8Nkb
 Pr099LJ2i/i8rQbkjM1i
 =NO31
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull modules updates from Jessica Yu:

 - Use a separate table to store symbol types instead of hijacking
   fields in struct Elf_Sym

 - Trivial code cleanups

* tag 'modules-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: add stubs for within_module functions
  kallsyms: store type information in its own array
  vmlinux.lds.h: drop unused __vermagic
2019-05-14 10:55:54 -07:00
Daniel Borkmann
50b045a8c0 bpf, lru: avoid messing with eviction heuristics upon syscall lookup
One of the biggest issues we face right now with picking LRU map over
regular hash table is that a map walk out of user space, for example,
to just dump the existing entries or to remove certain ones, will
completely mess up LRU eviction heuristics and wrong entries such
as just created ones will get evicted instead. The reason for this
is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
system call lookup side as well. Thus upon walk, all entries are
being marked, so information of actual least recently used ones
are "lost".

In case of Cilium where it can be used (besides others) as a BPF
based connection tracker, this current behavior causes disruption
upon control plane changes that need to walk the map from user space
to evict certain entries. Discussion result from bpfconf [0] was that
we should simply just remove marking from system call side as no
good use case could be found where it's actually needed there.
Therefore this patch removes marking for regular LRU and per-CPU
flavor. If there ever should be a need in future, the behavior could
be selected via map creation flag, but due to mentioned reason we
avoid this here.

  [0] http://vger.kernel.org/bpfconf.html

Fixes: 29ba732acb ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
Fixes: 8f8449384e ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-14 10:47:29 -07:00
Daniel Borkmann
c6110222c6 bpf: add map_lookup_elem_sys_only for lookups from syscall side
Add a callback map_lookup_elem_sys_only() that map implementations
could use over map_lookup_elem() from system call side in case the
map implementation needs to handle the latter differently than from
the BPF data path. If map_lookup_elem_sys_only() is set, this will
be preferred pick for map lookups out of user space. This hook is
used in a follow-up fix for LRU map, but once development window
opens, we can convert other map types from map_lookup_elem() (here,
the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
the callback to simplify and clean up the latter.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-14 10:47:29 -07:00
Linus Torvalds
318222a35b Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things and hotfixes

 - ocfs2

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (139 commits)
  kernel/memremap.c: remove the unused device_private_entry_fault() export
  mm: delete find_get_entries_tag
  mm/huge_memory.c: make __thp_get_unmapped_area static
  mm/mprotect.c: fix compilation warning because of unused 'mm' variable
  mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
  mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
  mm/Kconfig: update "Memory Model" help text
  mm/vmscan.c: don't disable irq again when count pgrefill for memcg
  mm: memblock: make keeping memblock memory opt-in rather than opt-out
  hugetlbfs: always use address space in inode for resv_map pointer
  mm/z3fold.c: support page migration
  mm/z3fold.c: add structure for buddy handles
  mm/z3fold.c: improve compression by extending search
  mm/z3fold.c: introduce helper functions
  mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
  mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
  mm/vmscan.c: simplify shrink_inactive_list()
  fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  xen/privcmd-buf.c: convert to use vm_map_pages_zero()
  xen/gntdev.c: convert to use vm_map_pages()
  ...
2019-05-14 10:10:55 -07:00
Christoph Hellwig
640be2d1ff kernel/memremap.c: remove the unused device_private_entry_fault() export
This export has been entirely unused since it was added more than 1 1/2
years ago.

Link: http://lkml.kernel.org/r/20190429115535.12793-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:51 -07:00
Mike Rapoport
350e88bad4 mm: memblock: make keeping memblock memory opt-in rather than opt-out
Most architectures do not need the memblock memory after the page
allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.

Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after
system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
to the architectures that are still missing that option.

Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:50 -07:00
Michal Hocko
940519f0c8 mm, memory_hotplug: provide a more generic restrictions for memory hotplug
arch_add_memory, __add_pages take a want_memblock which controls whether
the newly added memory should get the sysfs memblock user API (e.g.
ZONE_DEVICE users do not want/need this interface).  Some callers even
want to control where do we allocate the memmap from by configuring
altmap.

Add a more generic hotplug context for arch_add_memory and __add_pages.
struct mhp_restrictions contains flags which contains additional features
to be enabled by the memory hotplug (MHP_MEMBLOCK_API currently) and
altmap for alternative memmap allocator.

This patch shouldn't introduce any functional change.

[akpm@linux-foundation.org: build fix]
Link: http://lkml.kernel.org/r/20190408082633.2864-3-osalvador@suse.de
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse
7269f99993 mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table.  See the
patch which introduced the events to see the rational behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse
6f4f13e8d9 mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of mmu notifier API track changes to the CPU page table and take
specific action for them.  While current API only provide range of virtual
address affected by the change, not why the changes is happening.

This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma).  Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens.  A latter patch do convert mm call path to use a more appropriate
events for each call.

This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Ira Weiny
73b0140bf0 mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality.  New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted.  This breaks the current
GUP call site convention of having the returned pages be the final
parameter.  So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Peter Xu
cefdca0a86 userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available.  By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit.  While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.

We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.

Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users.  When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.

Andrea said:

: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd.  In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.

[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:45 -07:00
Cyrill Gorcunov
a9e73998f9 kernel/sys.c: prctl: fix false positive in validate_prctl_map()
While validating new map we require the @start_data to be strictly less
than @end_data, which is fine for regular applications (this is why this
nit didn't trigger for that long).  These members are set from executable
loaders such as elf handers, still it is pretty valid to have a loadable
data section with zero size in file, in such case the start_data is equal
to end_data once kernel loader finishes.

As a result when we're trying to restore such programs the procedure fails
and the kernel returns -EINVAL.  From the image dump of a program:

 | "mm_start_code": "0x400000",
 | "mm_end_code": "0x8f5fb4",
 | "mm_start_data": "0xf1bfb0",
 | "mm_end_data": "0xf1bfb0",

Thus we need to change validate_prctl_map from strictly less to less or
equal operator use.

Link: http://lkml.kernel.org/r/20190408143554.GY1421@uranus.lan
Fixes: f606b77f1a ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:44 -07:00
Wenlin Kang
ca976bfb31 kdb: Fix bound check compiler warning
The strncpy() function may leave the destination string buffer
unterminated, better use strscpy() instead.

This fixes the following warning with gcc 8.2:

kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
kernel/debug/kdb/kdb_io.c:449:3: warning: 'strncpy' specified bound 256 equals destination size [-Wstringop-truncation]
   strncpy(kdb_prompt_str, prompt, CMD_BUFLEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Wenlin Kang <wenlin.kang@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-05-14 13:44:24 +01:00
Stanislav Fomichev
390e99cfdd bpf: mark bpf_event_notify and bpf_event_init as static
Both of them are not declared in the headers and not used outside
of bpf_trace.c file.

Fixes: a38d1107f9 ("bpf: support raw tracepoints in modules")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-14 01:27:18 +02:00
Eric Dumazet
2baae35453 bpf: devmap: fix use-after-free Read in __dev_map_entry_free
synchronize_rcu() is fine when the rcu callbacks only need
to free memory (kfree_rcu() or direct kfree() call rcu call backs)

__dev_map_entry_free() is a bit more complex, so we need to make
sure that call queued __dev_map_entry_free() callbacks have completed.

sysbot report:

BUG: KASAN: use-after-free in dev_map_flush_old kernel/bpf/devmap.c:365
[inline]
BUG: KASAN: use-after-free in __dev_map_entry_free+0x2a8/0x300
kernel/bpf/devmap.c:379
Read of size 8 at addr ffff8801b8da38c8 by task ksoftirqd/1/18

CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.17.0+ #39
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
  dev_map_flush_old kernel/bpf/devmap.c:365 [inline]
  __dev_map_entry_free+0x2a8/0x300 kernel/bpf/devmap.c:379
  __rcu_reclaim kernel/rcu/rcu.h:178 [inline]
  rcu_do_batch kernel/rcu/tree.c:2558 [inline]
  invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
  __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
  rcu_process_callbacks+0xe9d/0x1760 kernel/rcu/tree.c:2802
  __do_softirq+0x2e0/0xaf5 kernel/softirq.c:284
  run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
  smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

Allocated by task 6675:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
  kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
  kmalloc include/linux/slab.h:513 [inline]
  kzalloc include/linux/slab.h:706 [inline]
  dev_map_alloc+0x208/0x7f0 kernel/bpf/devmap.c:102
  find_and_alloc_map kernel/bpf/syscall.c:129 [inline]
  map_create+0x393/0x1010 kernel/bpf/syscall.c:453
  __do_sys_bpf kernel/bpf/syscall.c:2351 [inline]
  __se_sys_bpf kernel/bpf/syscall.c:2328 [inline]
  __x64_sys_bpf+0x303/0x510 kernel/bpf/syscall.c:2328
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 26:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kfree+0xd9/0x260 mm/slab.c:3813
  dev_map_free+0x4fa/0x670 kernel/bpf/devmap.c:191
  bpf_map_free_deferred+0xba/0xf0 kernel/bpf/syscall.c:262
  process_one_work+0xc64/0x1b70 kernel/workqueue.c:2153
  worker_thread+0x181/0x13a0 kernel/workqueue.c:2296
  kthread+0x345/0x410 kernel/kthread.c:240
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

The buggy address belongs to the object at ffff8801b8da37c0
  which belongs to the cache kmalloc-512 of size 512
The buggy address is located 264 bytes inside of
  512-byte region [ffff8801b8da37c0, ffff8801b8da39c0)
The buggy address belongs to the page:
page:ffffea0006e368c0 count:1 mapcount:0 mapping:ffff8801da800940
index:0xffff8801b8da3540
flags: 0x2fffc0000000100(slab)
raw: 02fffc0000000100 ffffea0007217b88 ffffea0006e30cc8 ffff8801da800940
raw: ffff8801b8da3540 ffff8801b8da3040 0000000100000004 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8801b8da3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
  ffff8801b8da3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801b8da3880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                               ^
  ffff8801b8da3900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8801b8da3980: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc

Fixes: 546ac1ffb7 ("bpf: add devmap, a map for storing net device references")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+457d3e2ffbcf31aee5c0@syzkaller.appspotmail.com
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-14 01:25:49 +02:00
Linus Torvalds
a3958f5e13 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Fixes all over:

   1) Netdev refcnt leak in nf_flow_table, from Taehee Yoo.

   2) Fix RCU usage in nf_tables, from Florian Westphal.

   3) Fix DSA build when NET_DSA_TAG_BRCM_PREPEND is not set, from Yue
      Haibing.

   4) Add missing page read/write ops to realtek driver, from Heiner
      Kallweit.

   5) Endianness fix in qrtr code, from Nicholas Mc Guire.

   6) Fix various bugs in DSA_SKB_* macros, from Vladimir Oltean.

   7) Several BPF documentation cures, from Quentin Monnet.

   8) Fix undefined behavior in narrow load handling of BPF verifier,
      from Krzesimir Nowak.

   9) DMA ops crash in SGI Seeq driver due to not set netdev parent
      device pointer, from Thomas Bogendoerfer.

  10) Flow dissector has to disable preemption when invoking BPF
      program, from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
  net: ethernet: stmmac: dwmac-sun8i: enable support of unicast filtering
  net: ethernet: ti: netcp_ethss: fix build
  flow_dissector: disable preemption around BPF calls
  bonding: fix arp_validate toggling in active-backup mode
  net: meson: fixup g12a glue ephy id
  net: phy: realtek: Replace phy functions with non-locked version in rtl8211e_config_init()
  net: seeq: fix crash caused by not set dev.parent
  of_net: Fix missing of_find_device_by_node ref count drop
  net: mvpp2: cls: Add missing NETIF_F_NTUPLE flag
  bpf: fix undefined behavior in narrow load handling
  libbpf: detect supported kernel BTF features and sanitize BTF
  selftests: bpf: Add files generated after build to .gitignore
  tools: bpf: synchronise BPF UAPI header with tools
  bpf: fix minor issues in documentation for BPF helpers.
  bpf: fix recurring typo in documentation for BPF helpers
  bpf: fix script for generating man page on BPF helpers
  bpf: add various test cases for backward jumps
  net: dccp : proto: remove Unneeded variable "err"
  net: dsa: Remove the now unused DSA_SKB_CB_COPY() macro
  net: dsa: Remove dangerous DSA_SKB_CLONE() macro
  ...
2019-05-13 15:15:00 -07:00
Krzesimir Nowak
e2f7fc0ac6 bpf: fix undefined behavior in narrow load handling
Commit 31fd85816d ("bpf: permits narrower load from bpf program
context fields") made the verifier add AND instructions to clear the
unwanted bits with a mask when doing a narrow load. The mask is
computed with

  (1 << size * 8) - 1

where "size" is the size of the narrow load. When doing a 4 byte load
of a an 8 byte field the verifier shifts the literal 1 by 32 places to
the left. This results in an overflow of a signed integer, which is an
undefined behavior. Typically, the computed mask was zero, so the
result of the narrow load ended up being zero too.

Cast the literal to long long to avoid overflows. Note that narrow
load of the 4 byte fields does not have the undefined behavior,
because the load size can only be either 1 or 2 bytes, so shifting 1
by 8 or 16 places will not overflow it. And reading 4 bytes would not
be a narrow load of a 4 bytes field.

Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Reviewed-by: Alban Crequy <alban@kinvolk.io>
Reviewed-by: Iago López Galeiras <iago@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-05-13 02:05:50 +02:00
Dan Carpenter
b586627e10 kdb: do a sanity check on the cpu in kdb_per_cpu()
The "whichcpu" comes from argv[3].  The cpu_online() macro looks up the
cpu in a bitmap of online cpus, but if the value is too high then it
could read beyond the end of the bitmap and possibly Oops.

Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-05-12 09:50:44 +01:00
Douglas Anderson
ecebc5ce59 kdb: Get rid of broken attempt to print CCVERSION in kdb summary
If you drop into kdb and type "summary", it prints out a line that
says this:

  ccversion  CCVERSION

...and I don't mean that it actually prints out the version of the C
compiler.  It literally prints out the string "CCVERSION".

The version of the C Compiler is already printed at boot up and it
doesn't seem useful to replicate this in kdb.  Let's just delete it.
We can also delete the bit of the Makefile that called the C compiler
in an attempt to pass this into kdb.  This will remove one extra call
to the C compiler at Makefile parse time and (very slightly) speed up
builds.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-05-12 09:50:43 +01:00
Linus Torvalds
8148c17b17 This is the bulk of the GPIO changes for the v5.2 kernel cycle:
Core changes:
 - The gpiolib MMIO driver has been enhanced to handle two direction
   registers, i.e. one register to set lines as input and one register
   to set lines as output. It turns out some silicon engineer thinks
   the ability to configure a line as input and output at the same
   time makes sense, this can be debated but includes a lot of analog
   electronics reasoning, and the registers are there and need to
   be handled consistently. Unsurprisingly, we enforce the lines to
   be either inputs or outputs in such schemes.
 - Send in the proper argument value to .set_config() dispatched to
   the pin control subsystem. Nobody used it before, now someone
   does, so fix it to work as expected.
 - The ACPI gpiolib portions can now handle pin bias setting (pull up
   or pull down). This has been in the ACPI spec for years and we
   finally have it properly integrated with Linux GPIOs. It was based
   on an observation from Andy Schevchenko that Thomas Petazzoni's
   changes to the core for biasing the PCA950x GPIO expander actually
   happen to fit hand-in-glove with what the ACPI core needed.
   Such nice synergies happen sometimes.
 
 New drivers:
 - A new driver for the Mellanox BlueField GPIO controller. This is
   using 64bit MMIO registers and can configure lines as inputs
   and outputs at the same time and after improving the MMIO library
   we handle it just fine. Interesting.
 - A new IXP4xx proper gpiochip driver with hierarchical interrupts
   should be coming in from the ARM SoC tree as well.
 
 Driver enhancements:
 - The PCA053x driver handles the CAT9554 GPIO expander.
 - The PCA053x driver handles the NXP PCAL6416 GPIO expander.
 - Wake-up support on PCA053x GPIO lines.
 - OMAP now does a nice asynchronous IRQ handling on wake-ups by
   letting everything wake up on edges, and this makes runtime PM
   work as expected too.
 
 Misc:
 - Several cleanups such as devres fixes.
 - Get rid of some languager comstructs that cause problems when
   compiling with LLVMs clang.
 - Documentation review and update.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJc1olZAAoJEEEQszewGV1zEU4P/RmTf3hG8xmNPS3MDTmR6gAy
 /YJOXjXBf3CD/dmEAyyaNLnUQismrtRNvHSoEGbno7gkU+htzp9UfUJkj6+HIXs2
 RpF+Hi78HzZNDxGWuBLu6OZolpmBtx+sRKOhHk/XfNS45qd1FgXWDuulzsYa9Xsr
 hYMXdtdv9wY/vcc68q1rtKAbzlu5ZNCa3Zj1iNOr/XQt3Nl2BW66hGLgjK4mOvgx
 fJy4rFXuDIMfDvo69U1Opz2b39sfE7XMhfZS/MOgg4yEV9zGRgDoI1tyMcTqGb8Q
 8LQbp5dXkP+3dJQB8tgbu3Vk4WC1Rd/pmIli5sMgsk0HYQ6XegfT6HJKozSmwN9r
 0s8jKlrocWZvdPo1aJwQgtRS56t2rFWcrcRye8bLqxkkW5cYIq9CwkE8USwB31Kv
 PFpoOwRuCtj0gkCxf7WIEcC5NAkYPow3K1KPdk3E0Si6I3pj0NqqlaAD0JAlkC2V
 aPq3xbTuFCAdmcADEt2Z+dUJ7WIs5Y9oQgosMAx+A2AD4K3QDBMu3pZsT6SCu4XZ
 mK0eWJi9/CvOj/s7bA0BEJVxQA+p8KYsNRBOULg/8aAOqGcLnSydQjqrxDTE8YrL
 xmmRG7i7ht0B9CchZuIB5hqdvjbCgvcVa5OnCUDfLxE0GdCx8iJ9y9OrsMXbabYq
 8FcPDo1N38cTYLnLqvKI
 =rhto
 -----END PGP SIGNATURE-----

Merge tag 'gpio-v5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull gpio updates from Linus Walleij:
 "This is the bulk of the GPIO changes for the v5.2 kernel cycle. A bit
  later than usual because I was ironing out my own mistakes. I'm
  holding some stuff back for the next kernel as a result, and this
  should be a healthy and well tested batch.

  Core changes:

   - The gpiolib MMIO driver has been enhanced to handle two direction
     registers, i.e. one register to set lines as input and one register
     to set lines as output. It turns out some silicon engineer thinks
     the ability to configure a line as input and output at the same
     time makes sense, this can be debated but includes a lot of analog
     electronics reasoning, and the registers are there and need to be
     handled consistently. Unsurprisingly, we enforce the lines to be
     either inputs or outputs in such schemes.

   - Send in the proper argument value to .set_config() dispatched to
     the pin control subsystem. Nobody used it before, now someone does,
     so fix it to work as expected.

   - The ACPI gpiolib portions can now handle pin bias setting (pull up
     or pull down). This has been in the ACPI spec for years and we
     finally have it properly integrated with Linux GPIOs. It was based
     on an observation from Andy Schevchenko that Thomas Petazzoni's
     changes to the core for biasing the PCA950x GPIO expander actually
     happen to fit hand-in-glove with what the ACPI core needed. Such
     nice synergies happen sometimes.

  New drivers:

   - A new driver for the Mellanox BlueField GPIO controller. This is
     using 64bit MMIO registers and can configure lines as inputs and
     outputs at the same time and after improving the MMIO library we
     handle it just fine. Interesting.

   - A new IXP4xx proper gpiochip driver with hierarchical interrupts
     should be coming in from the ARM SoC tree as well.

  Driver enhancements:

   - The PCA053x driver handles the CAT9554 GPIO expander.

   - The PCA053x driver handles the NXP PCAL6416 GPIO expander.

   - Wake-up support on PCA053x GPIO lines.

   - OMAP now does a nice asynchronous IRQ handling on wake-ups by
     letting everything wake up on edges, and this makes runtime PM work
     as expected too.

  Misc:

   - Several cleanups such as devres fixes.

   - Get rid of some languager comstructs that cause problems when
     compiling with LLVMs clang.

   - Documentation review and update"

* tag 'gpio-v5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (85 commits)
  gpio: Update documentation
  docs: gpio: convert docs to ReST and rename to *.rst
  gpio: sch: Remove write-only core_base
  gpio: pxa: Make two symbols static
  gpiolib: acpi: Respect pin bias setting
  gpiolib: acpi: Add acpi_gpio_update_gpiod_lookup_flags() helper
  gpiolib: acpi: Set pin value, based on bias, more accurately
  gpiolib: acpi: Change type of dflags
  gpiolib: Introduce GPIO_LOOKUP_FLAGS_DEFAULT
  gpiolib: Make use of enum gpio_lookup_flags consistent
  gpiolib: Indent entry values of enum gpio_lookup_flags
  gpio: pca953x: add support for pca6416
  dt-bindings: gpio: pca953x: document the nxp,pca6416
  gpio: pca953x: add pcal6416 to the of_device_id table
  gpio: gpio-omap: Remove conditional pm_runtime handling for GPIO interrupts
  gpio: gpio-omap: configure edge detection for level IRQs for idle wakeup
  tracing: stop making gpio tracing configurable
  gpio: pca953x: Configure wake-up path when wake-up is enabled
  gpio: of: Optimize quirk checks
  gpio: mmio: Drop bgpio_dir_inverted
  ...
2019-05-11 10:54:43 -04:00
Daniel Borkmann
af959b18fd bpf: fix out of bounds backwards jmps due to dead code removal
systemtap folks reported the following splat recently:

  [ 7790.862212] WARNING: CPU: 3 PID: 26759 at arch/x86/kernel/kprobes/core.c:1022 kprobe_fault_handler+0xec/0xf0
  [...]
  [ 7790.864113] CPU: 3 PID: 26759 Comm: sshd Not tainted 5.1.0-0.rc7.git1.1.fc31.x86_64 #1
  [ 7790.864198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS[...]
  [ 7790.864314] RIP: 0010:kprobe_fault_handler+0xec/0xf0
  [ 7790.864375] Code: 48 8b 50 [...]
  [ 7790.864714] RSP: 0018:ffffc06800bdbb48 EFLAGS: 00010082
  [ 7790.864812] RAX: ffff9e2b75a16320 RBX: 0000000000000000 RCX: 0000000000000000
  [ 7790.865306] RDX: ffffffffffffffff RSI: 000000000000000e RDI: ffffc06800bdbbf8
  [ 7790.865514] RBP: ffffc06800bdbbf8 R08: 0000000000000000 R09: 0000000000000000
  [ 7790.865960] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc06800bdbbf8
  [ 7790.866037] R13: ffff9e2ab56a0418 R14: ffff9e2b6d0bb400 R15: ffff9e2b6d268000
  [ 7790.866114] FS:  00007fde49937d80(0000) GS:ffff9e2b75a00000(0000) knlGS:0000000000000000
  [ 7790.866193] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 7790.866318] CR2: 0000000000000000 CR3: 000000012f312000 CR4: 00000000000006e0
  [ 7790.866419] Call Trace:
  [ 7790.866677]  do_user_addr_fault+0x64/0x480
  [ 7790.867513]  do_page_fault+0x33/0x210
  [ 7790.868002]  async_page_fault+0x1e/0x30
  [ 7790.868071] RIP: 0010:          (null)
  [ 7790.868144] Code: Bad RIP value.
  [ 7790.868229] RSP: 0018:ffffc06800bdbca8 EFLAGS: 00010282
  [ 7790.868362] RAX: ffff9e2b598b60f8 RBX: ffffc06800bdbe48 RCX: 0000000000000004
  [ 7790.868629] RDX: 0000000000000004 RSI: ffffc06800bdbc6c RDI: ffff9e2b598b60f0
  [ 7790.868834] RBP: ffffc06800bdbcf8 R08: 0000000000000000 R09: 0000000000000004
  [ 7790.870432] R10: 00000000ff6f7a03 R11: 0000000000000000 R12: 0000000000000001
  [ 7790.871859] R13: ffffc06800bdbcb8 R14: 0000000000000000 R15: ffff9e2acd0a5310
  [ 7790.873455]  ? vfs_read+0x5/0x170
  [ 7790.874639]  ? vfs_read+0x1/0x170
  [ 7790.875834]  ? trace_call_bpf+0xf6/0x260
  [ 7790.877044]  ? vfs_read+0x1/0x170
  [ 7790.878208]  ? vfs_read+0x5/0x170
  [ 7790.879345]  ? kprobe_perf_func+0x233/0x260
  [ 7790.880503]  ? vfs_read+0x1/0x170
  [ 7790.881632]  ? vfs_read+0x5/0x170
  [ 7790.882751]  ? kprobe_ftrace_handler+0x92/0xf0
  [ 7790.883926]  ? __vfs_read+0x30/0x30
  [ 7790.885050]  ? ftrace_ops_assist_func+0x94/0x100
  [ 7790.886183]  ? vfs_read+0x1/0x170
  [ 7790.887283]  ? vfs_read+0x5/0x170
  [ 7790.888348]  ? ksys_read+0x5a/0xe0
  [ 7790.889389]  ? do_syscall_64+0x5c/0xa0
  [ 7790.890401]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe

After some debugging, turns out that the logic in 2cbd95a5c4
("bpf: change parameters of call/branch offset adjustment") has
a bug that is exposed after 52875a04f4 ("bpf: verifier: remove
dead code") in that we miss some of the jump offset adjustments
after code patching when we remove dead code, more concretely,
upon backward jump spanning over the area that is being removed.

BPF insns of a case that was hit pre 52875a04f4:

  [...]
  676: (85) call bpf_perf_event_output#-47616
  677: (05) goto pc-636
  678: (62) *(u32 *)(r10 -64) = 0
  679: (bf) r7 = r10
  680: (07) r7 += -64
  681: (05) goto pc-44
  682: (05) goto pc-1
  683: (05) goto pc-1

BPF insns afterwards:

  [...]
  618: (85) call bpf_perf_event_output#-47616
  619: (05) goto pc-638
  620: (62) *(u32 *)(r10 -64) = 0
  621: (bf) r7 = r10
  622: (07) r7 += -64
  623: (05) goto pc-44

To illustrate the bug, situation looks as follows:
     ____
  0 |    | <-- foo: [...]
  1 |____|
  2 |____| <-- pos / end_new  ^
  3 |    |                    |
  4 |    |                    |  len
  5 |____|                    |  (remove region)
  6 |    | <-- end_old        v
  7 |    |
  8 |    | <-- curr  (jmp foo)
  9 |____|

The condition curr >= end_new && curr + off + 1 < end_new in the
branch delta adjustments is never hit because curr + off + 1 <
end_new is compared as unsigned and therefore curr + off + 1 >
end_new in unsigned realm as curr + off + 1 becomes negative
since the insns are memmove()'d before the offset adjustments.

Correct BPF insns after this fix:

  [...]
  618: (85) call bpf_perf_event_output#-47216
  619: (05) goto pc-578
  620: (62) *(u32 *)(r10 -64) = 0
  621: (bf) r7 = r10
  622: (07) r7 += -64
  623: (05) goto pc-44

Note that unprivileged case is not affected from this.

Fixes: 52875a04f4 ("bpf: verifier: remove dead code")
Fixes: 2cbd95a5c4 ("bpf: change parameters of call/branch offset adjustment")
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-10 18:49:27 -07:00
Jiri Kosina
56e33afd77 livepatch: Remove klp_check_compiler_support()
The only purpose of klp_check_compiler_support() is to make sure that we
are not using ftrace on x86 via mcount (because that's executed only after
prologue has already happened, and that's too late for livepatching
purposes).

Now that mcount is not supported by ftrace any more, there is no need for
klp_check_compiler_support() either.

Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1905102346100.17054@cbobk.fhfr.pm

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-10 17:53:29 -04:00
Christian Brauner
c3b7112df8
fork: do not release lock that wasn't taken
Avoid calling cgroup_threadgroup_change_end() without having called
cgroup_threadgroup_change_begin() first.

During process creation we need to check whether the cgroup we are in
allows us to fork. To perform this check the cgroup needs to guard itself
against threadgroup changes and takes a lock.
Prior to CLONE_PIDFD the cleanup target "bad_fork_free_pid" would also need
to call cgroup_threadgroup_change_end() because said lock had already been
taken.
However, this is not the case anymore with the addition of CLONE_PIDFD. We
are now allocating a pidfd before we check whether the cgroup we're in can
fork and thus prior to taking the lock. So when copy_process() fails at the
right step it would release a lock we haven't taken.
This bug is not even very subtle to be honest. It's just not very clear
from the naming of cgroup_threadgroup_change_{begin,end}() that a lock is
taken.

Here's the relevant splat:

entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(depth <= 0)
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052 __lock_release
kernel/locking/lockdep.c:4052 [inline]
WARNING: CPU: 1 PID: 7744 at kernel/locking/lockdep.c:4052
lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 7744 Comm: syz-executor007 Not tainted 5.1.0+ #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x65c kernel/panic.c:214
  __warn.cold+0x20/0x45 kernel/panic.c:566
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:972
RIP: 0010:__lock_release kernel/locking/lockdep.c:4052 [inline]
RIP: 0010:lock_release+0x667/0xa00 kernel/locking/lockdep.c:4321
Code: 0f 85 a0 03 00 00 8b 35 77 66 08 08 85 f6 75 23 48 c7 c6 a0 55 6b 87
48 c7 c7 40 25 6b 87 4c 89 85 70 ff ff ff e8 b7 a9 eb ff <0f> 0b 4c 8b 85
70 ff ff ff 4c 89 ea 4c 89 e6 4c 89 c7 e8 52 63 ff
RSP: 0018:ffff888094117b48 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 1ffff11012822f6f RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815af236 RDI: ffffed1012822f5b
RBP: ffff888094117c00 R08: ffff888092bfc400 R09: fffffbfff113301d
R10: fffffbfff113301c R11: ffffffff889980e3 R12: ffffffff8a451df8
R13: ffffffff8142e71f R14: ffffffff8a44cc80 R15: ffff888094117bd8
  percpu_up_read.constprop.0+0xcb/0x110 include/linux/percpu-rwsem.h:92
  cgroup_threadgroup_change_end include/linux/cgroup-defs.h:712 [inline]
  copy_process.part.0+0x47ff/0x6710 kernel/fork.c:2222
  copy_process kernel/fork.c:1772 [inline]
  _do_fork+0x25d/0xfd0 kernel/fork.c:2338
  __do_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:240 [inline]
  __se_compat_sys_x86_clone arch/x86/ia32/sys_ia32.c:236 [inline]
  __ia32_compat_sys_x86_clone+0xbc/0x140 arch/x86/ia32/sys_ia32.c:236
  do_syscall_32_irqs_on arch/x86/entry/common.c:334 [inline]
  do_fast_syscall_32+0x281/0xd54 arch/x86/entry/common.c:405
  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fec849
Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90
90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffed5a8c EFLAGS: 00000246 ORIG_RAX: 0000000000000078
RAX: ffffffffffffffda RBX: 0000000000003ffc RCX: 0000000000000000
RDX: 00000000200005c0 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..

Reported-and-tested-by: syzbot+3286e58549edc479faae@syzkaller.appspotmail.com
Fixes: b3e5838252 ("clone: add CLONE_PIDFD")
Signed-off-by: Christian Brauner <christian@brauner.io>
2019-05-10 14:26:12 +02:00
Linus Torvalds
abde77eb5c Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "This includes Roman's cgroup2 freezer implementation.

  It's a separate machanism from cgroup1 freezer. Instead of blocking
  user tasks in arbitrary uninterruptible sleeps, the new implementation
  extends jobctl stop - frozen tasks are trapped in jobctl stop until
  thawed and can be killed and ptraced. Lots of thanks to Oleg for
  sheperding the effort.

  Other than that, there are a few trivial changes"

* 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: never call do_group_exit() with task->frozen bit set
  kernel: cgroup: fix misuse of %x
  cgroup: get rid of cgroup_freezer_frozen_exit()
  cgroup: prevent spurious transition into non-frozen state
  cgroup: Remove unused cgrp variable
  cgroup: document cgroup v2 freezer interface
  cgroup: add tracing points for cgroup v2 freezer
  cgroup: make TRACE_CGROUP_PATH irq-safe
  kselftests: cgroup: add freezer controller self-tests
  kselftests: cgroup: don't fail on cg_kill_all() error in cg_destroy()
  cgroup: cgroup v2 freezer
  cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
  cgroup: implement __cgroup_task_count() helper
  cgroup: rename freezer.c into legacy_freezer.c
  cgroup: remove extra cgroup_migrate_finish() call
2019-05-09 13:52:12 -07:00
Linus Torvalds
23c970608a Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Only three commits, of which two are trivial.

  The non-trivial chagne is Thomas's patch to switch workqueue from
  sched RCU to regular one. The use of sched RCU is mostly historic and
  doesn't really buy us anything noticeable"

* 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Use normal rcu
  kernel/workqueue: Document wq_worker_last_func() argument
  kernel/workqueue: Use __printf markup to silence compiler in function 'alloc_workqueue'
2019-05-09 13:48:52 -07:00
Linus Torvalds
7664cd6e3a Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull intgrity updates from James Morris:
 "This contains just three patches, the remainder were either included
  in other pull requests (eg. audit, lockdown) or will be upstreamed via
  other subsystems (eg. kselftests, Power).

  Included here is one bug fix, one documentation update, and extending
  the x86 IMA arch policy rules to coordinate the different kernel
  module signature verification methods"

* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  doc/kernel-parameters.txt: Deprecate ima_appraise_tcb
  x86/ima: add missing include
  x86/ima: require signed kernel modules
2019-05-09 12:54:40 -07:00
Linus Torvalds
ddab5337b2 DMA mapping updates for 5.2
- remove the already broken support for NULL dev arguments to the
    DMA API calls
  - Kconfig tidyups
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlzT00YLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYO66hAAx2kCUIh+K2gFB5uxHqZiG62UmRjkPzolxcR5/Jx9
 4Rz6NRAE+rp8v2fbBr2bveDx7cF5bm1L+pRyRsFMfwkm3a8dCHQ51ldIm5VFoI3e
 NiX6Zoxk02BCXP/Qk//aHeNW9dBmuiemiXzdPEhOvWvVzqTO5JZrECQpkHEkG+8A
 R/IWU15sr5xzw9Td/HVN9CRJri/qiTAuB9nSoP6BGjZeHkQjREJKNMGKDTvSzH4L
 tlyD1G7yEymQvLBqGGO64ztuav00l8sqjI3tn1mmwpw4VTajabeRHPnWh+7g9Od+
 sH1pRvIOTvEMc456fizufYIOedB5Ze344kgfrxhngRbBVXmMfShr8ZLzdIUGhGjY
 1cdGqIUOEKywiDf13KrHVkNU+lJtvjMCMxvV93mAYRLOIQg0Jf4T2kklgKyEhqrG
 rqFdbbtSBzmLjPyqc1FS0heDWmA+yJsKAumGcH4blJXCpsD1rHWGe0AJ34x+OHPT
 tw5l+P4zAH1eO1qHCtmxN9s0lXZv1VLcFkOrJH91LPvAhZsUCrdqDjyJpTUYaIao
 yzkiLbDwFO7SVoqzaVNlVZIJ/9LX0qfAnl2Atty+sAQomrQMoviNSzGbLSLQqhHN
 FbTIEBMxrxS49+3lfzHOS/lYPpJp6B31yotNM+6YpXmbRQZN5gjGNYBqhKD+7Rgn
 L0Y=
 =IdsP
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.2' of git://git.infradead.org/users/hch/dma-mapping

Pull DMA mapping updates from Christoph Hellwig:

 - remove the already broken support for NULL dev arguments to the DMA
   API calls

 - Kconfig tidyups

* tag 'dma-mapping-5.2' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: add a Kconfig symbol to indicate arch_dma_prep_coherent presence
  dma-mapping: remove an unnecessary NULL check
  x86/dma: Remove the x86_dma_fallback_dev hack
  dma-mapping: remove leftover NULL device support
  arm: use a dummy struct device for ISA DMA use of the DMA API
  pxa3xx-gcu: pass struct device to dma_mmap_coherent
  gbefb: switch to managed version of the DMA allocator
  da8xx-fb: pass struct device to DMA API functions
  parport_ip32: pass struct device to DMA API functions
  dma: select GENERIC_ALLOCATOR for DMA_REMAP
2019-05-09 08:40:55 -07:00
Roman Gushchin
f2b31bb598 cgroup: never call do_group_exit() with task->frozen bit set
I've got two independent reports that cgroup_task_frozen() check
in cgroup_exit() has been triggered by lkp libhugetlbfs-test and
LTP ptrace01 tests.

For example:
[   44.576072] WARNING: CPU: 1 PID: 3028 at kernel/cgroup/cgroup.c:5932 cgroup_exit+0x148/0x160
[   44.577724] Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sr_mod cdrom
bochs_drm sg ttm ata_generic pata_acpi ppdev drm_kms_helper snd_pcm syscopyarea aesni_intel snd_timer
sysfillrect sysimgblt snd crypto_simd cryptd glue_helper soundcore fb_sys_fops joydev drm serio_raw pcspkr
ata_piix libata i2c_piix4 floppy parport_pc parport ip_tables
[   44.583106] CPU: 1 PID: 3028 Comm: ptrace-write-hu Not tainted 5.1.0-rc3-00053-g9262503 #5
[   44.584600] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[   44.586116] RIP: 0010:cgroup_exit+0x148/0x160
[   44.587135] Code: 0f 84 50 ff ff ff 48 8b 85 c8 0c 00 00 48 8b 78 70 e8 ec 2e 00 00 e9 3b ff ff ff f0 ff 43 60
0f 88 72 21 89 00 e9 48 ff ff ff <0f> 0b e9 1b ff ff ff e8 3c 73 f4 ff 66 90 66 2e 0f 1f 84 00 00 00
[   44.590113] RSP: 0018:ffffb25702dcfd30 EFLAGS: 00010002
[   44.591167] RAX: ffff96a7fee32410 RBX: ffff96a7ff1d6000 RCX: dead000000000200
[   44.592446] RDX: ffff96a7ff1d6080 RSI: ffff96a7fec75290 RDI: ffff96a7fec75290
[   44.593715] RBP: ffff96a7fec745c0 R08: ffff96a7fec74658 R09: 0000000000000000
[   44.594985] R10: 0000000000000000 R11: 0000000000000001 R12: ffff96a7fec75101
[   44.596266] R13: ffff96a7fec745c0 R14: ffff96a7ff3bde30 R15: ffff96a7fec75130
[   44.597550] FS:  0000000000000000(0000) GS:ffff96a7dd700000(0000) knlGS:0000000000000000
[   44.598950] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[   44.600098] CR2: 00000000f7a00000 CR3: 000000000d20e000 CR4: 00000000000406e0
[   44.601417] Call Trace:
[   44.602777]  do_exit+0x337/0xc40
[   44.603677]  do_group_exit+0x3a/0xa0
[   44.604610]  get_signal+0x12e/0x8d0
[   44.605533]  ? __switch_to_asm+0x40/0x70
[   44.606503]  do_signal+0x36/0x650
[   44.607409]  ? __switch_to_asm+0x40/0x70
[   44.608383]  ? __schedule+0x267/0x860
[   44.609329]  exit_to_usermode_loop+0x89/0xf0
[   44.610349]  do_fast_syscall_32+0x251/0x2e3
[   44.611357]  entry_SYSENTER_compat+0x7f/0x91
[   44.612376] ---[ end trace e4ca5cfc4b7f7964 ]---

The problem is caused by the ptrace_signal() call in the for loop
in get_signal(). There is a cgroup_enter_frozen() call inside
ptrace_signal(), so after exit from ptrace_signal() the task->frozen
bit might be set. In this case do_group_exit() can be called with the
task->frozen bit set and trigger the warning. This is only place where
we can leave the loop with the task->frozen bit set and without
setting JOBCTL_TRAP_FREEZE and TIF_SIGPENDING.

To resolve this problem, let's move cgroup_leave_frozen(true) call to
just after the fatal label. If the task is going to die, the frozen
bit must be cleared no matter how we get into this point.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Reported-by: Qian Cai <cai@lca.pw>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-09 07:56:47 -07:00
Miroslav Lichvar
fdc6bae940 ntp: Allow TAI-UTC offset to be set to zero
The ADJ_TAI adjtimex mode sets the TAI-UTC offset of the system clock.
It is typically set by NTP/PTP implementations and it is automatically
updated by the kernel on leap seconds. The initial value is zero (which
applications may interpret as unknown), but this value cannot be set by
adjtimex. This limitation seems to go back to the original "nanokernel"
implementation by David Mills.

Change the ADJ_TAI check to accept zero as a valid TAI-UTC offset in
order to allow setting it back to the initial value.

Fixes: 153b5d054a ("ntp: support for TAI")
Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: https://lkml.kernel.org/r/20190417084833.7401-1-mlichvar@redhat.com
2019-05-09 10:46:58 +02:00
Srivatsa S. Bhat (VMware)
b941699760 tracing: Fix documentation about disabling options using trace_options
To disable a tracing option using the trace_options file, the option
name needs to be prefixed with 'no', and not suffixed, as the README
states. Fix it.

Link: http://lkml.kernel.org/r/154872690031.47356.5739053380942044586.stgit@srivatsa-ubuntu

Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:12 -04:00
Gustavo A. R. Silva
8623b00676 tracing: Replace kzalloc with kcalloc
Replace kzalloc() function with its 2-factor argument form, kcalloc().

This patch replaces cases of:

	kzalloc(a * b, gfp)

with:
	kcalloc(a, b, gfp)

This code was detected with the help of Coccinelle.

Link: http://lkml.kernel.org/r/20190115043408.GA23456@embeddedor

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:12 -04:00
Elazar Leibovich
cbe08bcbbe tracing: Fix partial reading of trace event's id file
When reading only part of the id file, the ppos isn't tracked correctly.
This is taken care by simple_read_from_buffer.

Reading a single byte, and then the next byte would result EOF.

While this seems like not a big deal, this breaks abstractions that
reads information from files unbuffered. See for example
https://github.com/golang/go/issues/29399

This code was mentioned as problematic in
commit cd458ba9d5
("tracing: Do not (ab)use trace_seq in event_id_read()")

An example C code that show this bug is:

  #include <stdio.h>
  #include <stdint.h>

  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(int argc, char **argv) {
    if (argc < 2)
      return 1;
    int fd = open(argv[1], O_RDONLY);
    char c;
    read(fd, &c, 1);
    printf("First  %c\n", c);
    read(fd, &c, 1);
    printf("Second %c\n", c);
  }

Then run with, e.g.

  sudo ./a.out /sys/kernel/debug/tracing/events/tcp/tcp_set_state/id

You'll notice you're getting the first character twice, instead of the
first two characters in the id file.

Link: http://lkml.kernel.org/r/20181231115837.4932-1-elazar@lightbitslabs.com

Cc: Orit Wasserman <orit.was@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 23725aeeab ("ftrace: provide an id file for each event")
Signed-off-by: Elazar Leibovich <elazar@lightbitslabs.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:12 -04:00
Anders Roxell
6fc2171c5c tracing: Allow RCU to run between postponed startup tests
When building a allmodconfig kernel for arm64 and boot that in qemu,
CONFIG_FTRACE_STARTUP_TEST gets enabled and that takes time so the
watchdog expires and prints out a message like this:
'watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]'
Depending on what the what test gets called from init_trace_selftests()
it stays minutes in the loop.
Rework so that function cond_resched() gets called in the
init_trace_selftests loop.

Link: http://lkml.kernel.org/r/20181130145622.26334-1-anders.roxell@linaro.org

Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:12 -04:00
Colin Ian King
bfcd631eb6 tracing: Fix white space issues in parse_pred() function
Trivial fix to clean up an indentation issue, a whole chunk of code
has an extra space in the indentation.

Link: http://lkml.kernel.org/r/20181109132312.20994-1-colin.king@canonical.com

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:12 -04:00
Rasmus Villemoes
0f5e5a3ab7 tracing: Eliminate const char[] auto variables
Automatic const char[] variables cause unnecessary code
generation. For example, the this_mod variable leads to

    3f04:       48 b8 5f 5f 74 68 69 73 5f 6d   movabs $0x6d5f736968745f5f,%rax # __this_m
    3f0e:       4c 8d 44 24 02                  lea    0x2(%rsp),%r8
    3f13:       48 8d 7c 24 10                  lea    0x10(%rsp),%rdi
    3f18:       48 89 44 24 02                  mov    %rax,0x2(%rsp)
    3f1d:       4c 89 e9                        mov    %r13,%rcx
    3f20:       b8 65 00 00 00                  mov    $0x65,%eax # e
    3f25:       48 c7 c2 00 00 00 00            mov    $0x0,%rdx
                        3f28: R_X86_64_32S      .rodata.str1.1+0x18d
    3f2c:       be 48 00 00 00                  mov    $0x48,%esi
    3f31:       c7 44 24 0a 6f 64 75 6c         movl   $0x6c75646f,0xa(%rsp) # odul
    3f39:       66 89 44 24 0e                  mov    %ax,0xe(%rsp)

i.e., the string gets built on the stack at runtime. Similar code can be
found for the other instances I'm replacing here. Putting the string
in .rodata reduces the combined .text+.rodata size and saves time and
stack space at runtime.

The simplest fix, and what I've done for the this_mod case, is to just
make the variable static.

However, for the "<faulted>" case where the same string is used twice,
that prevents the linker from merging those two literals, so instead use
a macro - that also keeps the two instances automatically in
sync (instead of only the compile-time strlen expression).

Finally, for the two runs of spaces, it turns out that the "build
these strings on the stack" is not the worst part of what gcc does -
it turns print_func_help_header_irq() into "if (tgid) { /*
print_event_info + five seq_printf calls */ } else { /* print
event_info + another five seq_printf */}". Taking inspiration from a
suggestion from Al Viro, use %.*s to make snprintf either stop after
the first two spaces or print the whole string. As a bonus, the
seq_printfs now fit on single lines (at least, they are not longer
than the existing ones in the function just above), making it easier
to see that the ascii art lines up.

x86-64 defconfig + CONFIG_FUNCTION_TRACER:

$ scripts/stackdelta /tmp/stackusage.{0,1}
./kernel/trace/ftrace.c ftrace_mod_callback     152     136     -16
./kernel/trace/trace.c  trace_default_header    56      32      -24
./kernel/trace/trace.c  tracing_mark_raw_write  96      72      -24
./kernel/trace/trace.c  tracing_mark_write      104     80      -24

bloat-o-meter

add/remove: 1/0 grow/shrink: 0/4 up/down: 14/-375 (-361)
Function                                     old     new   delta
this_mod                                       -      14     +14
ftrace_mod_callback                          577     542     -35
tracing_mark_raw_write                       444     374     -70
tracing_mark_write                           616     540     -76
trace_default_header                         600     406    -194

Link: http://lkml.kernel.org/r/20190320081757.6037-1-linux@rasmusvillemoes.dk

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:12 -04:00
Yangtao Li
5c173bedb2 ring-buffer: Fix mispelling of Calculate
It's not "Caculate".

Link: http://lkml.kernel.org/r/20181101154640.23162-1-tiny.windzz@gmail.com

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:12 -04:00
Masami Hiramatsu
3dd1f7f24f tracing: probeevent: Fix to make the type of $comm string
Fix to make the type of $comm "string".  If we set the other type to $comm
argument, it shows meaningless value or wrong data. Currently probe events
allow us to set string array type (e.g. ":string[2]"), or other digit types
like x8 on $comm. But since clearly $comm is just a string data, it should
not be fetched by other types including array.

Link: http://lkml.kernel.org/r/155723736241.9149.14582064184468574539.stgit@devnote2

Cc: Andreas Ziegler <andreas.ziegler@fau.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:11 -04:00
Masami Hiramatsu
489fe0096b tracing: probeevent: Do not accumulate on ret variable
Do not accumulate strlen result on "ret" local variable, because
it is accumulated on "total" local variable for array case.

Link: http://lkml.kernel.org/r/155723735237.9149.3192150444705457531.stgit@devnote2

Fixes: 40b53b7718 ("tracing: probeevent: Add array type support")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:11 -04:00
Masami Hiramatsu
4dd537aca2 tracing: uprobes: Re-enable $comm support for uprobe events
Since commit 533059281e ("tracing: probeevent: Introduce new
argument fetching code") dropped the $comm support from uprobe
events, this re-enables it.

For $comm support, uses strlcpy() instead of strncpy_from_user()
to copy current task's comm. Because it is in the kernel space,
strncpy_from_user() always fails to copy the comm.
This also uses strlen() instead of strnlen_user() to measure the
length of the comm.

Note that this uses -ECOMM as a token value to fetch the comm
string. If the user-space pointer points -ECOMM, it will be
translated to task->comm.

Link: http://lkml.kernel.org/r/155723734162.9149.4042756162201097965.stgit@devnote2

Fixes: 533059281e ("tracing: probeevent: Introduce new argument fetching code")
Reported-by: Andreas Ziegler <andreas.ziegler@fau.de>
Acked-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-08 12:15:11 -04:00
Linus Torvalds
80f232121b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

   2) Add fib_sync_mem to control the amount of dirty memory we allow to
      queue up between synchronize RCU calls, from David Ahern.

   3) Make flow classifier more lockless, from Vlad Buslov.

   4) Add PHY downshift support to aquantia driver, from Heiner
      Kallweit.

   5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
      contention on SLAB spinlocks in heavy RPC workloads.

   6) Partial GSO offload support in XFRM, from Boris Pismenny.

   7) Add fast link down support to ethtool, from Heiner Kallweit.

   8) Use siphash for IP ID generator, from Eric Dumazet.

   9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
      entries, from David Ahern.

  10) Move skb->xmit_more into a per-cpu variable, from Florian
      Westphal.

  11) Improve eBPF verifier speed and increase maximum program size,
      from Alexei Starovoitov.

  12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
      spinlocks. From Neil Brown.

  13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

  14) Improve link partner cap detection in generic PHY code, from
      Heiner Kallweit.

  15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
      Maguire.

  16) Remove SKB list implementation assumptions in SCTP, your's truly.

  17) Various cleanups, optimizations, and simplifications in r8169
      driver. From Heiner Kallweit.

  18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

  19) Switch PHY drivers over to use dynamic featue detection, from
      Heiner Kallweit.

  20) Support flow steering without masking in dpaa2-eth, from Ioana
      Ciocoi.

  21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
      Pirko.

  22) Increase the strict parsing of current and future netlink
      attributes, also export such policies to userspace. From Johannes
      Berg.

  23) Allow DSA tag drivers to be modular, from Andrew Lunn.

  24) Remove legacy DSA probing support, also from Andrew Lunn.

  25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
      Haabendal.

  26) Add a generic tracepoint for TX queue timeouts to ease debugging,
      from Cong Wang.

  27) More indirect call optimizations, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
  cxgb4: Fix error path in cxgb4_init_module
  net: phy: improve pause mode reporting in phy_print_status
  dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
  net: macb: Change interrupt and napi enable order in open
  net: ll_temac: Improve error message on error IRQ
  net/sched: remove block pointer from common offload structure
  net: ethernet: support of_get_mac_address new ERR_PTR error
  net: usb: smsc: fix warning reported by kbuild test robot
  staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
  net: dsa: support of_get_mac_address new ERR_PTR error
  net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
  vrf: sit mtu should not be updated when vrf netdev is the link
  net: dsa: Fix error cleanup path in dsa_init_module
  l2tp: Fix possible NULL pointer dereference
  taprio: add null check on sched_nest to avoid potential null pointer dereference
  net: mvpp2: cls: fix less than zero check on a u32 variable
  net_sched: sch_fq: handle non connected flows
  net_sched: sch_fq: do not assume EDT packets are ordered
  net: hns3: use devm_kcalloc when allocating desc_cb
  net: hns3: some cleanup for struct hns3_enet_ring
  ...
2019-05-07 22:03:58 -07:00
Linus Torvalds
d27fb65bc2 Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc dcache updates from Al Viro:
 "Most of this pile is putting name length into struct name_snapshot and
  making use of it.

  The beginning of this series ("ovl_lookup_real_one(): don't bother
  with strlen()") ought to have been split in two (separate switch of
  name_snapshot to struct qstr from overlayfs reaping the trivial
  benefits of that), but I wanted to avoid a rebase - by the time I'd
  spotted that it was (a) in -next and (b) close to 5.1-final ;-/"

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  audit_compare_dname_path(): switch to const struct qstr *
  audit_update_watch(): switch to const struct qstr *
  inotify_handle_event(): don't bother with strlen()
  fsnotify: switch send_to_group() and ->handle_event to const struct qstr *
  fsnotify(): switch to passing const struct qstr * for file_name
  switch fsnotify_move() to passing const struct qstr * for old_name
  ovl_lookup_real_one(): don't bother with strlen()
  sysv: bury the broken "quietly truncate the long filenames" logics
  nsfs: unobfuscate
  unexport d_alloc_pseudo()
2019-05-07 20:03:32 -07:00
Linus Torvalds
02aff8db64 audit/stable-5.2 PR 20190507
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAlzRrzoUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNc7hAApgsi+3Jf9i29mgrKdrTciZ35TegK
 C8pTlOIndpBcmdwDakR50/PgfMHdHll8M9TReVNEjbe0S+Ww5GTE7eWtL3YqoPC2
 MuXEqcriz6UNi5Xma6vCZrDznWLXkXnzMDoDoYGDSoKuUYxef0fuqxDBnERM60Ht
 s52+0XvR5ZseBw7I1KIv/ix2fXuCGq6eCdqassm0rvLPQ7bq6nWzFAlNXOLud303
 DjIWu6Op2EL0+fJSmG+9Z76zFjyEbhMIhw5OPDeH4eO3pxX29AIv0m0JlI7ZXxfc
 /VVC3r5G4WrsWxwKMstOokbmsQxZ5pB3ZaceYpco7U+9N2e3SlpsNM9TV+Y/0ac/
 ynhYa//GK195LpMXx1BmWmLpjBHNgL8MvQkVTIpDia0GT+5sX7+haDxNLGYbocmw
 A/mR+KM2jAU3QzNseGh6c659j3K4tbMIFMNxt7pUBxVPLafcccNngFGTpzCwu5GU
 b7y4d21g6g/3Irj14NYU/qS8dTjW0rYrCMDquTpxmMfZ2xYuSvQmnBw91NQzVBp2
 98L2/fsUG3yOa5MApgv+ryJySsIM+SW+7leKS5tjy/IJINzyPEZ85l3o8ck8X4eT
 nohpKc/ELmeyi3omFYq18ecvFf2YRS5jRnz89i9q65/3ESgGiC0wyGOhNTvjvsyv
 k4jT0slIK614aGk=
 =p8Fp
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "We've got a reasonably broad set of audit patches for the v5.2 merge
  window, the highlights are below:

   - The biggest change, and the source of all the arch/* changes, is
     the patchset from Dmitry to help enable some of the work he is
     doing around PTRACE_GET_SYSCALL_INFO.

     To be honest, including this in the audit tree is a bit of a
     stretch, but it does help move audit a little further along towards
     proper syscall auditing for all arches, and everyone else seemed to
     agree that audit was a "good" spot for this to land (or maybe they
     just didn't want to merge it? dunno.).

   - We can now audit time/NTP adjustments.

   - We continue the work to connect associated audit records into a
     single event"

* tag 'audit-pr-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: (21 commits)
  audit: fix a memory leak bug
  ntp: Audit NTP parameters adjustment
  timekeeping: Audit clock adjustments
  audit: purge unnecessary list_empty calls
  audit: link integrity evm_write_xattrs record to syscall event
  syscall_get_arch: add "struct task_struct *" argument
  unicore32: define syscall_get_arch()
  Move EM_UNICORE to uapi/linux/elf-em.h
  nios2: define syscall_get_arch()
  nds32: define syscall_get_arch()
  Move EM_NDS32 to uapi/linux/elf-em.h
  m68k: define syscall_get_arch()
  hexagon: define syscall_get_arch()
  Move EM_HEXAGON to uapi/linux/elf-em.h
  h8300: define syscall_get_arch()
  c6x: define syscall_get_arch()
  arc: define syscall_get_arch()
  Move EM_ARCOMPACT and EM_ARCV2 to uapi/linux/elf-em.h
  audit: Make audit_log_cap and audit_copy_inode static
  audit: connect LOGIN record to its syscall record
  ...
2019-05-07 19:06:04 -07:00
Linus Torvalds
498e8631f2 Merge branch 'stable/for-linus-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb updates from Konrad Rzeszutek Wilk:
 "Cleanups in the swiotlb code and extra debugfs knobs to help with the
  field diagnostics"

* 'stable/for-linus-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
  swiotlb-xen: ensure we have a single callsite for xen_dma_map_page
  swiotlb-xen: simplify the DMA sync method implementations
  swiotlb-xen: use ->map_page to implement ->map_sg
  swiotlb-xen: make instances match their method names
  swiotlb: save io_tlb_used to local variable before leaving critical section
  swiotlb: dump used and total slots when swiotlb buffer is full
2019-05-07 18:45:27 -07:00
Linus Torvalds
cf482a49af Driver core/kobject patches for 5.2-rc1
Here is the "big" set of driver core patches for 5.2-rc1
 
 There are a number of ACPI patches in here as well, as Rafael said they
 should go through this tree due to the driver core changes they
 required.  They have all been acked by the ACPI developers.
 
 There are also a number of small subsystem-specific changes in here, due
 to some changes to the kobject core code.  Those too have all been acked
 by the various subsystem maintainers.
 
 As for content, it's pretty boring outside of the ACPI changes:
   - spdx cleanups
   - kobject documentation updates
   - default attribute groups for kobjects
   - other minor kobject/driver core fixes
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXNHDbw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynDAgCfbb4LBR6I50wFXb8JM/R6cAS7qrsAn1unshKV
 8XCYcif2RxjtdJWXbjdm
 =/rLh
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core/kobject updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.2-rc1

  There are a number of ACPI patches in here as well, as Rafael said
  they should go through this tree due to the driver core changes they
  required. They have all been acked by the ACPI developers.

  There are also a number of small subsystem-specific changes in here,
  due to some changes to the kobject core code. Those too have all been
  acked by the various subsystem maintainers.

  As for content, it's pretty boring outside of the ACPI changes:
   - spdx cleanups
   - kobject documentation updates
   - default attribute groups for kobjects
   - other minor kobject/driver core fixes

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
  kobject: clean up the kobject add documentation a bit more
  kobject: Fix kernel-doc comment first line
  kobject: Remove docstring reference to kset
  firmware_loader: Fix a typo ("syfs" -> "sysfs")
  kobject: fix dereference before null check on kobj
  Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
  init/config: Do not select BUILD_BIN2C for IKCONFIG
  Provide in-kernel headers to make extending kernel easier
  kobject: Improve doc clarity kobject_init_and_add()
  kobject: Improve docs for kobject_add/del
  driver core: platform: Fix the usage of platform device name(pdev->name)
  livepatch: Replace klp_ktype_patch's default_attrs with groups
  cpufreq: schedutil: Replace default_attrs field with groups
  padata: Replace padata_attr_type default_attrs field with groups
  irqdesc: Replace irq_kobj_type's default_attrs field with groups
  net-sysfs: Replace ktype default_attrs field with groups
  block: Replace all ktype default_attrs with groups
  samples/kobject: Replace foo_ktype's default_attrs field with groups
  kobject: Add support for default attribute groups to kobj_type
  driver core: Postpone DMA tear-down until after devres release for probe failure
  ...
2019-05-07 13:01:40 -07:00
Linus Torvalds
eac7078a0f pidfd patches for v5.2-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE7btrcuORLb1XUhEwjrBW1T7ssS0FAlzReuoACgkQjrBW1T7s
 sS1uvBAA16pgnhRNxNTrp3LYft6lUWmF4n0baOTVtQNLhPjpwaOxHIrCBugkQCJB
 QcQ9IQSOvIkaEW0XAQoPBaeLviiKhHOFw1Fv89OtW6xUidSfSV15lcI9f1F2pCm2
 4yCL/8XvL6M0NhxiwftJAkWOXeDNLfjFnLwyLxBfgg3EeyqMgUB8raeosEID0ORR
 gm2/g8DYS2r+KNqM/F4xvMSgabfi2bGk+8BtAaVnftJfstpRNrqKwWnSK3Wspj1l
 5gkb8gSsiY6ns3V6RgNHrFlhevFg8V+VjcJt7FR+aUEjOkcoiXas/PhvamMzdsn/
 FM1F/A0pM8FSybIUClhnnnxNPc+p8ZN/71YQAPs+Mnh3xvbtKea2lkhC+Xv4OpK3
 edutSZWFaiIery82Rk00H3vqiSF1+kRIXSpZSS4mElk4FsVljkyH+nSP7rbmE2MR
 EQe+kKnZl8QzWrVbnODC+EVvvVpA2bXDvENJmvKqus+t2G0OdV7Iku3F5E3KjF8k
 S5RRV1zuBF3ugqnjmYrVmJtpEA8mxClmqvg6okru+qW6ngO5oOgVpPLjWn1CXcdj
 wcuQ6Pe1QwAHS54e9WSWgCHVssLvm9nCdCqypdNaoyGWmbTWntwlrY7Y0JUQnAbB
 6/G/DQQiCWY9y8bMZlTEydhIpgcsdROuPYv+oHF5+eQQthsWwHc=
 =LH11
 -----END PGP SIGNATURE-----

Merge tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd updates from Christian Brauner:
 "This patchset makes it possible to retrieve pidfds at process creation
  time by introducing the new flag CLONE_PIDFD to the clone() system
  call. Linus originally suggested to implement this as a new flag to
  clone() instead of making it a separate system call.

  After a thorough review from Oleg CLONE_PIDFD returns pidfds in the
  parent_tidptr argument. This means we can give back the associated pid
  and the pidfd at the same time. Access to process metadata information
  thus becomes rather trivial.

  As has been agreed, CLONE_PIDFD creates file descriptors based on
  anonymous inodes similar to the new mount api. They are made
  unconditional by this patchset as they are now needed by core kernel
  code (vfs, pidfd) even more than they already were before (timerfd,
  signalfd, io_uring, epoll etc.). The core patchset is rather small.
  The bulky looking changelist is caused by David's very simple changes
  to Kconfig to make anon inodes unconditional.

  A pidfd comes with additional information in fdinfo if the kernel
  supports procfs. The fdinfo file contains the pid of the process in
  the callers pid namespace in the same format as the procfs status
  file, i.e. "Pid:\t%d".

  To remove worries about missing metadata access this patchset comes
  with a sample/test program that illustrates how a combination of
  CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
  access to process metadata through /proc/<pid>.

  Further work based on this patchset has been done by Joel. His work
  makes pidfds pollable. It finished too late for this merge window. I
  would prefer to have it sitting in linux-next for a while and send it
  for inclusion during the 5.3 merge window"

* tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  samples: show race-free pidfd metadata access
  signal: support CLONE_PIDFD with pidfd_send_signal
  clone: add CLONE_PIDFD
  Make anon_inodes unconditional
2019-05-07 12:30:24 -07:00
Linus Torvalds
78438ce18f Merge branch 'stable-fodder' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs stable fodder fixes from Al Viro:

 - acct_on() fix for deadlock caught by overlayfs folks

 - autofs RCU use-after-free SNAFU (->d_manage() can be called
   locklessly, so we need to RCU-delay freeing the objects it looks at)

 - (hopefully) the end of "do we need freeing this dentry RCU-delayed"
   whack-a-mole.

* 'stable-fodder' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  autofs: fix use-after-free in lockless ->d_manage()
  dcache: sort the freeing-without-RCU-delay mess for good.
  acct_on(): don't mess with freeze protection
2019-05-07 11:17:26 -07:00
Linus Torvalds
168e153d5e Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs inode freeing updates from Al Viro:
 "Introduction of separate method for RCU-delayed part of
  ->destroy_inode() (if any).

  Pretty much as posted, except that destroy_inode() stashes
  ->free_inode into the victim (anon-unioned with ->i_fops) before
  scheduling i_callback() and the last two patches (sockfs conversion
  and folding struct socket_wq into struct socket) are excluded - that
  pair should go through netdev once davem reopens his tree"

* 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
  orangefs: make use of ->free_inode()
  shmem: make use of ->free_inode()
  hugetlb: make use of ->free_inode()
  overlayfs: make use of ->free_inode()
  jfs: switch to ->free_inode()
  fuse: switch to ->free_inode()
  ext4: make use of ->free_inode()
  ecryptfs: make use of ->free_inode()
  ceph: use ->free_inode()
  btrfs: use ->free_inode()
  afs: switch to use of ->free_inode()
  dax: make use of ->free_inode()
  ntfs: switch to ->free_inode()
  securityfs: switch to ->free_inode()
  apparmor: switch to ->free_inode()
  rpcpipe: switch to ->free_inode()
  bpf: switch to ->free_inode()
  mqueue: switch to ->free_inode()
  ufs: switch to ->free_inode()
  coda: switch to ->free_inode()
  ...
2019-05-07 10:57:05 -07:00
Linus Torvalds
0968621917 Printk changes for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
 lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
 EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
 r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
 FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
 uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
 lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
 waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
 wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
 =WZfC
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow state reset of printk_once() calls.

 - Prevent crashes when dereferencing invalid pointers in vsprintf().
   Only the first byte is checked for simplicity.

 - Make vsprintf warnings consistent and inlined.

 - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
   modifiers.

 - Some clean up of vsprintf and test_printf code.

* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  lib/vsprintf: Make function pointer_string static
  vsprintf: Limit the length of inlined error messages
  vsprintf: Avoid confusion between invalid address and value
  vsprintf: Prevent crash when dereferencing invalid pointers
  vsprintf: Consolidate handling of unknown pointer specifiers
  vsprintf: Factor out %pO handler as kobject_string()
  vsprintf: Factor out %pV handler as va_format()
  vsprintf: Factor out %p[iI] handler as ip_addr_string()
  vsprintf: Do not check address of well-known strings
  vsprintf: Consistent %pK handling for kptr_restrict == 0
  vsprintf: Shuffle restricted_pointer()
  printk: Tie printk_once / printk_deferred_once into .data.once for reset
  treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
  lib/test_printf: Switch to bitmap_zalloc()
2019-05-07 09:18:12 -07:00
Linus Torvalds
573de2a6e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Jiri Kosina:

 - livepatching kselftests improvements from Joe Lawrence and Miroslav
   Benes

 - making use of gcc's -flive-patching option when available, from
   Miroslav Benes

 - kobject handling cleanups, from Petr Mladek

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Remove duplicated code for early initialization
  livepatch: Remove custom kobject state handling
  livepatch: Convert error about unsupported reliable stacktrace into a warning
  selftests/livepatch: Add functions.sh to TEST_PROGS_EXTENDED
  kbuild: use -flive-patching when CONFIG_LIVEPATCH is enabled
  selftests/livepatch: use TEST_PROGS for test scripts
2019-05-07 08:56:04 -07:00
Linus Torvalds
78ee8b1b9b Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Just a few bugfixes and documentation updates"

* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  seccomp: fix up grammar in comment
  Revert "security: inode: fix a missing check for securityfs_create_file"
  Yama: mark function as static
  security: inode: fix a missing check for securityfs_create_file
  keys: safe concurrent user->{session,uid}_keyring access
  security: don't use RCU accessors for cred->session_keyring
  Yama: mark local symbols as static
  LSM: lsm_hooks.h: fix documentation format
  LSM: fix documentation for the shm_* hooks
  LSM: fix documentation for the sem_* hooks
  LSM: fix documentation for the msg_queue_* hooks
  LSM: fix documentation for the audit_* hooks
  LSM: fix documentation for the path_chmod hook
  LSM: fix documentation for the socket_getpeersec_dgram hook
  LSM: fix documentation for the task_setscheduler hook
  LSM: fix documentation for the socket_post_create hook
  LSM: fix documentation for the syslog hook
  LSM: fix documentation for sb_copy_data hook
2019-05-07 08:39:54 -07:00
Christian Brauner
2151ad1b06
signal: support CLONE_PIDFD with pidfd_send_signal
Let pidfd_send_signal() use pidfds retrieved via CLONE_PIDFD.  With this
patch pidfd_send_signal() becomes independent of procfs.  This fullfils
the request made when we merged the pidfd_send_signal() patchset.  The
pidfd_send_signal() syscall is now always available allowing for it to
be used by users without procfs mounted or even users without procfs
support compiled into the kernel.

Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2019-05-07 14:31:03 +02:00
Christian Brauner
b3e5838252
clone: add CLONE_PIDFD
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2019-05-07 14:31:03 +02:00
Waiman Long
a9e9bcb45b locking/rwsem: Prevent decrement of reader count before increment
During my rwsem testing, it was found that after a down_read(), the
reader count may occasionally become 0 or even negative. Consequently,
a writer may steal the lock at that time and execute with the reader
in parallel thus breaking the mutual exclusion guarantee of the write
lock. In other words, both readers and writer can become rwsem owners
simultaneously.

The current reader wakeup code does it in one pass to clear waiter->task
and put them into wake_q before fully incrementing the reader count.
Once waiter->task is cleared, the corresponding reader may see it,
finish the critical section and do unlock to decrement the count before
the count is incremented. This is not a problem if there is only one
reader to wake up as the count has been pre-incremented by 1.  It is
a problem if there are more than one readers to be woken up and writer
can steal the lock.

The wakeup was actually done in 2 passes before the following v4.9 commit:

  70800c3c0c ("locking/rwsem: Scan the wait_list for readers only once")

To fix this problem, the wakeup is now done in two passes
again. In the first pass, we collect the readers and count them.
The reader count is then fully incremented. In the second pass, the
waiter->task is then cleared and they are put into wake_q to be woken
up later.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Fixes: 70800c3c0c ("locking/rwsem: Scan the wait_list for readers only once")
Link: http://lkml.kernel.org/r/20190428212557.13482-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-07 08:46:46 +02:00
Linus Torvalds
81ff5d2cba Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:
   - Add support for AEAD in simd
   - Add fuzz testing to testmgr
   - Add panic_on_fail module parameter to testmgr
   - Use per-CPU struct instead multiple variables in scompress
   - Change verify API for akcipher

  Algorithms:
   - Convert x86 AEAD algorithms over to simd
   - Forbid 2-key 3DES in FIPS mode
   - Add EC-RDSA (GOST 34.10) algorithm

  Drivers:
   - Set output IV with ctr-aes in crypto4xx
   - Set output IV in rockchip
   - Fix potential length overflow with hashing in sun4i-ss
   - Fix computation error with ctr in vmx
   - Add SM4 protected keys support in ccree
   - Remove long-broken mxc-scc driver
   - Add rfc4106(gcm(aes)) cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits)
  crypto: ccree - use a proper le32 type for le32 val
  crypto: ccree - remove set but not used variable 'du_size'
  crypto: ccree - Make cc_sec_disable static
  crypto: ccree - fix spelling mistake "protedcted" -> "protected"
  crypto: caam/qi2 - generate hash keys in-place
  crypto: caam/qi2 - fix DMA mapping of stack memory
  crypto: caam/qi2 - fix zero-length buffer DMA mapping
  crypto: stm32/cryp - update to return iv_out
  crypto: stm32/cryp - remove request mutex protection
  crypto: stm32/cryp - add weak key check for DES
  crypto: atmel - remove set but not used variable 'alg_name'
  crypto: picoxcell - Use dev_get_drvdata()
  crypto: crypto4xx - get rid of redundant using_sd variable
  crypto: crypto4xx - use sync skcipher for fallback
  crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
  crypto: crypto4xx - fix ctr-aes missing output IV
  crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA
  crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o
  crypto: ccree - handle tee fips error during power management resume
  crypto: ccree - add function to handle cryptocell tee fips error
  ...
2019-05-06 20:15:06 -07:00
Linus Torvalds
8f5e823f91 Power management updates for 5.2-rc1
- Fix the handling of Performance and Energy Bias Hint (EPB) on
    Intel processors and expose it to user space via sysfs to avoid
    having to access it through the generic MSR I/F (Rafael Wysocki).
 
  - Improve the handling of global turbo changes made by the platform
    firmware in the intel_pstate driver (Rafael Wysocki).
 
  - Convert some slow-path static_cpu_has() callers to boot_cpu_has()
    in cpufreq (Borislav Petkov).
 
  - Fix the frequency calculation loop in the armada-37xx cpufreq
    driver (Gregory CLEMENT).
 
  - Fix possible object reference leaks in multuple cpufreq drivers
    (Wen Yang).
 
  - Fix kerneldoc comment in the centrino cpufreq driver (dongjian).
 
  - Clean up the ACPI and maple cpufreq drivers (Viresh Kumar, Mohan
    Kumar).
 
  - Add support for lx2160a and ls1028a to the qoriq cpufreq driver
    (Vabhav Sharma, Yuantian Tang).
 
  - Fix kobject memory leak in the cpufreq core (Viresh Kumar).
 
  - Simplify the IOwait boosting in the schedutil cpufreq governor
    and rework the TSC cpufreq notifier on x86 (Rafael Wysocki).
 
  - Clean up the cpufreq core and statistics code (Yue Hu, Kyle Lin).
 
  - Improve the cpufreq documentation, add SPDX license tags to
    some PM documentation files and unify copyright notices in
    them (Rafael Wysocki).
 
  - Add support for "CPU" domains to the generic power domains (genpd)
    framework and provide low-level PSCI firmware support for that
    feature (Ulf Hansson).
 
  - Rearrange the PSCI firmware support code and add support for
    SYSTEM_RESET2 to it (Ulf Hansson, Sudeep Holla).
 
  - Improve genpd support for devices in multiple power domains (Ulf
    Hansson).
 
  - Unify target residency for the AFTR and coupled AFTR states in the
    exynos cpuidle driver (Marek Szyprowski).
 
  - Introduce new helper routine in the operating performance points
    (OPP) framework (Andrew-sh.Cheng).
 
  - Add support for passing on-die termination (ODT) and auto power
    down parameters from the kernel to Trusted Firmware-A (TF-A) to
    the rk3399_dmc devfreq driver (Enric Balletbo i Serra).
 
  - Add tracing to devfreq (Lukasz Luba).
 
  - Make the exynos-bus devfreq driver suspend all devices on system
    shutdown (Marek Szyprowski).
 
  - Fix a few minor issues in the devfreq subsystem and clean it up
    somewhat (Enric Balletbo i Serra, MyungJoo Ham, Rob Herring,
    Saravana Kannan, Yangtao Li).
 
  - Improve system wakeup diagnostics (Stephen Boyd).
 
  - Rework filesystem sync messages emitted during system suspend and
    hibernation (Harry Pan).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAlzQEwUSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxxXwP/jrxikIXdCOV3CJVioV0NetyebwlOqYp
 UsIA7lQBfZ/DY6dHw/oKuAT9LP01vcFg6XGe83Alkta9qczR5KZ/MYHFNSZXjXjL
 kEvIMBCS/oykaBuW+Xn9am8Ke3Yq/rBSTKWVom3vzSQY0qvZ9GBwPDrzw+k63Zhz
 P3afB4ThyY0e9ftgw4HvSSNm13Kn0ItUIQOdaLatXMMcPqP5aAdnUma5Ibinbtpp
 rpTHuHKYx7MSjaCg6wl3kKTJeWbQP4wYO2ISZqH9zEwQgdvSHeFAvfPKTegUkmw9
 uUsQnPD1JvdglOKovr2muehD1Ur+zsjKDf2OKERkWsWXHPyWzA/AqaVv1mkkU++b
 KaWaJ9pE86kGlJ3EXwRbGfV0dM5rrl+dUUQW6nPI1XJnIOFlK61RzwAbqI26F0Mz
 AlKxY4jyPLcM3SpQz9iILqyzHQqB67rm29XvId/9scoGGgoqEI4S+v6LYZqI3Vx6
 aeSRu+Yof7p5w4Kg5fODX+HzrtMnMrPmLUTXhbExfsYZMi7hXURcN6s+tMpH0ckM
 4yiIpnNGCKUSV4vxHBm8XJdAuUnR4Vcz++yFslszgDVVvw5tkvF7SYeHZ6HqcQVm
 af9HdWzx3qajs/oyBwdRBedZYDnP1joC5donBI2ofLeF33NA7TEiPX8Zebw8XLkv
 fNikssA7PGdv
 =nY9p
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These fix the (Intel-specific) Performance and Energy Bias Hint (EPB)
  handling and expose it to user space via sysfs, fix and clean up
  several cpufreq drivers, add support for two new chips to the qoriq
  cpufreq driver, fix, simplify and clean up the cpufreq core and the
  schedutil governor, add support for "CPU" domains to the generic power
  domains (genpd) framework and provide low-level PSCI firmware support
  for that feature, fix the exynos cpuidle driver and fix a couple of
  issues in the devfreq subsystem and clean it up.

  Specifics:

   - Fix the handling of Performance and Energy Bias Hint (EPB) on Intel
     processors and expose it to user space via sysfs to avoid having to
     access it through the generic MSR I/F (Rafael Wysocki).

   - Improve the handling of global turbo changes made by the platform
     firmware in the intel_pstate driver (Rafael Wysocki).

   - Convert some slow-path static_cpu_has() callers to boot_cpu_has()
     in cpufreq (Borislav Petkov).

   - Fix the frequency calculation loop in the armada-37xx cpufreq
     driver (Gregory CLEMENT).

   - Fix possible object reference leaks in multuple cpufreq drivers
     (Wen Yang).

   - Fix kerneldoc comment in the centrino cpufreq driver (dongjian).

   - Clean up the ACPI and maple cpufreq drivers (Viresh Kumar, Mohan
     Kumar).

   - Add support for lx2160a and ls1028a to the qoriq cpufreq driver
     (Vabhav Sharma, Yuantian Tang).

   - Fix kobject memory leak in the cpufreq core (Viresh Kumar).

   - Simplify the IOwait boosting in the schedutil cpufreq governor and
     rework the TSC cpufreq notifier on x86 (Rafael Wysocki).

   - Clean up the cpufreq core and statistics code (Yue Hu, Kyle Lin).

   - Improve the cpufreq documentation, add SPDX license tags to some PM
     documentation files and unify copyright notices in them (Rafael
     Wysocki).

   - Add support for "CPU" domains to the generic power domains (genpd)
     framework and provide low-level PSCI firmware support for that
     feature (Ulf Hansson).

   - Rearrange the PSCI firmware support code and add support for
     SYSTEM_RESET2 to it (Ulf Hansson, Sudeep Holla).

   - Improve genpd support for devices in multiple power domains (Ulf
     Hansson).

   - Unify target residency for the AFTR and coupled AFTR states in the
     exynos cpuidle driver (Marek Szyprowski).

   - Introduce new helper routine in the operating performance points
     (OPP) framework (Andrew-sh.Cheng).

   - Add support for passing on-die termination (ODT) and auto power
     down parameters from the kernel to Trusted Firmware-A (TF-A) to the
     rk3399_dmc devfreq driver (Enric Balletbo i Serra).

   - Add tracing to devfreq (Lukasz Luba).

   - Make the exynos-bus devfreq driver suspend all devices on system
     shutdown (Marek Szyprowski).

   - Fix a few minor issues in the devfreq subsystem and clean it up
     somewhat (Enric Balletbo i Serra, MyungJoo Ham, Rob Herring,
     Saravana Kannan, Yangtao Li).

   - Improve system wakeup diagnostics (Stephen Boyd).

   - Rework filesystem sync messages emitted during system suspend and
     hibernation (Harry Pan)"

* tag 'pm-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
  cpufreq: Fix kobject memleak
  cpufreq: armada-37xx: fix frequency calculation for opp
  cpufreq: centrino: Fix centrino_setpolicy() kerneldoc comment
  cpufreq: qoriq: add support for lx2160a
  x86: tsc: Rework time_cpufreq_notifier()
  PM / Domains: Allow to attach a CPU via genpd_dev_pm_attach_by_id|name()
  PM / Domains: Search for the CPU device outside the genpd lock
  PM / Domains: Drop unused in-parameter to some genpd functions
  PM / Domains: Use the base device for driver_deferred_probe_check_state()
  cpufreq: qoriq: Add ls1028a chip support
  PM / Domains: Enable genpd_dev_pm_attach_by_id|name() for single PM domain
  PM / Domains: Allow OF lookup for multi PM domain case from ->attach_dev()
  PM / Domains: Don't kfree() the virtual device in the error path
  cpufreq: Move ->get callback check outside of __cpufreq_get()
  PM / Domains: remove unnecessary unlikely()
  cpufreq: Remove needless bios_limit check in show_bios_limit()
  drivers/cpufreq/acpi-cpufreq.c: This fixes the following checkpatch warning
  firmware/psci: add support for SYSTEM_RESET2
  PM / devfreq: add tracing for scheduling work
  trace: events: add devfreq trace event file
  ...
2019-05-06 19:40:31 -07:00
Linus Torvalds
c620f7bd0b arm64 updates for 5.2
Mostly just incremental improvements here:
 
 - Introduce AT_HWCAP2 for advertising CPU features to userspace
 
 - Expose SVE2 availability to userspace
 
 - Support for "data cache clean to point of deep persistence" (DC PODP)
 
 - Honour "mitigations=off" on the cmdline and advertise status via sysfs
 
 - CPU timer erratum workaround (Neoverse-N1 #1188873)
 
 - Introduce perf PMU driver for the SMMUv3 performance counters
 
 - Add config option to disable the kuser helpers page for AArch32 tasks
 
 - Futex modifications to ensure liveness under contention
 
 - Rework debug exception handling to seperate kernel and user handlers
 
 - Non-critical fixes and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAlzMFGgACgkQt6xw3ITB
 YzTicAf/TX1h1+ecbx4WJAa4qeiOCPoNpG9efldQumqJhKL44MR5bkhuShna5mwE
 ptm5qUXkZCxLTjzssZKnbdbgwa3t+emW8Of3D91IfI9akiZbMoDx5FGgcNbqjazb
 RLrhOFHwgontA38yppZN+DrL+sXbvif/CVELdHahkEx6KepSGaS2lmPXRmz/W56v
 4yIRy/zxc3Dhjgfm3wKh72nBwoZdLiIc4mchd5pthNlR9E2idrYkQegG1C+gA00r
 o8uZRVOWgoh7H+QJE+xLUc8PaNCg8xqRRXOuZYg9GOz6hh7zSWhm+f1nRz9S2tIR
 gIgsCHNqoO2I3E1uJpAQXDGtt2kFhA==
 =ulpJ
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "Mostly just incremental improvements here:

   - Introduce AT_HWCAP2 for advertising CPU features to userspace

   - Expose SVE2 availability to userspace

   - Support for "data cache clean to point of deep persistence" (DC PODP)

   - Honour "mitigations=off" on the cmdline and advertise status via
     sysfs

   - CPU timer erratum workaround (Neoverse-N1 #1188873)

   - Introduce perf PMU driver for the SMMUv3 performance counters

   - Add config option to disable the kuser helpers page for AArch32 tasks

   - Futex modifications to ensure liveness under contention

   - Rework debug exception handling to seperate kernel and user
     handlers

   - Non-critical fixes and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (92 commits)
  Documentation: Add ARM64 to kernel-parameters.rst
  arm64/speculation: Support 'mitigations=' cmdline option
  arm64: ssbs: Don't treat CPUs with SSBS as unaffected by SSB
  arm64: enable generic CPU vulnerabilites support
  arm64: add sysfs vulnerability show for speculative store bypass
  arm64: Fix size of __early_cpu_boot_status
  clocksource/arm_arch_timer: Use arch_timer_read_counter to access stable counters
  clocksource/arm_arch_timer: Remove use of workaround static key
  clocksource/arm_arch_timer: Drop use of static key in arch_timer_reg_read_stable
  clocksource/arm_arch_timer: Direcly assign set_next_event workaround
  arm64: Use arch_timer_read_counter instead of arch_counter_get_cntvct
  watchdog/sbsa: Use arch_timer_read_counter instead of arch_counter_get_cntvct
  ARM: vdso: Remove dependency with the arch_timer driver internals
  arm64: Apply ARM64_ERRATUM_1188873 to Neoverse-N1
  arm64: Add part number for Neoverse N1
  arm64: Make ARM64_ERRATUM_1188873 depend on COMPAT
  arm64: Restrict ARM64_ERRATUM_1188873 mitigation to AArch32
  arm64: mm: Remove pte_unmap_nested()
  arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variable
  arm64: compat: Reduce address limit for 64K pages
  ...
2019-05-06 17:54:22 -07:00
Linus Torvalds
dd4e5d6106 Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())
Remove mmiowb() from the kernel memory barrier API and instead, for
 architectures that need it, hide the barrier inside spin_unlock() when
 MMIO has been performed inside the critical section.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAlzMFaUACgkQt6xw3ITB
 YzRICQgAiv7wF/yIbBhDOmCNCAKDO59chvFQWxXWdGk/aAB56kwKAMXJgLOvlMG/
 VRuuLyParTFQETC3jaxKgnO/1hb+PZLDt2Q2KqixtjIzBypKUPWvK2sf6THhSRF1
 GK0DBVUd1rCrWrR815+SPb8el4xXtdBzvAVB+Fx35PXVNpdRdqCkK+EQ6UnXGokm
 rXXHbnfsnquBDtmb4CR4r2beH+aNElXbdt0Kj8VcE5J7f7jTdW3z6Q9WFRvdKmK7
 yrsxXXB2w/EsWXOwFp0SLTV5+fgeGgTvv8uLjDw+SG6t0E0PebxjNAflT7dPrbYL
 WecjKC9WqBxrGY+4ew6YJP70ijLBCw==
 =aC8m
 -----END PGP SIGNATURE-----

Merge tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull mmiowb removal from Will Deacon:
 "Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())

  Remove mmiowb() from the kernel memory barrier API and instead, for
  architectures that need it, hide the barrier inside spin_unlock() when
  MMIO has been performed inside the critical section.

  The only relatively recent changes have been addressing review
  comments on the documentation, which is in a much better shape thanks
  to the efforts of Ben and Ingo.

  I was initially planning to split this into two pull requests so that
  you could run the coccinelle script yourself, however it's been plain
  sailing in linux-next so I've just included the whole lot here to keep
  things simple"

* tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (23 commits)
  docs/memory-barriers.txt: Update I/O section to be clearer about CPU vs thread
  docs/memory-barriers.txt: Fix style, spacing and grammar in I/O section
  arch: Remove dummy mmiowb() definitions from arch code
  net/ethernet/silan/sc92031: Remove stale comment about mmiowb()
  i40iw: Redefine i40iw_mmiowb() to do nothing
  scsi/qla1280: Remove stale comment about mmiowb()
  drivers: Remove explicit invocations of mmiowb()
  drivers: Remove useless trailing comments from mmiowb() invocations
  Documentation: Kill all references to mmiowb()
  riscv/mmiowb: Hook up mmwiob() implementation to asm-generic code
  powerpc/mmiowb: Hook up mmwiob() implementation to asm-generic code
  ia64/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  mips/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  sh/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  m68k/io: Remove useless definition of mmiowb()
  nds32/io: Remove useless definition of mmiowb()
  x86/io: Remove useless definition of mmiowb()
  arm64/io: Remove useless definition of mmiowb()
  ARM/io: Remove useless definition of mmiowb()
  mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessors
  ...
2019-05-06 16:57:52 -07:00
Linus Torvalds
14be4c61c2 s390 updates for the 5.2 merge window
- Support for kernel address space layout randomization
 
  - Add support for kernel image signature verification
 
  - Convert s390 to the generic get_user_pages_fast code
 
  - Convert s390 to the stack unwind API analog to x86
 
  - Add support for CPU directed interrupts for PCI devices
 
  - Provide support for MIO instructions to the PCI base layer, this
    will allow the use of direct PCI mappings in user space code
 
  - Add the basic KVM guest ultravisor interface for protected VMs
 
  - Add AT_HWCAP bits for several new hardware capabilities
 
  - Update the CPU measurement facility counter definitions to SVN 6
 
  - Arnds cleanup patches for his quest to get LLVM compiles working
 
  - A vfio-ccw update with bug fixes and support for halt and clear
 
  - Improvements for the hardware TRNG code
 
  - Another round of cleanup for the QDIO layer
 
  - Numerous cleanups and bug fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJc0CCEAAoJEDjwexyKj9rgjmkH/A3e2drvuP/hSF3xfCKTQFdx
 /PoLHQVCqENB3HU3FA/ljoXuG6jMgwj61looqlxBNumXFpIfTg0E1JC5S4wRGJ+K
 cOVhIKV53gcuZkRcCJQp0WMnGzpk1Daf7iYXYmAl+7e+mREUPxOuJ0Ei6vXvRGZS
 8cQrUCGrtPgkAeLlndypHI2M2TDDGJIMczOGbOZau8+8Lo7Wq9zt5y0h/v0ew37g
 ogA0eGh6koU1435dt2pclZRiZ1XOcar3Uin9ioT+RnSgJ4pr1Pza/F6IGO0RdQa+
 rva990lqGFp5r9lE4rMCwK9LWb/rfHdVPd35t9XPwphnQ/ORoWUwLk3uc5XOHow=
 =dbuy
 -----END PGP SIGNATURE-----

Merge tag 's390-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Martin Schwidefsky:

 - Support for kernel address space layout randomization

 - Add support for kernel image signature verification

 - Convert s390 to the generic get_user_pages_fast code

 - Convert s390 to the stack unwind API analog to x86

 - Add support for CPU directed interrupts for PCI devices

 - Provide support for MIO instructions to the PCI base layer, this will
   allow the use of direct PCI mappings in user space code

 - Add the basic KVM guest ultravisor interface for protected VMs

 - Add AT_HWCAP bits for several new hardware capabilities

 - Update the CPU measurement facility counter definitions to SVN 6

 - Arnds cleanup patches for his quest to get LLVM compiles working

 - A vfio-ccw update with bug fixes and support for halt and clear

 - Improvements for the hardware TRNG code

 - Another round of cleanup for the QDIO layer

 - Numerous cleanups and bug fixes

* tag 's390-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (98 commits)
  s390/vdso: drop unnecessary cc-ldoption
  s390: fix clang -Wpointer-sign warnigns in boot code
  s390: drop CONFIG_VIRT_TO_BUS
  s390: boot, purgatory: pass $(CLANG_FLAGS) where needed
  s390: only build for new CPUs with clang
  s390: simplify disabled_wait
  s390/ftrace: use HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
  s390/unwind: introduce stack unwind API
  s390/opcodes: add missing instructions to the disassembler
  s390/bug: add entry size to the __bug_table section
  s390: use proper expoline sections for .dma code
  s390/nospec: rename assembler generated expoline thunks
  s390: add missing ENDPROC statements to assembler functions
  locking/lockdep: check for freed initmem in static_obj()
  s390/kernel: add support for kernel address space layout randomization (KASLR)
  s390/kernel: introduce .dma sections
  s390/sclp: do not use static sccbs
  s390/kprobes: use static buffer for insn_page
  s390/kernel: convert SYSCALL and PGM_CHECK handlers to .quad
  s390/kernel: build a relocatable kernel
  ...
2019-05-06 16:42:54 -07:00
Linus Torvalds
0bc40e549a Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "The changes in here are:

   - text_poke() fixes and an extensive set of executability lockdowns,
     to (hopefully) eliminate the last residual circumstances under
     which we are using W|X mappings even temporarily on x86 kernels.
     This required a broad range of surgery in text patching facilities,
     module loading, trampoline handling and other bits.

   - tweak page fault messages to be more informative and more
     structured.

   - remove DISCONTIGMEM support on x86-32 and make SPARSEMEM the
     default.

   - reduce KASLR granularity on 5-level paging kernels from 512 GB to
     1 GB.

   - misc other changes and updates"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/mm: Initialize PGD cache during mm initialization
  x86/alternatives: Add comment about module removal races
  x86/kprobes: Use vmalloc special flag
  x86/ftrace: Use vmalloc special flag
  bpf: Use vmalloc special flag
  modules: Use vmalloc special flag
  mm/vmalloc: Add flag for freeing of special permsissions
  mm/hibernation: Make hibernation handle unmapped pages
  x86/mm/cpa: Add set_direct_map_*() functions
  x86/alternatives: Remove the return value of text_poke_*()
  x86/jump-label: Remove support for custom text poker
  x86/modules: Avoid breaking W^X while loading modules
  x86/kprobes: Set instruction page as executable
  x86/ftrace: Set trampoline pages as executable
  x86/kgdb: Avoid redundant comparison of patched code
  x86/alternatives: Use temporary mm for text poking
  x86/alternatives: Initialize temporary mm for patching
  fork: Provide a function for copying init_mm
  uprobes: Initialize uprobes earlier
  x86/mm: Save debug registers when loading a temporary mm
  ...
2019-05-06 16:13:31 -07:00
Linus Torvalds
a0e928ed7c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "This cycle had the following changes:

   - Timer tracing improvements (Anna-Maria Gleixner)

   - Continued tasklet reduction work: remove the hrtimer_tasklet
     (Thomas Gleixner)

   - Fix CPU hotplug remove race in the tick-broadcast mask handling
     code (Thomas Gleixner)

   - Force upper bound for setting CLOCK_REALTIME, to fix ABI
     inconsistencies with handling values that are close to the maximum
     supported and the vagueness of when uptime related wraparound might
     occur. Make the consistent maximum the year 2232 across all
     relevant ABIs and APIs. (Thomas Gleixner)

   - various cleanups and smaller fixes"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Fix typos in comments
  tick/broadcast: Fix warning about undefined tick_broadcast_oneshot_offline()
  timekeeping: Force upper bound for setting CLOCK_REALTIME
  timer/trace: Improve timer tracing
  timer/trace: Replace deprecated vsprintf pointer extension %pf by %ps
  timer: Move trace point to get proper index
  tick/sched: Update tick_sched struct documentation
  tick: Remove outgoing CPU from broadcast masks
  timekeeping: Consistently use unsigned int for seqcount snapshot
  softirq: Remove tasklet_hrtimer
  xfrm: Replace hrtimer tasklet with softirq hrtimer
  mac80211_hwsim: Replace hrtimer tasklet with softirq hrtimer
2019-05-06 14:50:46 -07:00
Linus Torvalds
5a2bf1abbf Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Ingo Molnar:
 "Two changes in this cycle:

   - Make the /sys/devices/system/cpu/smt/* files available on all
     arches, so user space has a consistent way to detect whether SMT is
     enabled.

   - Sparse annotation fix"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smpboot: Place the __percpu annotation correctly
  cpu/hotplug: Create SMT sysfs interface for all arches
2019-05-06 14:44:49 -07:00
Linus Torvalds
e00d413575 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Make nohz housekeeping processing more permissive and less
     intrusive to isolated CPUs

   - Decouple CPU-bound workqueue acconting from the scheduler and move
     it into the workqueue code.

   - Optimize topology building

   - Better handle quota and period overflows

   - Add more RCU annotations

   - Comment updates, misc cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  nohz_full: Allow the boot CPU to be nohz_full
  sched/isolation: Require a present CPU in housekeeping mask
  kernel/cpu: Allow non-zero CPU to be primary for suspend / kexec freeze
  power/suspend: Add function to disable secondaries for suspend
  sched/core: Allow the remote scheduler tick to be started on CPU0
  sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs
  sched/debug: Fix spelling mistake "logaritmic" -> "logarithmic"
  sched/topology: Update init_sched_domains() comment
  cgroup/cpuset: Update stale generate_sched_domains() comments
  sched/core: Check quota and period overflow at usec to nsec conversion
  sched/core: Handle overflow in cpu_shares_write_u64
  sched/rt: Check integer overflow at usec to nsec conversion
  sched/core: Fix typo in comment
  sched/core: Make some functions static
  sched/core: Unify p->on_rq updates
  sched/core: Remove ttwu_activate()
  sched/core, workqueues: Distangle worker accounting from rq lock
  sched/fair: Remove unneeded prototype of capacity_of()
  sched/topology: Skip duplicate group rewrites in build_sched_groups()
  sched/topology: Fix build_sched_groups() comment
  ...
2019-05-06 14:31:50 -07:00
Linus Torvalds
90489a72fb Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main kernel changes were:

   - add support for Intel's "adaptive PEBS v4" - which embedds LBS data
     in PEBS records and can thus batch up and reduce the IRQ (NMI) rate
     significantly - reducing overhead and making call-graph profiling
     less intrusive.

   - add Intel CPU core and uncore support updates for Tremont, Icelake,

   - extend the x86 PMU constraints scheduler with 'constraint ranges'
     to better support Icelake hw constraints,

   - make x86 call-chain support work better with CONFIG_FRAME_POINTER=y

   - misc other changes

  Tooling changes:

   - updates to the main tools: 'perf record', 'perf trace', 'perf
     stat'

   - updated Intel and S/390 vendor events

   - libtraceevent updates

   - misc other updates and fixes"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER
  watchdog: Fix typo in comment
  perf/x86/intel: Add Tremont core PMU support
  perf/x86/intel/uncore: Add Intel Icelake uncore support
  perf/x86/msr: Add Icelake support
  perf/x86/intel/rapl: Add Icelake support
  perf/x86/intel/cstate: Add Icelake support
  perf/x86/intel: Add Icelake support
  perf/x86: Support constraint ranges
  perf/x86/lbr: Avoid reading the LBRs when adaptive PEBS handles them
  perf/x86/intel: Support adaptive PEBS v4
  perf/x86/intel/ds: Extract code of event update in short period
  perf/x86/intel: Extract memory code PEBS parser for reuse
  perf/x86: Support outputting XMM registers
  perf/x86/intel: Force resched when TFA sysctl is modified
  perf/core: Add perf_pmu_resched() as global function
  perf/headers: Fix stale comment for struct perf_addr_filter
  perf/core: Make perf_swevent_init_cpu() static
  perf/x86: Add sanity checks to x86_schedule_events()
  perf/x86: Optimize x86_schedule_events()
  ...
2019-05-06 14:16:36 -07:00
Linus Torvalds
007dc78fea Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "Here are the locking changes in this cycle:

   - rwsem unification and simpler micro-optimizations to prepare for
     more intrusive (and more lucrative) scalability improvements in
     v5.3 (Waiman Long)

   - Lockdep irq state tracking flag usage cleanups (Frederic
     Weisbecker)

   - static key improvements (Jakub Kicinski, Peter Zijlstra)

   - misc updates, cleanups and smaller fixes"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  locking/lockdep: Remove unnecessary unlikely()
  locking/static_key: Don't take sleeping locks in __static_key_slow_dec_deferred()
  locking/static_key: Factor out the fast path of static_key_slow_dec()
  locking/static_key: Add support for deferred static branches
  locking/lockdep: Test all incompatible scenarios at once in check_irq_usage()
  locking/lockdep: Avoid bogus Clang warning
  locking/lockdep: Generate LOCKF_ bit composites
  locking/lockdep: Use expanded masks on find_usage_*() functions
  locking/lockdep: Map remaining magic numbers to lock usage mask names
  locking/lockdep: Move valid_state() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
  locking/rwsem: Prevent unneeded warning during locking selftest
  locking/rwsem: Optimize rwsem structure for uncontended lock acquisition
  locking/rwsem: Enable lock event counting
  locking/lock_events: Don't show pvqspinlock events on bare metal
  locking/lock_events: Make lock_events available for all archs & other locks
  locking/qspinlock_stat: Introduce generic lockevent_*() counting APIs
  locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro
  locking/rwsem: Add debug check for __down_read*()
  locking/rwsem: Micro-optimize rwsem_try_read_lock_unqueued()
  locking/rwsem: Move rwsem internal function declarations to rwsem-xadd.h
  ...
2019-05-06 13:50:15 -07:00
Linus Torvalds
2f1835dffa Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Ingo Molnar:
 "The changes in this cycle were:

   - Remove the irq timings/variance statistics code that tried to
     predict when the next interrupt would occur, which didn't work out
     as hoped and is replaced by another mechanism.

   - This new mechanism is the 'array suffix computation' estimate,
     which is superior to the previous one as it can detect not just a
     single periodic pattern, but independent periodic patterns along a
     log-2 scale of bucketing and exponential moving average. The
     comments are longer than the code - and it works better at
     predicting various complex interrupt patterns from real-world
     devices than the previous estimate.

   - avoid IRQ-work self-IPIs on the local CPU

   - fix work-list corruption in irq_set_affinity_notifier()"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irq_work: Do not raise an IPI when queueing work on the local CPU
  genirq/devres: Use struct_size() in devm_kzalloc()
  genirq/timings: Add array suffix computation code
  genirq/timings: Remove variance computation code
  genirq: Prevent use-after-free and work list corruption
2019-05-06 13:45:04 -07:00
Linus Torvalds
2c6a392cdd Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stack trace updates from Ingo Molnar:
 "So Thomas looked at the stacktrace code recently and noticed a few
  weirdnesses, and we all know how such stories of crummy kernel code
  meeting German engineering perfection end: a 45-patch series to clean
  it all up! :-)

  Here's the changes in Thomas's words:

   'Struct stack_trace is a sinkhole for input and output parameters
    which is largely pointless for most usage sites. In fact if embedded
    into other data structures it creates indirections and extra storage
    overhead for no benefit.

    Looking at all usage sites makes it clear that they just require an
    interface which is based on a storage array. That array is either on
    stack, global or embedded into some other data structure.

    Some of the stack depot usage sites are outright wrong, but
    fortunately the wrongness just causes more stack being used for
    nothing and does not have functional impact.

    Another oddity is the inconsistent termination of the stack trace
    with ULONG_MAX. It's pointless as the number of entries is what
    determines the length of the stored trace. In fact quite some call
    sites remove the ULONG_MAX marker afterwards with or without nasty
    comments about it. Not all architectures do that and those which do,
    do it inconsistenly either conditional on nr_entries == 0 or
    unconditionally.

    The following series cleans that up by:

      1) Removing the ULONG_MAX termination in the architecture code

      2) Removing the ULONG_MAX fixups at the call sites

      3) Providing plain storage array based interfaces for stacktrace
         and stackdepot.

      4) Cleaning up the mess at the callsites including some related
         cleanups.

      5) Removing the struct stack_trace based interfaces

    This is not changing the struct stack_trace interfaces at the
    architecture level, but it removes the exposure to the generic
    code'"

* 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  x86/stacktrace: Use common infrastructure
  stacktrace: Provide common infrastructure
  lib/stackdepot: Remove obsolete functions
  stacktrace: Remove obsolete functions
  livepatch: Simplify stack trace retrieval
  tracing: Remove the last struct stack_trace usage
  tracing: Simplify stack trace retrieval
  tracing: Make ftrace_trace_userstack() static and conditional
  tracing: Use percpu stack trace buffer more intelligently
  tracing: Simplify stacktrace retrieval in histograms
  lockdep: Simplify stack trace handling
  lockdep: Remove save argument from check_prev_add()
  lockdep: Remove unused trace argument from print_circular_bug()
  drm: Simplify stacktrace handling
  dm persistent data: Simplify stack trace handling
  dm bufio: Simplify stack trace retrieval
  btrfs: ref-verify: Simplify stack trace retrieval
  dma/debug: Simplify stracktrace retrieval
  fault-inject: Simplify stacktrace retrieval
  mm/page_owner: Simplify stack trace handling
  ...
2019-05-06 13:11:48 -07:00
Linus Torvalds
0a499fc5c3 Merge branch 'core-speculation-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull speculation mitigation update from Ingo Molnar:
 "This adds the "mitigations=" bootline option, which offers a
  cross-arch set of options that will work on x86, PowerPC and s390 that
  will map to the arch specific option internally"

* 'core-speculation-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  s390/speculation: Support 'mitigations=' cmdline option
  powerpc/speculation: Support 'mitigations=' cmdline option
  x86/speculation: Support 'mitigations=' cmdline option
  cpu/speculation: Add 'mitigations=' cmdline option
2019-05-06 13:01:16 -07:00
Linus Torvalds
e50c5d2e72 Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Ingo Molnar:
 "A cleanup and a fix to comments"

* 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq: Remove superfluous rseq_len from task_struct
  rseq: Clean up comments by reflecting removal of event counter
2019-05-06 12:46:54 -07:00
Linus Torvalds
5ba2a4b12f Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "This cycles's RCU changes include:

   - a couple of straggling RCU flavor consolidation updates

   - SRCU updates

   - RCU CPU stall-warning updates

   - torture-test updates

   - an LKMM commit adding support for synchronize_srcu_expedited()

   - documentation updates

   - miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  net/ipv4/netfilter: Update comment from call_rcu_bh() to call_rcu()
  tools/memory-model: Add support for synchronize_srcu_expedited()
  doc/kprobes: Update obsolete RCU update functions
  torture: Suppress false-positive CONFIG_INITRAMFS_SOURCE complaint
  locktorture: NULL cxt.lwsa and cxt.lrsa to allow bad-arg detection
  rcuperf: Fix cleanup path for invalid perf_type strings
  rcutorture: Fix cleanup path for invalid torture_type strings
  rcutorture: Fix expected forward progress duration in OOM notifier
  rcutorture: Remove ->ext_irq_conflict field
  rcutorture: Make rcutorture_extend_mask() comment match the code
  tools/.../rcutorture: Convert to SPDX license identifier
  torture: Don't try to offline the last CPU
  rcu: Fix nohz status in stall warning
  rcu: Move forward-progress checkers into tree_stall.h
  rcu: Move irq-disabled stall-warning checking to tree_stall.h
  rcu: Organize functions in tree_stall.h
  rcu: Move FAST_NO_HZ stall-warning code to tree_stall.h
  rcu: Inline RCU stall-warning info helper functions
  rcu: Move rcu_print_task_exp_stall() to tree_exp.h
  rcu: Inline RCU task stall-warning helper functions
  ...
2019-05-06 12:04:02 -07:00
Linus Torvalds
6ec62961e6 Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
 "This is a series from Peter Zijlstra that adds x86 build-time uaccess
  validation of SMAP to objtool, which will detect and warn about the
  following uaccess API usage bugs and weirdnesses:

   - call to %s() with UACCESS enabled
   - return with UACCESS enabled
   - return with UACCESS disabled from a UACCESS-safe function
   - recursive UACCESS enable
   - redundant UACCESS disable
   - UACCESS-safe disables UACCESS

  As it turns out not leaking uaccess permissions outside the intended
  uaccess functionality is hard when the interfaces are complex and when
  such bugs are mostly dormant.

  As a bonus we now also check the DF flag. We had at least one
  high-profile bug in that area in the early days of Linux, and the
  checking is fairly simple. The checks performed and warnings emitted
  are:

   - call to %s() with DF set
   - return with DF set
   - return with modified stack frame
   - recursive STD
   - redundant CLD

  It's all x86-only for now, but later on this can also be used for PAN
  on ARM and objtool is fairly cross-platform in principle.

  While all warnings emitted by this new checking facility that got
  reported to us were fixed, there might be GCC version dependent
  warnings that were not reported yet - which we'll address, should they
  trigger.

  The warnings are non-fatal build warnings"

* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  mm/uaccess: Use 'unsigned long' to placate UBSAN warnings on older GCC versions
  x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation
  sched/x86_64: Don't save flags on context switch
  objtool: Add Direction Flag validation
  objtool: Add UACCESS validation
  objtool: Fix sibling call detection
  objtool: Rewrite alt->skip_orig
  objtool: Add --backtrace support
  objtool: Rewrite add_ignores()
  objtool: Handle function aliases
  objtool: Set insn->func for alternatives
  x86/uaccess, kcov: Disable stack protector
  x86/uaccess, ftrace: Fix ftrace_likely_update() vs. SMAP
  x86/uaccess, ubsan: Fix UBSAN vs. SMAP
  x86/uaccess, kasan: Fix KASAN vs SMAP
  x86/smap: Ditch __stringify()
  x86/uaccess: Introduce user_access_{save,restore}()
  x86/uaccess, signal: Fix AC=1 bloat
  x86/uaccess: Always inline user_access_begin()
  x86/uaccess, xen: Suppress SMAP warnings
  ...
2019-05-06 11:39:17 -07:00
Linus Torvalds
171c2bcbcb Merge branch 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull unified TLB flushing from Ingo Molnar:
 "This contains the generic mmu_gather feature from Peter Zijlstra,
  which is an all-arch unification of TLB flushing APIs, via the
  following (broad) steps:

   - enhance the <asm-generic/tlb.h> APIs to cover more arch details

   - convert most TLB flushing arch implementations to the generic
     <asm-generic/tlb.h> APIs.

   - remove leftovers of per arch implementations

  After this series every single architecture makes use of the unified
  TLB flushing APIs"

* 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm/resource: Use resource_overlaps() to simplify region_intersects()
  ia64/tlb: Eradicate tlb_migrate_finish() callback
  asm-generic/tlb: Remove tlb_table_flush()
  asm-generic/tlb: Remove tlb_flush_mmu_free()
  asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER
  asm-generic/tlb: Remove arch_tlb*_mmu()
  s390/tlb: Convert to generic mmu_gather
  asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y
  arch/tlb: Clean up simple architectures
  um/tlb: Convert to generic mmu_gather
  sh/tlb: Convert SH to generic mmu_gather
  ia64/tlb: Convert to generic mmu_gather
  arm/tlb: Convert to generic mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  asm-generic/tlb, ia64: Conditionally provide tlb_migrate_finish()
  asm-generic/tlb: Provide generic tlb_flush() based on flush_tlb_mm()
  asm-generic/tlb, arch: Provide generic tlb_flush() based on flush_tlb_range()
  asm-generic/tlb, arch: Provide generic VIPT cache flush
  asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
  asm-generic/tlb: Provide a comment
2019-05-06 11:36:58 -07:00
Fuqian Huang
1900da520c kernel: cgroup: fix misuse of %x
Pointers should be printed with %p or %px rather than
cast to unsigned long type and printed with %lx.
Change %lx to %p to print the pointers.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-06 08:47:48 -07:00
Roman Gushchin
96b9c592de cgroup: get rid of cgroup_freezer_frozen_exit()
A task should never enter the exit path with the task->frozen bit set.
Any frozen task must enter the signal handling loop and the only
way to escape is through cgroup_leave_frozen(true), which
unconditionally drops the task->frozen bit. So it means that
cgroyp_freezer_frozen_exit() has zero chances to be called and
has to be removed.

Let's put a WARN_ON_ONCE() instead of the cgroup_freezer_frozen_exit()
call to catch any potential leak of the task's frozen bit.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-06 08:39:11 -07:00
Roman Gushchin
cb2c4cd878 cgroup: prevent spurious transition into non-frozen state
If freezing of a cgroup races with waking of a task from
the frozen state (like waiting in vfork() or in do_signal_stop()),
a spurious transition of the cgroup state can happen.

The task enters cgroup_leave_frozen(true), the cgroup->nr_frozen_tasks
counter decrements, and the cgroup is switched to the unfrozen state.

To prevent it, let's reserve cgroup_leave_frozen(true) for
terminating processes and use cgroup_leave_frozen(false) otherwise.

To avoid busy-looping in the signal handling loop waiting
for JOBCTL_TRAP_FREEZE set from the cgroup freezing path,
let's do it explicitly in cgroup_leave_frozen(), if the task
is going to stay frozen.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-06 08:39:06 -07:00
Shaokun Zhang
533307dc20 cgroup: Remove unused cgrp variable
The 'cgrp' is set but not used in commit <76f969e8948d8>
("cgroup: cgroup v2 freezer").
Remove it to avoid [-Wunused-but-set-variable] warning.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-05-06 08:33:45 -07:00
Christoph Hellwig
13bf5ced93 dma-mapping: add a Kconfig symbol to indicate arch_dma_prep_coherent presence
Add a Kconfig symbol that indicates an architecture provides a
arch_dma_prep_coherent implementation, and provide a stub otherwise.

This will allow the generic dma-iommu code to use it while still
allowing to be built for cache coherent architectures.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2019-05-06 15:04:40 +02:00
Thomas Gleixner
fb4e059265 irqchip updates for 5.2
- The huge (and terrifying) TI INTR/INTA set of drivers
 - Rewrite of the stm32mp1-exti driver as a platform driver
 - Update the IOMMU MSI mapping API to be RT friendly
 - A number of cleanups and other low impact fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlzMUswVHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDOUUQALpVvlxPA2jXmlYTZ9/1I4ehSv78
 SFHsLHybvbdt0+zZDQE6wIS/OWc8fi8LRHtUCkNOygWMk8Ae8D8IeC6VwbPxhbQa
 RNvIULOC7RqapcWXmF3pYz+tfJehLPSmfTG3yGczXyMHSF6ypHKxX8+RIr6ZCeqH
 U2XVa7ELhWxeTuA52oxEiGymPGBZTEcMGcgGzhRxmDE9GVRw5XWrnKsG0nNCklMK
 DSKPoUtmKny42LIfVAFgS2PD081IRRQZII/j/mbA/NG5KNNQutmkZLJgFywjeJxu
 AOhYHmBiqyEfQs5S02i0GXMUITBvWsi9dPgjnwd2aRsPQ8nzXx2mUH4FQz9fqmHt
 ZxGsE1rLYlB0BS2h1Ap4bpHDjsgx/MvPYdarkN966/T6DHM6UYfdx9wnkVtH3oLJ
 Lg4UT+MWbz2f70JJ1jy5NWMDLVEyL7ERoZLivhxouXqAuGK7x3BUFPXXwTbep7Me
 E+N45kYycryCUjrt12EQ7PX/1W7KXe9Z7UX63VUsGmCxaPDgZ5T/ofybD4vJEIQ0
 fyaOK9jxBuTeVokFecCMxacWfRtQ3trLljlEJY5/1NZMXEmLwFbo0sy7458zIoss
 aTV+bk+N+xrGERRpswjGhhdRxjM41EcoiuJs52L9EL1IB/50ye6ENOPtGIDfuWXT
 XEm5jXGM4TVTOkf/
 =vP2/
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier

- The huge (and terrifying) TI INTR/INTA set of drivers
- Rewrite of the stm32mp1-exti driver as a platform driver
- Update the IOMMU MSI mapping API to be RT friendly
- A number of cleanups and other low impact fixes
2019-05-06 12:04:12 +02:00
Rafael J. Wysocki
78baa1ea58 Merge branches 'pm-cpuidle' and 'pm-sleep'
* pm-cpuidle:
  PM / Domains: Add genpd governor for CPUs
  cpuidle: Export the next timer expiration for CPUs
  PM / Domains: Add support for CPU devices to genpd
  PM / Domains: Add generic data pointer to struct genpd_power_state
  cpuidle: exynos: Unify target residency for AFTR and coupled AFTR states

* pm-sleep:
  PM / core: Propagate dev->power.wakeup_path when no callbacks
  PM / core: Introduce dpm_async_fn() helper
  PM / core: fix kerneldoc comment for device_pm_wait_for_dev()
  PM / core: fix kerneldoc comment for dpm_watchdog_handler()
  PM / sleep: Measure the time of filesystems syncing
  PM / sleep: Refactor filesystems sync to reduce duplication
  PM / wakeup: Use pm_pr_dbg() instead of pr_debug()
2019-05-06 10:54:43 +02:00
Rafael J. Wysocki
7d4a27c1c8 Merge branch 'pm-cpufreq'
* pm-cpufreq: (24 commits)
  cpufreq: Fix kobject memleak
  cpufreq: armada-37xx: fix frequency calculation for opp
  cpufreq: centrino: Fix centrino_setpolicy() kerneldoc comment
  cpufreq: qoriq: add support for lx2160a
  cpufreq: qoriq: Add ls1028a chip support
  cpufreq: Move ->get callback check outside of __cpufreq_get()
  cpufreq: Remove needless bios_limit check in show_bios_limit()
  drivers/cpufreq/acpi-cpufreq.c: This fixes the following checkpatch warning
  cpufreq: boost: Remove CONFIG_CPU_FREQ_BOOST_SW Kconfig option
  cpufreq: stats: Use lock by stat to replace global spin lock
  cpufreq: Remove cpufreq_driver check in cpufreq_boost_supported()
  cpufreq: maple: Remove redundant code from maple_cpufreq_init()
  cpufreq: ppc_cbe: fix possible object reference leak
  cpufreq: pmac32: fix possible object reference leak
  cpufreq/pasemi: fix possible object reference leak
  cpufreq: maple: fix possible object reference leak
  cpufreq: kirkwood: fix possible object reference leak
  cpufreq: imx6q: fix possible object reference leak
  cpufreq: ap806: fix possible object reference leak
  drivers/cpufreq: Convert some slow-path static_cpu_has() callers to boot_cpu_has()
  ...
2019-05-06 10:54:27 +02:00
Linus Torvalds
7178fb0b23 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "I'd like to apologize for this very late pull request: I was dithering
  through the week whether to send the fixes, and then yesterday Jiri's
  crash fix for a regression introduced in this cycle clearly marked
  perf/urgent as 'must merge now'.

  Most of the commits are tooling fixes, plus there's three kernel fixes
  via four commits:

    - race fix in the Intel PEBS code

    - fix an AUX bug and roll back a previous attempt

    - fix AMD family 17h generic HW cache-event perf counters

  The largest diffstat contribution comes from the AMD fix - a new event
  table is introduced, which is a fairly low risk change but has a large
  linecount"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix race in intel_pmu_disable_event()
  perf/x86/intel/pt: Remove software double buffering PMU capability
  perf/ring_buffer: Fix AUX software double buffering
  perf tools: Remove needless asm/unistd.h include fixing build in some places
  tools arch uapi: Copy missing unistd.h headers for arc, hexagon and riscv
  tools build: Add -ldl to the disassembler-four-args feature test
  perf cs-etm: Always allocate memory for cs_etm_queue::prev_packet
  perf cs-etm: Don't check cs_etm_queue::prev_packet validity
  perf report: Report OOM in status line in the GTK UI
  perf bench numa: Add define for RUSAGE_THREAD if not present
  tools lib traceevent: Change tag string for error
  perf annotate: Fix build on 32 bit for BPF annotation
  tools uapi x86: Sync vmx.h with the kernel
  perf bpf: Return value with unlocking in perf_env__find_btf()
  MAINTAINERS: Include vendor specific files under arch/*/events/*
  perf/x86/amd: Update generic hardware cache events for Family 17h
2019-05-05 14:37:25 -07:00
Linus Torvalds
70c9fb570b Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a kobject memory leak in the cpufreq code"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cpufreq: Fix kobject memleak
2019-05-05 14:28:48 -07:00
Petr Mladek
f68d67cf2f livepatch: Remove duplicated code for early initialization
kobject_init() call added one more operation that has to be
done when doing the early initialization of both static and
dynamic livepatch structures.

It would have been easier when the early initialization code
was not duplicated. Let's deduplicate it for future generations
of livepatching hackers.

The patch does not change the existing behavior.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-05-03 21:11:23 +02:00
Petr Mladek
4d141ab341 livepatch: Remove custom kobject state handling
kobject_init() always succeeds and sets the reference count to 1.
It allows to always free the structures via kobject_put() and
the related release callback.

Note that the custom kobject state handling was used only
because we did not know that kobject_put() can and actually
should get called even when kobject_init_and_add() fails.

The patch should not change the existing behavior.

Suggested-by: "Tobin C. Harding" <tobin@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2019-05-03 21:11:22 +02:00
Nicholas Piggin
08ae95f4fd nohz_full: Allow the boot CPU to be nohz_full
Allow the boot CPU/CPU0 to be nohz_full. Have the boot CPU take the
do_timer duty during boot until a housekeeping CPU can take over.

This is supported when CONFIG_PM_SLEEP_SMP is not configured, or when
it is configured and the arch allows suspend on non-zero CPUs.

nohz_full has been trialed at a large supercomputer site and found to
significantly reduce jitter. In order to deploy it in production, they
need CPU0 to be nohz_full because their job control system requires
the application CPUs to start from 0, and the housekeeping CPUs are
placed higher. An equivalent job scheduling that uses CPU0 for
housekeeping could be achieved by modifying their system, but it is
preferable if nohz_full can support their environment without
modification.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20190411033448.20842-6-npiggin@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 19:42:58 +02:00
Nicholas Piggin
9219565aa8 sched/isolation: Require a present CPU in housekeeping mask
During housekeeping mask setup, currently a possible CPU is required.
That does not guarantee the CPU would be available at boot time, so
check to ensure that at least one present CPU is in the mask.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20190411033448.20842-5-npiggin@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 19:42:58 +02:00
Nicholas Piggin
9ca12ac04b kernel/cpu: Allow non-zero CPU to be primary for suspend / kexec freeze
This patch provides an arch option, ARCH_SUSPEND_NONZERO_CPU, to
opt-in to allowing suspend to occur on one of the housekeeping CPUs
rather than hardcoded CPU0.

This will allow CPU0 to be a nohz_full CPU with a later change.

It may be possible for platforms with hardware/firmware restrictions
on suspend/wake effectively support this by handing off the final
stage to CPU0 when kernel housekeeping is no longer required. Another
option is to make housekeeping / nohz_full mask dynamic at runtime,
but the complexity could not be justified at this time.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20190411033448.20842-4-npiggin@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 19:42:58 +02:00
Nicholas Piggin
2f1a6fbbef power/suspend: Add function to disable secondaries for suspend
This adds a function to disable secondary CPUs for suspend that are
not necessarily non-zero / non-boot CPUs. Platforms will be able to
use this to suspend using non-zero CPUs.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20190411033448.20842-3-npiggin@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 19:42:41 +02:00
Julien Grall
aaebdf8d68 genirq/msi: Add a new field in msi_desc to store an IOMMU cookie
When an MSI doorbell is located downstream of an IOMMU, it is required
to swizzle the physical address with an appropriately-mapped IOVA for any
device attached to one of our DMA ops domain.

At the moment, the allocation of the mapping may be done when composing
the message. However, the composing may be done in non-preemtible
context while the allocation requires to be called from preemptible
context.

A follow-up change will split the current logic in two functions
requiring to keep an IOMMU cookie per MSI.

A new field is introduced in msi_desc to store an IOMMU cookie. As the
cookie may not be required in some configuration, the field is protected
under a new config CONFIG_IRQ_MSI_IOMMU.

A pair of helpers has also been introduced to access the field.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2019-05-03 15:19:20 +01:00
Nicholas Piggin
77a5352ba9 sched/core: Allow the remote scheduler tick to be started on CPU0
This has no effect yet because CPU0 will always be a housekeeping CPU
until a later change.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev@lists.ozlabs.org
Link: https://lkml.kernel.org/r/20190411033448.20842-2-npiggin@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 12:53:14 +02:00
Ingo Molnar
176d2323c7 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 12:52:45 +02:00
Alexander Shishkin
26ae4f4406 perf/ring_buffer: Fix AUX software double buffering
This recent commit:

  5768402fd9 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically")

overlooked the fact that the previous one page granularity of the AUX buffer
provided an implicit double buffering capability to the PMU driver, which
went away when the entire buffer became one high-order page.

Always make the full-trace mode AUX allocation at least two-part to preserve
the previous behavior and allow the implicit double buffering to continue.

Reported-by: Ammy Yi <ammy.yi@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Fixes: 5768402fd9 ("perf/ring_buffer: Use high order allocations for AUX buffers optimistically")
Link: http://lkml.kernel.org/r/20190503085536.24119-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-03 12:46:10 +02:00
David S. Miller
ff24e4980a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three trivial overlapping conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 22:14:21 -04:00
Douglas Anderson
03197fc02b tracing: kdb: Allow ftdump to skip all but the last few entries
The 'ftdump' command in kdb is currently a bit of a last resort, at
least if you have lots of traces turned on.  It's going to print a
whole boatload of data out your serial port which is probably running
at 115200.  This could easily take many, many minutes.

Usually you're most interested in what's at the _end_ of the ftrace
buffer, AKA what happened most recently.  That means you've got to
wait the full time for the dump.  The 'ftdump' command does attempt to
help you a little bit by allowing you to skip a fixed number of
entries.  Unfortunately it provides no way for you to know how many
entries you should skip.

Let's do similar to python and allow you to use a negative number to
indicate that you want to skip all entries except the last few.  This
allows you to quickly see what you want.

Note that we also change the printout in ftdump to print the
(positive) number of entries actually skipped since that could be
helpful to know when you've specified a negative skip count.

Link: http://lkml.kernel.org/r/20190319171206.97107-3-dianders@chromium.org

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-02 21:32:55 -04:00
Douglas Anderson
ecffc8a8c7 tracing: Add trace_total_entries() / trace_total_entries_cpu()
These two new exported functions will be used in a future patch by
kdb_ftdump() to quickly skip all but the last few trace entries.

Link: http://lkml.kernel.org/r/20190319171206.97107-2-dianders@chromium.org

Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-02 21:32:31 -04:00
Douglas Anderson
dbfe67334a tracing: kdb: The skip_lines parameter should have been skip_entries
The things skipped by kdb's "ftdump" command when you pass it a
parameter has always been entries, not lines.  The difference usually
doesn't matter but when the trace buffer has multi-line entries (like
a stack dump) it can matter.

Let's fix this both in the help text for ftdump and also in the local
variable names.

Link: http://lkml.kernel.org/r/20190319171206.97107-1-dianders@chromium.org

Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-02 21:31:19 -04:00
Linus Torvalds
ea9866793d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Out of bounds access in xfrm IPSEC policy unlink, from Yue Haibing.

 2) Missing length check for esp4 UDP encap, from Sabrina Dubroca.

 3) Fix byte order of RX STBC access in mac80211, from Johannes Berg.

 4) Inifnite loop in bpftool map create, from Alban Crequy.

 5) Register mark fix in ebpf verifier after pkt/null checks, from Paul
    Chaignon.

 6) Properly use rcu_dereference_sk_user_data in L2TP code, from Eric
    Dumazet.

 7) Buffer overrun in marvell phy driver, from Andrew Lunn.

 8) Several crash and statistics handling fixes to bnxt_en driver, from
    Michael Chan and Vasundhara Volam.

 9) Several fixes to the TLS layer from Jakub Kicinski (copying negative
    amounts of data in reencrypt, reencrypt frag copying, blind nskb->sk
    NULL deref, etc).

10) Several UDP GRO fixes, from Paolo Abeni and Eric Dumazet.

11) PID/UID checks on ipv6 flow labels are inverted, from Willem de
    Bruijn.

12) Use after free in l2tp, from Eric Dumazet.

13) IPV6 route destroy races, also from Eric Dumazet.

14) SCTP state machine can erroneously run recursively, fix from Xin
    Long.

15) Adjust AF_PACKET msg_name length checks, add padding bytes if
    necessary. From Willem de Bruijn.

16) Preserve skb_iif, so that forwarded packets have consistent values
    even if fragmentation is involved. From Shmulik Ladkani.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (69 commits)
  udp: fix GRO packet of death
  ipv6: A few fixes on dereferencing rt->from
  rds: ib: force endiannes annotation
  selftests: fib_rule_tests: print the result and return 1 if any tests failed
  ipv4: ip_do_fragment: Preserve skb_iif during fragmentation
  net/tls: avoid NULL pointer deref on nskb->sk in fallback
  selftests: fib_rule_tests: Fix icmp proto with ipv6
  packet: validate msg_namelen in send directly
  packet: in recvmsg msg_name return at least sizeof sockaddr_ll
  sctp: avoid running the sctp state machine recursively
  stmmac: pci: Fix typo in IOT2000 comment
  Documentation: fix netdev-FAQ.rst markup warning
  ipv6: fix races in ip6_dst_destroy()
  l2ip: fix possible use-after-free
  appletalk: Set error code if register_snap_client failed
  net: dsa: bcm_sf2: fix buffer overflow doing set_rxnfc
  rxrpc: Fix net namespace cleanup
  ipv6/flowlabel: wait rcu grace period before put_pid()
  vrf: Use orig netdev to count Ip6InNoRoutes and a fresh route lookup when sending dest unreach
  tcp: add sanity tests in tcp_add_backlog()
  ...
2019-05-02 11:03:34 -07:00
Gustavo A. R. Silva
9b555c4d78 kdb: kdb_support: replace strcpy() by strscpy()
The strcpy() function is being deprecated. Replace it by the safer
strscpy() and fix the following Coverity warning:

"You might overrun the 129-character fixed-size string ks_namebuf
by copying name without checking the length."

Addresses-Coverity-ID: 138995 ("Copy into fixed size buffer")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-05-02 13:42:01 +01:00
Gustavo A. R. Silva
4cc168eaf3 gdbstub: Replace strcpy() by strscpy()
The strcpy() function is being deprecated. Replace it by the safer
strscpy() and fix the following Coverity warning:

"You might overrun the 1024-character fixed-size string remcom_in_buffer
by copying cmd without checking the length."

Addresses-Coverity-ID: 138999 ("Copy into fixed size buffer")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-05-02 13:40:27 +01:00
Gustavo A. R. Silva
a5d5092c92 gdbstub: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

This patch fixes the following warnings:

kernel/debug/gdbstub.c: In function ‘gdb_serial_stub’:
kernel/debug/gdbstub.c:1031:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (remcom_in_buffer[1] == '\0') {
       ^
kernel/debug/gdbstub.c:1036:3: note: here
   case 'C': /* Exception passing */
   ^~~~
kernel/debug/gdbstub.c:1040:7: warning: this statement may fall through [-Wimplicit-fallthrough=]
    if (tmp == 0)
       ^
kernel/debug/gdbstub.c:1043:3: note: here
   case 'c': /* Continue packet */
   ^~~~
kernel/debug/gdbstub.c:1050:4: warning: this statement may fall through [-Wimplicit-fallthrough=]
    dbg_activate_sw_breakpoints();
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/debug/gdbstub.c:1052:3: note: here
   default:
   ^~~~~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

Notice that, in this particular case, the code comment is modified
in accordance with what GCC is expecting to find.

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2019-05-02 13:38:04 +01:00
Martin Schwidefsky
98587c2d89 s390: simplify disabled_wait
The disabled_wait() function uses its argument as the PSW address when
it stops the CPU with a wait PSW that is disabled for interrupts.
The different callers sometimes use a specific number like 0xdeadbeef
to indicate a specific failure, the early boot code uses 0 and some
other calls sites use __builtin_return_address(0).

At the time a dump is created the current PSW and the registers of a
CPU are written to lowcore to make them avaiable to the dump analysis
tool. For a CPU stopped with disabled_wait the PSW and the registers
do not really make sense together, the PSW address does not point to
the function the registers belong to.

Simplify disabled_wait() by using _THIS_IP_ for the PSW address and
drop the argument to the function.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2019-05-02 13:54:11 +02:00
Al Viro
524845ff9c bpf: switch to ->free_inode()
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-01 22:43:26 -04:00
Will Deacon
50abbe1962 Merge branch 'for-next/mitigations' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into for-next/core 2019-05-01 15:34:56 +01:00
Lokesh Vutla
2bd1298ac1 genirq: Introduce irq_chip_{request,release}_resource_parent() apis
Introduce irq_chip_{request,release}_resource_parent() apis so
that these can be used in hierarchical irqchips.

Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2019-05-01 10:41:38 +01:00
Rick Edgecombe
d53d2f78ce bpf: Use vmalloc special flag
Use new flag VM_FLUSH_RESET_PERMS for handling freeing of special
permissioned memory in vmalloc and remove places where memory was set RW
before freeing which is no longer needed. Don't track if the memory is RO
anymore because it is now tracked in vmalloc.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-19-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:59 +02:00
Rick Edgecombe
1a7b7d9220 modules: Use vmalloc special flag
Use new flag for handling freeing of special permissioned memory in vmalloc
and remove places where memory was set RW before freeing which is no longer
needed.

Since freeing of VM_FLUSH_RESET_PERMS memory is not supported in an
interrupt by vmalloc, the freeing of init sections is moved to a work
queue. Instead of call_rcu it now uses synchronize_rcu() in the work
queue.

Lastly, there is now a WARN_ON in module_memfree since it should not be
called in an interrupt with special memory as is required for
VM_FLUSH_RESET_PERMS.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-18-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:58 +02:00
Rick Edgecombe
d633269286 mm/hibernation: Make hibernation handle unmapped pages
Make hibernate handle unmapped pages on the direct map when
CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting pages
to invalid configurations, so now hibernate should check if the pages have
valid mappings and handle if they are unmapped when doing a hibernate
save operation.

Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y
was configured. It does not appear to have a big hibernating performance
impact. The speed of the saving operation before this change was measured
as 819.02 MB/s, and after was measured at 813.32 MB/s.

Before:
[    4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)

After:
[    4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:57 +02:00
Nadav Amit
f2c65fb322 x86/modules: Avoid breaking W^X while loading modules
When modules and BPF filters are loaded, there is a time window in
which some memory is both writable and executable. An attacker that has
already found another vulnerability (e.g., a dangling pointer) might be
able to exploit this behavior to overwrite kernel code. Prevent having
writable executable PTEs in this stage.

In addition, avoiding having W+X mappings can also slightly simplify the
patching of modules code on initialization (e.g., by alternatives and
static-key), as would be done in the next patch. This was actually the
main motivation for this patch.

To avoid having W+X mappings, set them initially as RW (NX) and after
they are set as RO set them as X as well. Setting them as executable is
done as a separate step to avoid one core in which the old PTE is cached
(hence writable), and another which sees the updated PTE (executable),
which would break the W^X protection.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20190426001143.4983-12-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:55 +02:00
Nadav Amit
13585fa066 fork: Provide a function for copying init_mm
Provide a function for copying init_mm. This function will be later used
for setting a temporary mm.

Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-6-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:51 +02:00
Nadav Amit
aad42dd44d uprobes: Initialize uprobes earlier
In order to have a separate address space for text poking, we need to
duplicate init_mm early during start_kernel(). This, however, introduces
a problem since uprobes functions are called from dup_mmap(), but
uprobes is still not initialized in this early stage.

Since uprobes initialization is necassary for fork, and since all the
dependant initialization has been done when fork is initialized (percpu
and vmalloc), move uprobes initialization to fork_init(). It does not
seem uprobes introduces any security problem for the poking_mm.

Crash and burn if uprobes initialization fails, similarly to other early
initializations. Change the init_probes() name to probes_init() to match
other early initialization functions name convention.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: ard.biesheuvel@linaro.org
Cc: deneen.t.dock@intel.com
Cc: kernel-hardening@lists.openwall.com
Cc: kristen@linux.intel.com
Cc: linux_dti@icloud.com
Cc: will.deacon@arm.com
Link: https://lkml.kernel.org/r/20190426232303.28381-6-nadav.amit@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:51 +02:00
Nadav Amit
c7b6f29b62 bpf: Fail bpf_probe_write_user() while mm is switched
When using a temporary mm, bpf_probe_write_user() should not be able to
write to user memory, since user memory addresses may be used to map
kernel memory.  Detect these cases and fail bpf_probe_write_user() in
such cases.

Suggested-by: Jann Horn <jannh@google.com>
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <akpm@linux-foundation.org>
Cc: <ard.biesheuvel@linaro.org>
Cc: <deneen.t.dock@intel.com>
Cc: <kernel-hardening@lists.openwall.com>
Cc: <kristen@linux.intel.com>
Cc: <linux_dti@icloud.com>
Cc: <will.deacon@arm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190426001143.4983-24-namit@vmware.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 12:37:48 +02:00
Tobin C. Harding
9a4f26cc98 sched/cpufreq: Fix kobject memleak
Currently the error return path from kobject_init_and_add() is not
followed by a call to kobject_put() - which means we are leaking
the kobject.

Fix it by adding a call to kobject_put() in the error path of
kobject_init_and_add().

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-30 07:57:23 +02:00
Linus Torvalds
83a50840e7 seccomp use-after-free fix
- Add logic for making some seccomp flags exclusive (Tycho)
 - Update selftests for exclusivity testing (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlzHVl0WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJqZuD/wK/PccncrPcBVtyFwWVVPj1HaM
 97icUgcbzC2mgpGmIDj5lZwpzXjvSlvkLenwcX+QEO0BfRbomUtcFqiMo3GMsHE3
 JMJDQ4r+eQLZX2r/f0rgJ+yS80DzpgF4PjLbC2kcDXdVTNUBetafwq4tfP1wEYbE
 Fumw64hjJidvahKUlJh94xQzatBFSA6gzPcWCn6VbFKDIQ/Zu1zMvWPxsVqOEAol
 rNSW5qFlxHI35znMg2/5tfZ8Z9bbemYcYDwlWwCZkNcoRBfs5rpgFhYuE5o5qYZT
 ndQQnfv24HoH0Q1zMq67uLdcPwVzg8VQjKQiZr9QWhKfSsFi8mtd00/yvqm9z/Hy
 1gwHv6bSzmfNyPYoFCTHKrMutUKy9aUHBdPjXdjOOD6V30QWbCETUHQ+Ipkq7qCm
 YbIhJL/FRHF2BAFU7uT+2/xob9JGD80n5nYZtZDdBx0zgDZb5xTuSN8fi8jVf+Ye
 so6Zwu64OdcAt+AGIl0Q3f+bCBYnjLF1Ec14TfJgOZAuw1fdsi8uAsFBV+aHu7tP
 SsDqDLCcY6p98x7AlFpEf4pN4oIC7kWOMFdJH7dK9pNeh4Q6Omf0vpHY6tAxC8yX
 LsFcimfKgJnlGPoqLN04Aq3K5Qj55lcpNv8RbQ5YuKujzhHH3/yltNCWSR59TFsz
 anZKkfzZckEdoJ9vSg==
 =12Pp
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fixes from Kees Cook:
 "Syzbot found a use-after-free bug in seccomp due to flags that should
  not be allowed to be used together.

  Tycho fixed this, I updated the self-tests, and the syzkaller PoC has
  been running for several days without triggering KASan (before this
  fix, it would reproduce). These patches have also been in -next for
  almost a week, just to be sure.

   - Add logic for making some seccomp flags exclusive (Tycho)

   - Update selftests for exclusivity testing (Kees)"

* tag 'seccomp-v5.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: Make NEW_LISTENER and TSYNC flags exclusive
  selftests/seccomp: Prepare for exclusive seccomp flags
2019-04-29 13:24:34 -07:00
Joel Fernandes (Google)
43d8ce9d65 Provide in-kernel headers to make extending kernel easier
Introduce in-kernel headers which are made available as an archive
through proc (/proc/kheaders.tar.xz file). This archive makes it
possible to run eBPF and other tracing programs that need to extend the
kernel for tracing purposes without any dependency on the file system
having headers.

A github PR is sent for the corresponding BCC patch at:
https://github.com/iovisor/bcc/pull/2312

On Android and embedded systems, it is common to switch kernels but not
have kernel headers available on the file system. Further once a
different kernel is booted, any headers stored on the file system will
no longer be useful. This is an issue even well known to distros.
By storing the headers as a compressed archive within the kernel, we can
avoid these issues that have been a hindrance for a long time.

The best way to use this feature is by building it in. Several users
have a need for this, when they switch debug kernels, they do not want to
update the filesystem or worry about it where to store the headers on
it. However, the feature is also buildable as a module in case the user
desires it not being part of the kernel image. This makes it possible to
load and unload the headers from memory on demand. A tracing program can
load the module, do its operations, and then unload the module to save
kernel memory. The total memory needed is 3.3MB.

By having the archive available at a fixed location independent of
filesystem dependencies and conventions, all debugging tools can
directly refer to the fixed location for the archive, without concerning
with where the headers on a typical filesystem which significantly
simplifies tooling that needs kernel headers.

The code to read the headers is based on /proc/config.gz code and uses
the same technique to embed the headers.

Other approaches were discussed such as having an in-memory mountable
filesystem, but that has drawbacks such as requiring an in-kernel xz
decompressor which we don't have today, and requiring usage of 42 MB of
kernel memory to host the decompressed headers at anytime. Also this
approach is simpler than such approaches.

Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-29 16:48:03 +02:00
Julien Grall
08970ecf74 irq/irqdomain: Fix typo in the comment on top of __irq_domain_alloc_irqs()
The word 'number' has been misspelt in the comment on top of
_irq_domain_alloc_irqs().

Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2019-04-29 15:45:00 +01:00
zhengbin
d671002be6 locking/lockdep: Remove unnecessary unlikely()
DEBUG_LOCKS_WARN_ON() already contains an unlikely(), there is no need
for another one.

Signed-off-by: zhengbin <zhengbin13@huawei.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: houtao1@huawei.com
Link: http://lkml.kernel.org/r/1556540791-23110-1-git-send-email-zhengbin13@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-29 16:11:01 +02:00
Petr Mladek
31adf2308f livepatch: Convert error about unsupported reliable stacktrace into a warning
The commit d0807da78e ("livepatch: Remove immediate feature") caused
that any livepatch was refused when reliable stacktraces were not supported
on the given architecture.

The limitation is too strong. User space processes are safely migrated
even when entering or leaving the kernel. Kthreads transition would
need to get forced. But it is safe when:

   + The livepatch does not change the semantic of the code.
   + Callbacks do not depend on a safely finished transition.

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-04-29 14:46:07 +02:00
Thomas Gleixner
214d8ca6ee stacktrace: Provide common infrastructure
All architectures which support stacktrace carry duplicated code and
do the stack storage and filtering at the architecture side.

Provide a consolidated interface with a callback function for consuming the
stack entries provided by the architecture specific stack walker. This
removes lots of duplicated code and allows to implement better filtering
than 'skip number of entries' in the future without touching any
architecture specific code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linux-arch@vger.kernel.org
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/20190425094803.713568606@linutronix.de
2019-04-29 12:37:57 +02:00
Thomas Gleixner
988ec8841c stacktrace: Remove obsolete functions
No more users of the struct stack_trace based interfaces. Remove them.

Remove the macro stubs for !CONFIG_STACKTRACE as well as they are pointless
because the storage on the call sites is conditional on CONFIG_STACKTRACE
already. No point to be 'smart'.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.524796783@linutronix.de
2019-04-29 12:37:57 +02:00
Thomas Gleixner
25e39e32b0 livepatch: Simplify stack trace retrieval
Replace the indirection through struct stack_trace by using the storage
array based interfaces.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.437950229@linutronix.de
2019-04-29 12:37:56 +02:00
Thomas Gleixner
9f50c91b11 tracing: Remove the last struct stack_trace usage
Simplify the stack retrieval code by using the storage array based
interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.340000461@linutronix.de
2019-04-29 12:37:56 +02:00
Thomas Gleixner
ee6dd0db4d tracing: Simplify stack trace retrieval
Replace the indirection through struct stack_trace by using the storage
array based interfaces.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.248604594@linutronix.de
2019-04-29 12:37:55 +02:00
Thomas Gleixner
c438f140cc tracing: Make ftrace_trace_userstack() static and conditional
It's only used in trace.c and there is absolutely no point in compiling it
in when user space stack traces are not supported.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.162400595@linutronix.de
2019-04-29 12:37:55 +02:00
Thomas Gleixner
2a820bf749 tracing: Use percpu stack trace buffer more intelligently
The per cpu stack trace buffer usage pattern is odd at best. The buffer has
place for 512 stack trace entries on 64-bit and 1024 on 32-bit. When
interrupts or exceptions nest after the per cpu buffer was acquired the
stacktrace length is hardcoded to 8 entries. 512/1024 stack trace entries
in kernel stacks are unrealistic so the buffer is a complete waste.

Split the buffer into 4 nest levels, which are 128/256 entries per
level. This allows nesting contexts (interrupts, exceptions) to utilize the
cpu buffer for stack retrieval and avoids the fixed length allocation along
with the conditional execution pathes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094803.066064076@linutronix.de
2019-04-29 12:37:55 +02:00
Thomas Gleixner
e7d916632b tracing: Simplify stacktrace retrieval in histograms
The indirection through struct stack_trace is not necessary at all. Use the
storage array based interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.979089273@linutronix.de
2019-04-29 12:37:54 +02:00
Thomas Gleixner
c120bce780 lockdep: Simplify stack trace handling
Replace the indirection through struct stack_trace by using the storage
array based interfaces and storing the information is a small lockdep
specific data structure.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.891724020@linutronix.de
2019-04-29 12:37:54 +02:00
Thomas Gleixner
76b14436b4 lockdep: Remove save argument from check_prev_add()
There is only one caller which hands in save_trace as function pointer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.803362058@linutronix.de
2019-04-29 12:37:53 +02:00
Thomas Gleixner
b1abe4622d lockdep: Remove unused trace argument from print_circular_bug()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.716274532@linutronix.de
2019-04-29 12:37:53 +02:00
Thomas Gleixner
746017ed8d dma/debug: Simplify stracktrace retrieval
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094802.248658135@linutronix.de
2019-04-29 12:37:50 +02:00
Thomas Gleixner
f93877214a latency_top: Simplify stack trace handling
Replace the indirection through struct stack_trace with an invocation of
the storage array based interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094801.683039030@linutronix.de
2019-04-29 12:37:48 +02:00
Thomas Gleixner
1b59562d3a backtrace-test: Simplify stack trace handling
Replace the indirection through struct stack_trace by using the storage
array based interfaces.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094801.501919093@linutronix.de
2019-04-29 12:37:47 +02:00
Thomas Gleixner
e9b98e162a stacktrace: Provide helpers for common stack trace operations
All operations with stack traces are based on struct stack_trace. That's a
horrible construct as the struct is a kitchen sink for input and
output. Quite some usage sites embed it into their own data structures
which creates weird indirections.

There is absolutely no point in doing so. For all use cases a storage array
and the number of valid stack trace entries in the array is sufficient.

Provide helper functions which avoid the struct stack_trace indirection so
the usage sites can be cleaned up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094801.324810708@linutronix.de
2019-04-29 12:37:46 +02:00
Thomas Gleixner
3d9a807291 tracing: Cleanup stack trace code
- Remove the extra array member of stack_dump_trace[] along with the
  ARRAY_SIZE - 1 initialization for struct stack_trace :: max_entries.

  Both are historical leftovers of no value. The stack tracer never exceeds
  the array and there is no extra storage requirement either.

- Make variables which are only used in trace_stack.c static.

- Simplify the enable/disable logic.

- Rename stack_trace_print() as it's using the stack_trace_ namespace. Free
  the name up for stack trace related functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: David Rientjes <rientjes@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: iommu@lists.linux-foundation.org
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org
Cc: dm-devel@redhat.com
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: dri-devel@lists.freedesktop.org
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: linux-arch@vger.kernel.org
Link: https://lkml.kernel.org/r/20190425094801.230654524@linutronix.de
2019-04-29 12:37:46 +02:00
Gerald Schaefer
7a5da02de8 locking/lockdep: check for freed initmem in static_obj()
The following warning occurred on s390:
WARNING: CPU: 0 PID: 804 at kernel/locking/lockdep.c:1025 lockdep_register_key+0x30/0x150

This is because the check in static_obj() assumes that all memory within
[_stext, _end] belongs to static objects, which at least for s390 isn't
true. The init section is also part of this range, and freeing it allows
the buddy allocator to allocate memory from it. We have virt == phys for
the kernel on s390, so that such allocations would then have addresses
within the range [_stext, _end].

To fix this, introduce arch_is_kernel_initmem_freed(), similar to
arch_is_kernel_text/data(), and add it to the checks in static_obj().
This will always return 0 on architectures that do not define
arch_is_kernel_initmem_freed. On s390, it will return 1 if initmem has
been freed and the address is in the range [__init_begin, __init_end].

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2019-04-29 10:47:10 +02:00
Jakub Kicinski
94b5f312cf locking/static_key: Don't take sleeping locks in __static_key_slow_dec_deferred()
Changing jump_label state is protected by jump_label_lock().
Rate limited static_key_slow_dec(), however, will never
directly call jump_label_update(), it will schedule a delayed
work instead.  Therefore it's unnecessary to take both the
cpus_read_lock() and jump_label_lock().

This allows static_key_slow_dec_deferred() to be called
from atomic contexts, like socket destructing in net/tls,
without the need for another indirection.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: alexei.starovoitov@gmail.com
Cc: ard.biesheuvel@linaro.org
Cc: oss-drivers@netronome.com
Cc: yamada.masahiro@socionext.com
Link: https://lkml.kernel.org/r/20190330000854.30142-4-jakub.kicinski@netronome.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-29 08:29:21 +02:00
Jakub Kicinski
b92e793bbe locking/static_key: Factor out the fast path of static_key_slow_dec()
static_key_slow_dec() checks if the atomic enable count is larger
than 1, and if so there decrements it before taking the jump_label_lock.
Move this logic into a helper for reuse in rate limitted keys.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: alexei.starovoitov@gmail.com
Cc: ard.biesheuvel@linaro.org
Cc: oss-drivers@netronome.com
Cc: yamada.masahiro@socionext.com
Link: https://lkml.kernel.org/r/20190330000854.30142-3-jakub.kicinski@netronome.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-29 08:29:21 +02:00
Jakub Kicinski
ad282a8117 locking/static_key: Add support for deferred static branches
Add deferred static branches.  We can't unfortunately use the
nice trick of encapsulating the entire structure in true/false
variants, because the inside has to be either struct static_key_true
or struct static_key_false.  Use defines to pass the appropriate
members to the helpers separately.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: alexei.starovoitov@gmail.com
Cc: ard.biesheuvel@linaro.org
Cc: oss-drivers@netronome.com
Cc: yamada.masahiro@socionext.com
Link: https://lkml.kernel.org/r/20190330000854.30142-2-jakub.kicinski@netronome.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-29 08:29:20 +02:00
Frederic Weisbecker
948f83768a locking/lockdep: Test all incompatible scenarios at once in check_irq_usage()
check_prev_add_irq() tests all incompatible scenarios one after the
other while adding a lock (@next) to a tree dependency (@prev):

	LOCK_USED_IN_HARDIRQ          vs         LOCK_ENABLED_HARDIRQ
	LOCK_USED_IN_HARDIRQ_READ     vs         LOCK_ENABLED_HARDIRQ
	LOCK_USED_IN_SOFTIRQ          vs         LOCK_ENABLED_SOFTIRQ
	LOCK_USED_IN_SOFTIRQ_READ     vs         LOCK_ENABLED_SOFTIRQ

Also for these four scenarios, we must at least iterate the @prev
backward dependency. Then if it matches the relevant LOCK_USED_* bit,
we must also iterate the @next forward dependency.

Therefore in the best case we iterate 4 times, in the worst case 8 times.

A different approach can let us divide the number of branch iterations
by 4:

1) Iterate through @prev backward dependencies and accumulate all the IRQ
   uses in a single mask. In the best case where the current lock hasn't
   been used in IRQ, we stop here.

2) Iterate through @next forward dependencies and try to find a lock
   whose usage is exclusive to the accumulated usages gathered in the
   previous step. If we find one (call it @lockA), we have found an
   incompatible use, otherwise we stop here. Only bad locking scenario
   go further. So a sane verification stop here.

3) Iterate again through @prev backward dependency and find the lock
   whose usage matches @lockA in term of incompatibility. Call that
   lock @lockB.

4) Report the incompatible usages of @lockA and @lockB

If no incompatible use is found, the verification never goes beyond
step 2 which means at most two iterations.

The following compares the execution measurements of the function
check_prev_add_irq():

            Number of  calls   | Avg (ns)  | Stdev (ns) | Total time (ns)
  ------------------------------------------------------------------------
  Mainline         8452        |  2652     |    11962   |    22415143
  This patch       8452        |  1518     |     7090   |    12835602

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/r/20190402160244.32434-5-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-29 08:29:20 +02:00
Nicholas Piggin
9b019acb72 sched/nohz: Run NOHZ idle load balancer on HK_FLAG_MISC CPUs
The NOHZ idle balancer runs on the lowest idle CPU. This can
interfere with isolated CPUs, so confine it to HK_FLAG_MISC
housekeeping CPUs.

HK_FLAG_SCHED is not used for this because it is not set anywhere
at the moment. This could be folded into HK_FLAG_SCHED once that
option is fixed.

The problem was observed with increased jitter on an application
running on CPU0, caused by NOHZ idle load balancing being run on
CPU1 (an SMT sibling).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190412042613.28930-1-npiggin@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-29 08:27:03 +02:00
Al Viro
795d673af1 audit_compare_dname_path(): switch to const struct qstr *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-28 20:33:43 -04:00
David S. Miller
5f0d736e7f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-04-28

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Introduce BPF socket local storage map so that BPF programs can store
   private data they associate with a socket (instead of e.g. separate hash
   table), from Martin.

2) Add support for bpftool to dump BTF types. This is done through a new
   `bpftool btf dump` sub-command, from Andrii.

3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which
   was currently not supported since skb was used to lookup netns, from Stanislav.

4) Add an opt-in interface for tracepoints to expose a writable context
   for attached BPF programs, used here for NBD sockets, from Matt.

5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel.

6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to
   support tunnels such as sit. Add selftests as well, from Willem.

7) Various smaller misc fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 08:42:41 -04:00
Johannes Berg
ef6243acb4 genetlink: optionally validate strictly/dumps
Add options to strictly validate messages and dump messages,
sometimes perhaps validating dump messages non-strictly may
be required, so add an option for that as well.

Since none of this can really be applied to existing commands,
set the options everwhere using the following spatch:

    @@
    identifier ops;
    expression X;
    @@
    struct genl_ops ops[] = {
    ...,
     {
            .cmd = X,
    +       .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
            ...
     },
    ...
    };

For new commands one should just not copy the .validate 'opt-out'
flags and thus get strict validation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:22 -04:00
Johannes Berg
8cb081746c netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:

 1) liberal (default)
     - undefined (type >= max) & NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted
     - garbage at end of message accepted
 2) strict (opt-in)
     - NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted

Split out parsing strictness into four different options:
 * TRAILING     - check that there's no trailing data after parsing
                  attributes (in message or nested)
 * MAXTYPE      - reject attrs > max known type
 * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
 * STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
 * nla_parse           -> nla_parse_deprecated
 * nla_parse_strict    -> nla_parse_deprecated_strict
 * nlmsg_parse         -> nlmsg_parse_deprecated
 * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
 * nla_parse_nested    -> nla_parse_nested_deprecated
 * nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:21 -04:00
Michal Kubecek
ae0be8de9a netlink: make nla_nest_start() add NLA_F_NESTED flag
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Linus Torvalds
15d4e26b81 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "Fix a division by zero bug that can trigger in the NUMA placement
  code"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/numa: Fix a possible divide-by-zero
2019-04-27 10:18:40 -07:00
Martin KaFai Lau
6ac99e8f23 bpf: Introduce bpf sk local storage
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
  into different bpf running context.

this patch is another effort to make bpf's network programming
more intuitive to do (together with memory and performance benefit).

When bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuples (src/dst ip/port) as the key.
If multiple bpf progs require to store different sk data, multiple maps
have to be defined.  Hence, wasting memory to store the duplicated
keys (i.e. 4 tuples here) in each of the bpf map.
[ The smallest key could be the sk pointer itself which requires
  some enhancement in the verifier and it is a separate topic. ]

Also, the bpf prog needs to clean up the elem when sk is freed.
Otherwise, the bpf map will become full and un-usable quickly.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).

The size of the map needs to be predefined which then usually ended-up
with an over-provisioned map in production.  Even the map was re-sizable,
while the sk naturally come and go away already, this potential re-size
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxy-ing through a bpf map.

This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use.  The space will be allocated when the first bpf
prog has created data for this particular sk.

The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update).  bpf_spin_lock should be used if the inline update needs
to be protected.

BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs.  The map enforces
BTF to allow printing the sk-local-storage during a system-wise
sk dump (e.g. "ss -ta") in the future.

The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete
a "sk-local-storage" data from a particular sk.
Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
particular "type" of "sk-local-storage" data can then be stored in any sk.

The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
   map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
   when the map is freed.

sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage").  When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list.  The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.

To allow very fast lookup, it should be as fast as looking up an
array at a stable-offset.  At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "type" that the
system can have.  Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[] which is a stable sized array.  Each
"sk-local-storage" type has a stable offset to the cache[] array.
In the future, a map's flag could be introduced to do cache
opt-out/enforcement if it became necessary.

The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share map.  On the program side, having a few bpf_progs
running in the networking hotpath is already a lot.  The bpf_prog
should have already consolidated the existing sock-key-ed map usage
to minimize the map lookup penalty.  16 has enough runway to grow.

All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.

bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete().  The verifier can then enforce the
ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() also allows to
"create" new elem if one does not exist in the sk.  It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.

Misc notes:
----------
1. map_get_next_key is not supported.  From the userspace syscall
   perspective,  the map has the socket fd as the key while the map
   can be shared by pinned-file or map-id.

   Since btf is enforced, the existing "ss" could be enhanced to pretty
   print the local-storage.

   Supporting a kernel defined btf with 4 tuples as the return key could
   be explored later also.

2. The sk->sk_lock cannot be acquired.  Atomic operations is used instead.
   e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
   Please refer to the source code comments for the details in
   synchronization cases and considerations.

3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.

Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:

One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (verifier is modified to support sk ptr as the key
That should have shortened the key lookup time.)

Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.

Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt.  netperf is used to drive
data with 4096 connected UDP sockets.

BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run)
27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
    loaded_at 2019-04-15T13:46:39-0700  uid 0
    xlated 344B  jited 258B  memlock 4096B  map_ids 16
    btf_id 5

BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
    loaded_at 2019-04-15T13:47:54-0700  uid 0
    xlated 168B  jited 156B  memlock 4096B  map_ids 17
    btf_id 6

Here is a high-level picture on how are the objects organized:

       sk
    ┌──────┐
    │      │
    │      │
    │      │
    │*sk_bpf_storage─────▶ bpf_sk_storage
    └──────┘                 ┌───────┐
                 ┌───────────┤ list  │
                 │           │       │
                 │           │       │
                 │           │       │
                 │           └───────┘
                 │
                 │     elem
                 │  ┌────────┐
                 ├─▶│ snode  │
                 │  ├────────┤
                 │  │  data  │          bpf_map
                 │  ├────────┤        ┌─────────┐
                 │  │map_node│◀─┬─────┤  list   │
                 │  └────────┘  │     │         │
                 │              │     │         │
                 │     elem     │     │         │
                 │  ┌────────┐  │     └─────────┘
                 └─▶│ snode  │  │
                    ├────────┤  │
   bpf_map          │  data  │  │
 ┌─────────┐        ├────────┤  │
 │  list   ├───────▶│map_node│  │
 │         │        └────────┘  │
 │         │                    │
 │         │           elem     │
 └─────────┘        ┌────────┐  │
                 ┌─▶│ snode  │  │
                 │  ├────────┤  │
                 │  │  data  │  │
                 │  ├────────┤  │
                 │  │map_node│◀─┘
                 │  └────────┘
                 │
                 │
                 │          ┌───────┐
     sk          └──────────│ list  │
  ┌──────┐                  │       │
  │      │                  │       │
  │      │                  │       │
  │      │                  └───────┘
  │*sk_bpf_storage───────▶bpf_sk_storage
  └──────┘

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27 09:07:04 -07:00
Matt Mullins
9df1c28bb7 bpf: add writable context for raw tracepoints
This is an opt-in interface that allows a tracepoint to provide a safe
buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
The size of the buffer must be a compile-time constant, and is checked
before allowing a BPF program to attach to a tracepoint that uses this
feature.

The pointer to this buffer will be the first argument of tracepoints
that opt in; the pointer is valid and can be bpf_probe_read() by both
BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
programs that attach to such a tracepoint, but the buffer to which it
points may only be written by the latter.

Signed-off-by: Matt Mullins <mmullins@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-26 19:04:19 -07:00
Linus Torvalds
e9e1a2e7b4 There tracing fixes:
- Use "nosteal" for ring buffer splice pages
  - Memory leak fix in error path of trace_pid_write()
  - Fix preempt_enable_no_resched() (use preempt_enable()) in ring buffer code
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXMMoghQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmB1AQDfpVxYxcmxibBBAM6fZyILYpKqDWmy
 ut6gHZ+GHhQT4AEAwSRsC6V4yO3d5dJFpkcQXUj1v+Ip9XU+dv//s8O6tAI=
 =LsG/
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Three tracing fixes:

   - Use "nosteal" for ring buffer splice pages

   - Memory leak fix in error path of trace_pid_write()

   - Fix preempt_enable_no_resched() (use preempt_enable()) in ring
     buffer code"

* tag 'trace-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  trace: Fix preempt_enable_no_resched() abuse
  tracing: Fix a memory leak by early error exit in trace_pid_write()
  tracing: Fix buffer_ref pipe ops
2019-04-26 11:09:55 -07:00
Al Viro
6921d4ebe4 audit_update_watch(): switch to const struct qstr *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26 14:09:49 -04:00
Al Viro
e43e9c339a fsnotify: switch send_to_group() and ->handle_event to const struct qstr *
note that conditions surrounding accesses to dname in audit_watch_handle_event()
and audit_mark_handle_event() guarantee that dname won't have been NULL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-26 13:51:03 -04:00
Peter Zijlstra
d6097c9e44 trace: Fix preempt_enable_no_resched() abuse
Unless the very next line is schedule(), or implies it, one must not use
preempt_enable_no_resched(). It can cause a preemption to go missing and
thereby cause arbitrary delays, breaking the PREEMPT=y invariant.

Link: http://lkml.kernel.org/r/20190423200318.GY14281@hirez.programming.kicks-ass.net

Cc: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: the arch/x86 maintainers <x86@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: stable@vger.kernel.org
Fixes: 2c2d7329d8 ("tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-26 11:45:03 -04:00
Wenwen Wang
91862cc786 tracing: Fix a memory leak by early error exit in trace_pid_write()
In trace_pid_write(), the buffer for trace parser is allocated through
kmalloc() in trace_parser_get_init(). Later on, after the buffer is used,
it is then freed through kfree() in trace_parser_put(). However, it is
possible that trace_pid_write() is terminated due to unexpected errors,
e.g., ENOMEM. In that case, the allocated buffer will not be freed, which
is a memory leak bug.

To fix this issue, free the allocated buffer when an error is encountered.

Link: http://lkml.kernel.org/r/1555726979-15633-1-git-send-email-wang6495@umn.edu

Fixes: f4d34a87e9 ("tracing: Use pid bitmap instead of a pid array for set_event_pid")
Cc: stable@vger.kernel.org
Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-26 11:45:03 -04:00
Jann Horn
b987222654 tracing: Fix buffer_ref pipe ops
This fixes multiple issues in buffer_pipe_buf_ops:

 - The ->steal() handler must not return zero unless the pipe buffer has
   the only reference to the page. But generic_pipe_buf_steal() assumes
   that every reference to the pipe is tracked by the page's refcount,
   which isn't true for these buffers - buffer_pipe_buf_get(), which
   duplicates a buffer, doesn't touch the page's refcount.
   Fix it by using generic_pipe_buf_nosteal(), which refuses every
   attempted theft. It should be easy to actually support ->steal, but the
   only current users of pipe_buf_steal() are the virtio console and FUSE,
   and they also only use it as an optimization. So it's probably not worth
   the effort.
 - The ->get() and ->release() handlers can be invoked concurrently on pipe
   buffers backed by the same struct buffer_ref. Make them safe against
   concurrency by using refcount_t.
 - The pointers stored in ->private were only zeroed out when the last
   reference to the buffer_ref was dropped. As far as I know, this
   shouldn't be necessary anyway, but if we do it, let's always do it.

Link: http://lkml.kernel.org/r/20190404215925.253531-1-jannh@google.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Fixes: 73a757e631 ("ring-buffer: Return reader page back into existing ring buffer")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-26 11:44:39 -04:00
Will Deacon
6b4f4bc9cb locking/futex: Allow low-level atomic operations to return -EAGAIN
Some futex() operations, including FUTEX_WAKE_OP, require the kernel to
perform an atomic read-modify-write of the futex word via the userspace
mapping. These operations are implemented by each architecture in
arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
are called in atomic context with the relevant hash bucket locks held.

Although these routines may return -EFAULT in response to a page fault
generated when accessing userspace, they are expected to succeed (i.e.
return 0) in all other cases. This poses a problem for architectures
that do not provide bounded forward progress guarantees or fairness of
contended atomic operations and can lead to starvation in some cases.

In these problematic scenarios, we must return back to the core futex
code so that we can drop the hash bucket locks and reschedule if
necessary, much like we do in the case of a page fault.

Allow architectures to return -EAGAIN from their implementations of
arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
will cause the core futex code to reschedule if necessary and return
back to the architecture code later on.

Cc: <stable@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2019-04-26 13:57:31 +01:00
Will Deacon
cbafee55b5 Merge branch 'core/speculation' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-next/mitigations
Pull in core support for the "mitigations=" cmdline option from Thomas
Gleixner via -tip, which we can build on top of when we expose our
mitigation state via sysfs.
2019-04-26 13:32:20 +01:00
David S. Miller
ad759c9069 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-04-25

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) the bpf verifier fix to properly mark registers in all stack frames, from Paul.

2) preempt_enable_no_resched->preempt_enable fix, from Peter.

3) other misc fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 01:54:42 -04:00
David S. Miller
8b44836583 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easy cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25 23:52:29 -04:00
Paul Chaignon
c6a9efa1d8 bpf: mark registers in all frames after pkt/null checks
In case of a null check on a pointer inside a subprog, we should mark all
registers with this pointer as either safe or unknown, in both the current
and previous frames.  Currently, only spilled registers and registers in
the current frame are marked.  Packet bound checks in subprogs have the
same issue.  This patch fixes it to mark registers in previous frames as
well.

A good reproducer for null checks looks as follow:

1: ptr = bpf_map_lookup_elem(map, &key);
2: ret = subprog(ptr) {
3:   return ptr != NULL;
4: }
5: if (ret)
6:   value = *ptr;

With the above, the verifier will complain on line 6 because it sees ptr
as map_value_or_null despite the null check in subprog 1.

Note that this patch fixes another resulting bug when using
bpf_sk_release():

1: sk = bpf_sk_lookup_tcp(...);
2: subprog(sk) {
3:   if (sk)
4:     bpf_sk_release(sk);
5: }
6: if (!sk)
7:   return 0;
8: return 1;

In the above, mark_ptr_or_null_regs will warn on line 6 because it will
try to free the reference state, even though it was already freed on
line 3.

Fixes: f4d7e40a5b ("bpf: introduce function calls (verification)")
Signed-off-by: Paul Chaignon <paul.chaignon@orange.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-25 17:20:06 -07:00
Tycho Andersen
7a0df7fbc1 seccomp: Make NEW_LISTENER and TSYNC flags exclusive
As the comment notes, the return codes for TSYNC and NEW_LISTENER
conflict, because they both return positive values, one in the case of
success and one in the case of error. So, let's disallow both of these
flags together.

While this is technically a userspace break, all the users I know
of are still waiting on me to land this feature in libseccomp, so I
think it'll be safe. Also, at present my use case doesn't require
TSYNC at all, so this isn't a big deal to disallow. If someone
wanted to support this, a path forward would be to add a new flag like
TSYNC_AND_LISTENER_YES_I_UNDERSTAND_THAT_TSYNC_WILL_JUST_RETURN_EAGAIN,
but the use cases are so different I don't see it really happening.

Finally, it's worth noting that this does actually fix a UAF issue: at the
end of seccomp_set_mode_filter(), we have:

        if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
                if (ret < 0) {
                        listener_f->private_data = NULL;
                        fput(listener_f);
                        put_unused_fd(listener);
                } else {
                        fd_install(listener, listener_f);
                        ret = listener;
                }
        }
out_free:
        seccomp_filter_free(prepared);

But if ret > 0 because TSYNC raced, we'll install the listener fd and then
free the filter out from underneath it, causing a UAF when the task closes
it or dies. This patch also switches the condition to be simply if (ret),
so that if someone does add the flag mentioned above, they won't have to
remember to fix this too.

Reported-by: syzbot+b562969adb2e04af3442@syzkaller.appspotmail.com
Fixes: 6a21cc50f0 ("seccomp: add a return code to trap to userspace")
CC: stable@vger.kernel.org # v5.0+
Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
2019-04-25 15:55:58 -07:00
Stanislav Fomichev
118c8e9ae6 bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_type
target_fd is target namespace. If there is a flow dissector BPF program
attached to that namespace, its (single) id is returned.

v5:
* drop net ref right after rcu unlock (Daniel Borkmann)

v4:
* add missing put_net (Jann Horn)

v3:
* add missing inline to skb_flow_dissector_prog_query static def
  (kbuild test robot <lkp@intel.com>)

v2:
* don't sleep in rcu critical section (Jakub Kicinski)
* check input prog_cnt (exit early)

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-25 23:49:06 +02:00
Kimberly Brown
70283454c9 livepatch: Replace klp_ktype_patch's default_attrs with groups
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace klp_ktype_patch's default_attrs field
with default_groups and use the ATTRIBUTE_GROUPS macro to create
klp_patch_groups.

This patch was tested by loading the livepatch-sample module and
verifying that the sysfs files for the attributes in the default groups
were created.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25 22:06:11 +02:00
Kimberly Brown
9782adeb3d cpufreq: schedutil: Replace default_attrs field with groups
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace sugov_tunables_ktype's default_attrs field
with default groups. Change "sugov_attributes" to "sugov_attrs" and use
the ATTRIBUTE_GROUPS macro to create sugov_groups.

This patch was tested by setting the scaling governor to schedutil and
verifying that the sysfs files for the attributes in the default groups
were created.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25 22:06:11 +02:00
Kimberly Brown
2064fbc779 padata: Replace padata_attr_type default_attrs field with groups
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace padata_attr_type's default_attrs field
with default_groups and use the ATTRIBUTE_GROUPS macro to create
padata_default_groups.

This patch was tested by loading the pcrypt module and verifying that
the sysfs files for the attributes in the default groups were created.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25 22:06:11 +02:00
Kimberly Brown
52ba92f588 irqdesc: Replace irq_kobj_type's default_attrs field with groups
The kobj_type default_attrs field is being replaced by the
default_groups field. Replace irq_kobj_type's default_attrs field with
default_groups and use the ATTRIBUTE_GROUPS macro to create irq_groups.

This patch was tested by verifying that the sysfs files for the
attributes in the default groups were created.

Signed-off-by: Kimberly Brown <kimbrownkd@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-25 22:06:11 +02:00
Xie XiuQi
a860fa7b96 sched/numa: Fix a possible divide-by-zero
sched_clock_cpu() may not be consistent between CPUs. If a task
migrates to another CPU, then se.exec_start is set to that CPU's
rq_clock_task() by update_stats_curr_start(). Specifically, the new
value might be before the old value due to clock skew.

So then if in numa_get_avg_runtime() the expression:

  'now - p->last_task_numa_placement'

ends up as -1, then the divider '*period + 1' in task_numa_placement()
is 0 and things go bang. Similar to update_curr(), check if time goes
backwards to avoid this.

[ peterz: Wrote new changelog. ]
[ mingo: Tweaked the code comment. ]

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cj.chengjian@huawei.com
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-25 19:58:54 +02:00
Eric Biggers
877b5691f2 crypto: shash - remove shash_desc::flags
The flags field in 'struct shash_desc' never actually does anything.
The only ostensibly supported flag is CRYPTO_TFM_REQ_MAY_SLEEP.
However, no shash algorithm ever sleeps, making this flag a no-op.

With this being the case, inevitably some users who can't sleep wrongly
pass MAY_SLEEP.  These would all need to be fixed if any shash algorithm
actually started sleeping.  For example, the shash_ahash_*() functions,
which wrap a shash algorithm with the ahash API, pass through MAY_SLEEP
from the ahash API to the shash API.  However, the shash functions are
called under kmap_atomic(), so actually they're assumed to never sleep.

Even if it turns out that some users do need preemption points while
hashing large buffers, we could easily provide a helper function
crypto_shash_update_large() which divides the data into smaller chunks
and calls crypto_shash_update() and cond_resched() for each chunk.  It's
not necessary to have a flag in 'struct shash_desc', nor is it necessary
to make individual shash algorithms aware of this at all.

Therefore, remove shash_desc::flags, and document that the
crypto_shash_*() functions can be called from any context.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-04-25 15:38:12 +08:00
Dan Carpenter
148a97d5a0 dma-mapping: remove an unnecessary NULL check
We already dereferenced "dev" when we called get_dma_ops() so this NULL
check is too late.  We're not supposed to pass NULL "dev" pointers to
dma_alloc_attrs().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-24 16:28:13 +02:00
Tycho Andersen
6beff00b79 seccomp: fix up grammar in comment
This sentence is kind of a train wreck anyway, but at least dropping the
extra pronoun helps somewhat.

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <jamorris@linux.microsoft.com>
2019-04-23 16:21:12 -07:00
David S. Miller
2843ba2ec7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-04-22

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) allow stack/queue helpers from more bpf program types, from Alban.

2) allow parallel verification of root bpf programs, from Alexei.

3) introduce bpf sysctl hook for trusted root cases, from Andrey.

4) recognize var/datasec in btf deduplication, from Andrii.

5) cpumap performance optimizations, from Jesper.

6) verifier prep for alu32 optimization, from Jiong.

7) libbpf xsk cleanup, from Magnus.

8) other various fixes and cleanups.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:35:55 -07:00
Alexei Starovoitov
45a73c17bf bpf: drop bpf_verifier_lock
Drop bpf_verifier_lock for root to avoid being DoS-ed by unprivileged.
The BPF verifier is now fully parallel.
All unpriv users are still serialized by bpf_verifier_lock to avoid
exhausting kernel memory by running N parallel verifications.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 01:50:43 +02:00
Alexei Starovoitov
7df737e991 bpf: remove global variables
Move three global variables protected by bpf_verifier_lock into
'struct bpf_verifier_env' to allow parallel verification.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 01:50:43 +02:00
Wenwen Wang
70c4cf17e4 audit: fix a memory leak bug
In audit_rule_change(), audit_data_to_entry() is firstly invoked to
translate the payload data to the kernel's rule representation. In
audit_data_to_entry(), depending on the audit field type, an audit tree may
be created in audit_make_tree(), which eventually invokes kmalloc() to
allocate the tree.  Since this tree is a temporary tree, it will be then
freed in the following execution, e.g., audit_add_rule() if the message
type is AUDIT_ADD_RULE or audit_del_rule() if the message type is
AUDIT_DEL_RULE. However, if the message type is neither AUDIT_ADD_RULE nor
AUDIT_DEL_RULE, i.e., the default case of the switch statement, this
temporary tree is not freed.

To fix this issue, only allocate the tree when the type is AUDIT_ADD_RULE
or AUDIT_DEL_RULE.

Signed-off-by: Wenwen Wang <wang6495@umn.edu>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-04-22 11:22:03 -04:00
Steven Rostedt (VMware)
52fde6e70c function_graph: Have selftest also emulate tr->reset() as it did with tr->init()
The function_graph boot up self test emulates the tr->init() function in
order to add a wrapper around the function graph tracer entry code to test
for lock ups and such. But it does not emulate the tr->reset(), and just
calls the function_graph tracer tr->reset() function which will use its own
fgraph_ops to unregister function tracing with. As the fgraph_ops is
becoming more meaningful with the register_ftrace_graph() and
unregister_ftrace_graph() functions, the two need to be the same. The
emulated tr->init() uses its own fgraph_ops descriptor, which means the
unregister_ftrace_graph() must use the same ftrace_ops, which the selftest
currently does not do. By emulating the tr->reset() as the selftest does
with the tr->init() it will be able to pass the same fgraph_ops descriptor
to the unregister_ftrace_graph() as it did with the register_ftrace_graph().

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-04-21 19:46:56 -04:00
Linus Torvalds
e899cc3b3d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
 "Misc clocksource driver fixes, and a sched-clock wrapping fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()
  clocksource/drivers/timer-ti-dm: Remove omap_dm_timer_set_load_start
  clocksource/drivers/oxnas: Fix OX820 compatible
  clocksource/drivers/arm_arch_timer: Remove unneeded pr_fmt macro
  clocksource/drivers/npcm: select TIMER_OF
2019-04-20 10:10:49 -07:00
Linus Torvalds
b25c69b9d5 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes:
   - various tooling fixes
   - kretprobe fixes
   - kprobes annotation fixes
   - kprobes error checking fix
   - fix the default events for AMD Family 17h CPUs
   - PEBS fix
   - AUX record fix
   - address filtering fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kprobes: Avoid kretprobe recursion bug
  kprobes: Mark ftrace mcount handler functions nokprobe
  x86/kprobes: Verify stack frame on kretprobe
  perf/x86/amd: Add event map for AMD Family 17h
  perf bpf: Return NULL when RB tree lookup fails in perf_env__find_btf()
  perf tools: Fix map reference counting
  perf evlist: Fix side band thread draining
  perf tools: Check maps for bpf programs
  perf bpf: Return NULL when RB tree lookup fails in perf_env__find_bpf_prog_info()
  tools include uapi: Sync sound/asound.h copy
  perf top: Always sample time to satisfy needs of use of ordered queuing
  perf evsel: Use hweight64() instead of hweight_long(attr.sample_regs_user)
  tools lib traceevent: Fix missing equality check for strcmp
  perf stat: Disable DIR_FORMAT feature for 'perf stat record'
  perf scripts python: export-to-sqlite.py: Fix use of parent_id in calls_view
  perf header: Fix lock/unlock imbalances when processing BPF/BTF info
  perf/x86: Fix incorrect PEBS_REGS
  perf/ring_buffer: Fix AUX record suppression
  perf/core: Fix the address filtering fix
  kprobes: Fix error check when reusing optimized probes
2019-04-20 10:05:02 -07:00
Linus Torvalds
2b4cf5850d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "A deadline scheduler warning/race fix, and a cfs_period_us quota
  calculation workaround where the real fix looks too involved to merge
  immediately"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Correctly handle active 0-lag timers
  sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
2019-04-20 09:53:36 -07:00
Linus Torvalds
de3af9a990 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "A lockdep warning fix and a script execution fix when atomics are
  generated"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/atomics: Don't assume that scripts are executable
  locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again
2019-04-20 09:38:01 -07:00
Colin Ian King
ad2e379def sched/debug: Fix spelling mistake "logaritmic" -> "logarithmic"
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20181128152350.13622-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 21:04:49 +02:00
Roman Gushchin
4c476d8cff cgroup: add tracing points for cgroup v2 freezer
Add cgroup:cgroup_freeze and cgroup:cgroup_unfreeze events,
which are using the existing cgroup tracing infrastructure.

Add the cgroup_event event class, which is similar to the cgroup
class, but contains an additional integer field to store a new
value (the level field is dropped).

Also add two tracing events: cgroup_notify_populated and
cgroup_notify_frozen, which are raised in a generic way using
the TRACE_CGROUP_PATH() macro.

This allows to trace cgroup state transitions and is generally
helpful for debugging the cgroup freezer code.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-04-19 11:26:49 -07:00
Roman Gushchin
712e351787 cgroup: make TRACE_CGROUP_PATH irq-safe
To use the TRACE_CGROUP_PATH() macro with css_set_lock
locked, let's make the macro irq-safe.
It's necessary in order to trace cgroup freezer state
transitions (frozen/not frozen), which are happening
with css_set_lock locked.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-04-19 11:26:49 -07:00
Roman Gushchin
76f969e894 cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.

This patch implements freezer for cgroup v2.

Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.

This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.

Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.

* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.

There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 11:26:48 -07:00
Roman Gushchin
4dcabece4c cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock
The number of descendant cgroups and the number of dying
descendant cgroups are currently synchronized using the cgroup_mutex.

The number of descendant cgroups will be required by the cgroup v2
freezer, which will use it to determine if a cgroup is frozen
(depending on total number of descendants and number of frozen
descendants). It's not always acceptable to grab the cgroup_mutex,
especially from quite hot paths (e.g. exit()).

To avoid this, let's additionally synchronize these counters using
the css_set_lock.

So, it's safe to read these counters with either cgroup_mutex or
css_set_lock locked, and for changing both locks should be acquired.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
2019-04-19 11:26:48 -07:00
Roman Gushchin
aade7f9efb cgroup: implement __cgroup_task_count() helper
The helper is identical to the existing cgroup_task_count()
except it doesn't take the css_set_lock by itself, assuming
that the caller does.

Also, move cgroup_task_count() implementation into
kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
2019-04-19 11:26:48 -07:00
Roman Gushchin
50943f3e13 cgroup: rename freezer.c into legacy_freezer.c
Freezer.c will contain an implementation of cgroup v2 freezer,
so let's rename the v1 freezer to avoid naming conflicts.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kernel-team@fb.com
2019-04-19 11:26:48 -07:00
Juri Lelli
cb0c04143b sched/topology: Update init_sched_domains() comment
Holding hotplug lock is not a requirement anymore for callers of sched_
init_domains after commit:

  6acce3ef84 ("sched: Remove get_online_cpus() usage")

Update the relative comment preceding init_sched_domains().

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cgroups@vger.kernel.org
Cc: lizefan@huawei.com
Link: http://lkml.kernel.org/r/20181219133445.31982-2-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 19:44:15 +02:00
Juri Lelli
b6fbbf31d1 cgroup/cpuset: Update stale generate_sched_domains() comments
Commit:

  fc560a26ac ("cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre()")

removed the local list (q) that was used to perform a top-down scan
of all cpusets; however, comments mentioning it were not updated.

Update comments to reflect current implementation.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cgroups@vger.kernel.org
Cc: lizefan@huawei.com
Link: http://lkml.kernel.org/r/20181219133445.31982-1-juri.lelli@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 19:44:14 +02:00
Laurent Gauthier
13e792a19d tick: Fix typos in comments
Signed-off-by: Laurent Gauthier <laurent.gauthier@soccasys.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 19:17:04 +02:00
Sergey Senozhatsky
8f4a8c12ca kernel/watchdog_hld.c: hard lockup message should end with a newline
Separate print_modules() and hard lockup error message.

Before the patch:

  NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437

Link: http://lkml.kernel.org/r/20190412062557.2700-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19 09:46:05 -07:00
Masami Hiramatsu
fabe38ab6b kprobes: Mark ftrace mcount handler functions nokprobe
Mark ftrace mcount handler functions nokprobe since
probing on these functions with kretprobe pushes
return address incorrectly on kretprobe shadow stack.

Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/155094062044.6137.6419622920568680640.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 14:26:06 +02:00
Konstantin Khlebnikov
1a8b4540db sched/core: Check quota and period overflow at usec to nsec conversion
Large values could overflow u64 and pass following sanity checks.

 # echo 18446744073750000 > cpu.cfs_period_us
 # cat cpu.cfs_period_us
 40448

 # echo 18446744073750000 > cpu.cfs_quota_us
 # cat cpu.cfs_quota_us
 40448

After this patch they will fail with -EINVAL.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/155125502079.293431.3947497929372138600.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 13:42:10 +02:00
Konstantin Khlebnikov
5b61d50ab4 sched/core: Handle overflow in cpu_shares_write_u64
Bit shift in scale_load() could overflow shares. This patch saturates
it to MAX_SHARES like following sched_group_set_shares().

Example:

 # echo 9223372036854776832 > cpu.shares
 # cat cpu.shares

Before patch: 1024
After pattch: 262144

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/155125501891.293431.3345233332801109696.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 13:42:10 +02:00
Konstantin Khlebnikov
1a010e29cf sched/rt: Check integer overflow at usec to nsec conversion
Example of unhandled overflows:

 # echo 18446744073709651 > cpu.rt_runtime_us
 # cat cpu.rt_runtime_us
 99

 # echo 18446744073709900 > cpu.rt_period_us
 # cat cpu.rt_period_us
 348

After this patch they will fail with -EINVAL.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 13:42:09 +02:00
Wei Yang
f6c6010a07 mm/resource: Use resource_overlaps() to simplify region_intersects()
The three checks in region_intersects() are basically an open-coded version
of resource_overlaps() - so use the real thing.

Also fix typos in comments while at it.

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: bhelgaas@google.com
Cc: bp@suse.de
Cc: dan.j.williams@intel.com
Cc: jack@suse.cz
Cc: rdunlap@infradead.org
Cc: tiwai@suse.de
Link: http://lkml.kernel.org/r/20190305083432.23675-1-richardw.yang@linux.intel.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 12:59:36 +02:00
Mathieu Desnoyers
83b0b15bcb rseq: Remove superfluous rseq_len from task_struct
The rseq system call, when invoked with flags of "0" or
"RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
specified in uapi linux/rseq.h.

Expecting a fixed size for rseq_len is a design choice that ensures
multiple libraries and application defining __rseq_abi in the same
process agree on its exact size.

Considering that this size is and will always be the same value, there
is no point in saving this value within task_struct rseq_len. Remove
this field from task_struct.

No change in functionality intended.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20190305194755.2602-3-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 12:39:32 +02:00
Mathieu Desnoyers
bff9504bfc rseq: Clean up comments by reflecting removal of event counter
The "event counter" was removed from rseq before it was merged upstream.
However, a few comments in the source code still refer to it. Adapt the
comments to match reality.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/20190305194755.2602-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 12:39:31 +02:00
Joel Savitz
bee9853932 sched/core: Fix typo in comment
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: trivial@kernel.org
Link: http://lkml.kernel.org/r/1551921213-813-1-git-send-email-jsavitz@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 12:22:16 +02:00
Stephen Suryaputra
0bc1998544 ipv6: Add rate limit mask for ICMPv6 messages
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP
message types use larger numeric values, a simple bitmask doesn't fit.
I use large bitmap. The input and output are the in form of list of
ranges. Set the default to rate limit all error messages but Packet Too
Big. For Packet Too Big, use ratemask instead of hard-coded.

There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().

Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.

v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
    isn't defined, expand the description in ip-sysctl.txt and remove
    unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocated. Still is a
    pointer to it is needed because of the way proc_do_large_bitmap work.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18 16:58:37 -07:00
YueHaibing
b1546edcf2 sched/core: Make some functions static
Fix these sparse warnings:

  kernel/sched/core.c:6577:11: warning: symbol 'min_cfs_quota_period' was not declared. Should it be static?
  kernel/sched/core.c:6657:5: warning: symbol 'tg_set_cfs_quota' was not declared. Should it be static?
  kernel/sched/core.c:6670:6: warning: symbol 'tg_get_cfs_quota' was not declared. Should it be static?
  kernel/sched/core.c:6683:5: warning: symbol 'tg_set_cfs_period' was not declared. Should it be static?
  kernel/sched/core.c:6693:6: warning: symbol 'tg_get_cfs_period' was not declared. Should it be static?
  kernel/sched/fair.c:2596:6: warning: symbol 'task_tick_numa' was not declared. Should it be static?

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190418144713.34332-1-yuehaibing@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18 20:28:02 +02:00
Christian Brauner
738a7832d2 signal: use fdget() since we don't allow O_PATH
As stated in the original commit for pidfd_send_signal() we don't allow
to signal processes through O_PATH file descriptors since it is
semantically equivalent to a write on the pidfd.

We already correctly error out right now and return EBADF if an O_PATH
fd is passed.  This is because we use file->f_op to detect whether a
pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
in do_dentry_open() and thus fail the test.

Thus, there is no regression.  It's just semantically correct to use
fdget() and return an error right from there instead of taking a
reference and returning an error later.

Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jann@thejh.net>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-18 08:35:18 -07:00
Ingo Molnar
94e4dcc75a Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM commits from Paul E. McKenney:

 - An LKMM commit adding support for synchronize_srcu_expedited()
 - A couple of straggling RCU flavor consolidation updates
 - Documentation updates.
 - Miscellaneous fixes
 - SRCU updates
 - RCU CPU stall-warning updates
 - Torture-test updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18 14:42:24 +02:00
Chang-An Chen
3f2552f7e9 timers/sched_clock: Prevent generic sched_clock wrap caused by tick_freeze()
tick_freeze() introduced by suspend-to-idle in commit 124cf9117c ("PM /
sleep: Make it possible to quiesce timers during suspend-to-idle") uses
timekeeping_suspend() instead of syscore_suspend() during
suspend-to-idle. As a consequence generic sched_clock will keep going
because sched_clock_suspend() and sched_clock_resume() are not invoked
during suspend-to-idle which can result in a generic sched_clock wrap.

On a ARM system with suspend-to-idle enabled, sched_clock is registered
as "56 bits at 13MHz, resolution 76ns, wraps every 4398046511101ns", which
means the real wrapping duration is 8796093022202ns.

[  134.551779] suspend-to-idle suspend (timekeeping_suspend())
[ 1204.912239] suspend-to-idle resume (timekeeping_resume())
......
[ 1206.912239] suspend-to-idle suspend (timekeeping_suspend())
[ 5880.502807] suspend-to-idle resume (timekeeping_resume())
......
[ 6000.403724] suspend-to-idle suspend (timekeeping_suspend())
[ 8035.753167] suspend-to-idle resume  (timekeeping_resume())
......
[ 8795.786684] (2)[321:charger_thread]......
[ 8795.788387] (2)[321:charger_thread]......
[    0.057226] (0)[0:swapper/0]......
[    0.061447] (2)[0:swapper/2]......

sched_clock was not stopped during suspend-to-idle, and sched_clock_poll
hrtimer was not expired because timekeeping_suspend() was invoked during
suspend-to-idle. It makes sched_clock wrap at kernel time 8796s.

To prevent this, invoke sched_clock_suspend() and sched_clock_resume() in
tick_freeze() together with timekeeping_suspend() and timekeeping_resume().

Fixes: 124cf9117c (PM / sleep: Make it possible to quiesce timers during suspend-to-idle)
Signed-off-by: Chang-An Chen <chang-an.chen@mediatek.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Corey Minyard <cminyard@mvista.com>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: <kuohong.wang@mediatek.com>
Cc: <freddy.hsin@mediatek.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1553828349-8914-1-git-send-email-chang-an.chen@mediatek.com
2019-04-18 14:34:53 +02:00
Nicholas Piggin
471ba0e686 irq_work: Do not raise an IPI when queueing work on the local CPU
The QEMU PowerPC/PSeries machine model was not expecting a self-IPI,
and it may be a bit surprising thing to do, so have irq_work_queue_on
do local queueing when target is the current CPU.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: =?UTF-8?q?C=C3=A9dric=20Le=20Goater?= <clg@kaod.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190409093403.20994-1-npiggin@gmail.com
[ Simplified the preprocessor comments.
  Fixed unbalanced curly brackets pointed out by Thomas. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18 14:07:52 +02:00