Commit Graph

31814 Commits

Author SHA1 Message Date
Krzysztof Kozlowski
fc809bc5ce tracing: Fix Kconfig indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
	$ sed -e 's/^        /\t/' -i */Kconfig

Link: http://lkml.kernel.org/r/20191120133807.12741-1-krzk@kernel.org

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22 19:41:08 -05:00
Xianting Tian
a82a4804b4 ring-buffer: Fix typos in function ring_buffer_producer
Fix spelling and other typos

Link: http://lkml.kernel.org/r/1573916755-32478-1-git-send-email-xianting_tian@126.com

Signed-off-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-22 19:41:08 -05:00
Jakub Kicinski
a9f852e92e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f5489 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b0 ("s390/qeth: consolidate some duplicated HW cmd code").

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-22 16:27:24 -08:00
Linus Torvalds
34c36f4564 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Validate tunnel options length in act_tunnel_key, from Xin Long.

 2) Fix DMA sync bug in gve driver, from Adi Suresh.

 3) TSO kills performance on some r8169 chips due to HW issues, disable
    by default in that case, from Corinna Vinschen.

 4) Fix clock disable mismatch in fec driver, from Chubong Yuan.

 5) Fix interrupt status bits define in hns3 driver, from Huazhong Tan.

 6) Fix workqueue deadlocks in qeth driver, from Julian Wiedmann.

 7) Don't napi_disable() twice in r8152 driver, from Hayes Wang.

 8) Fix SKB extension memory leak, from Florian Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
  r8152: avoid to call napi_disable twice
  MAINTAINERS: Add myself as maintainer of virtio-vsock
  udp: drop skb extensions before marking skb stateless
  net: rtnetlink: prevent underflows in do_setvfinfo()
  can: m_can_platform: remove unnecessary m_can_class_resume() call
  can: m_can_platform: set net_device structure as driver data
  hv_netvsc: Fix send_table offset in case of a host bug
  hv_netvsc: Fix offset usage in netvsc_send_table()
  net-ipv6: IPV6_TRANSPARENT - check NET_RAW prior to NET_ADMIN
  sfc: Only cancel the PPS workqueue if it exists
  nfc: port100: handle command failure cleanly
  net-sysfs: fix netdev_queue_add_kobject() breakage
  r8152: Re-order napi_disable in rtl8152_close
  net: qca_spi: Move reset_count to struct qcaspi
  net: qca_spi: fix receive buffer size check
  net/ibmvnic: Ignore H_FUNCTION return from H_EOI to tolerate XIVE mode
  Revert "net/ibmvnic: Fix EOI when running in XIVE mode"
  net/mlxfw: Verify FSM error code translation doesn't exceed array size
  net/mlx5: Update the list of the PCI supported devices
  net/mlx5: Fix auto group size calculation
  ...
2019-11-22 14:28:14 -08:00
Linus Torvalds
a6b0373ffc Power management regression fix for final 5.4
Fix problems with switching cpufreq drivers on some x86 systems with
 ACPI (and with changing the operation modes of the intel_pstate driver
 on those systems) introduced by recent changes related to the
 management of frequency limits in cpufreq.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl3X1LQSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxAgMQAKFu4e3IZ7vVMo3gu61r9amqVG03prMh
 Ca8lvMd0pxp1wtL6gSkAB0Kces1hwooz4bpA/Khm7igBfdilcNu9gv/d9EvGRCU1
 /iOWHrsYS4gQnWwjJ001x+PokQQEFv+uGw9a6vwlEVU4a6TImlJKF+99Beogzr7U
 tuTzEpim2ruMHFaP5EqjC/XcU+uzyW2rMnc7fH0t55W0saIT2trI6Nj+DJgZWPth
 eRbgS4aOy3Bt9foBCVOFtkXz5VvHS3EUMPI7DxOx3OTt8D469NBDYx/ui8y7frIF
 t3vME/UuWCJmEUJ7o0CqhVcB4TW5gZaIE7P2TO9pSCxPxKtPbROdfrR3snGPLjNl
 syD7yqwuYpDbfi+0v3Mi3fnUqnnsiXPg2mFCl/w0tgIKz1iAv20pOWnCqja4QreG
 Z+Ax1owzrEYlDaV5lDXYZd48dqItj4dMuUBSwXu8RBk+LE/Bjro0jnnK4zUD/M1m
 sukrvgFrIXpTgqSnm8sX4fMyZnEvmNtG2kAYl6F84k2EICbOLkXD9ypBANb/9C3h
 KcSzJFr76de18s5bnMBgoQRvfElbOotr63TyzeRo/6mF0G4+BpyAEsP7FzRSq+6u
 cd+lrCvItzwIgD/5XU1WpkoOoYM//c0Y9JUHKwtnd+FSp1psGJwlhgbbckazQ699
 A5DC7maAkkHb
 =S23p
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management regression fix from Rafael Wysocki:
 "Fix problems with switching cpufreq drivers on some x86 systems with
  ACPI (and with changing the operation modes of the intel_pstate driver
  on those systems) introduced by recent changes related to the
  management of frequency limits in cpufreq"

* tag 'pm-5.4-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: QoS: Invalidate frequency QoS requests after removal
2019-11-22 09:18:16 -08:00
Kusanagi Kouichi
0e4a459f56 tracing: Remove unnecessary DEBUG_FS dependency
Tracing replaced debugfs with tracefs.

Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20191120104350753.EWCT.12796.ppp.dion.ne.jp@dmta0009.auone-net.jp
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-11-22 16:19:13 +01:00
Nicolas Saenz Julienne
a7ba70f178 dma-mapping: treat dev->bus_dma_mask as a DMA limit
Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit should
contain the higher accessible DMA address.

Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-21 18:14:35 +01:00
Christoph Hellwig
d7293f79ca Merge branch 'for-next/zone-dma' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux into dma-mapping-for-next
Pull in a stable branch from the arm64 tree that adds the zone_dma_bits
variable to avoid creating hard to resolve conflicts with that addition.
2019-11-21 18:13:03 +01:00
Paolo Bonzini
46f4f0aabc Merge branch 'kvm-tsx-ctrl' into HEAD
Conflicts:
	arch/x86/kvm/vmx/vmx.c
2019-11-21 12:03:40 +01:00
Alexander Shishkin
c4b7547974 perf/core: Make the mlock accounting simple again
Commit:

  d44248a413 ("perf/core: Rework memory accounting in perf_mmap()")

does a lot of things to the mlock accounting arithmetics, while the only
thing that actually needed to happen is subtracting the part that is
charged to the mm from the part that is charged to the user, so that the
former isn't charged twice.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Cc: songliubraving@fb.com
Link: https://lkml.kernel.org/r/20191120170640.54123-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21 07:37:50 +01:00
Frederic Weisbecker
74722bb223 sched/vtime: Bring up complete kcpustat accessor
Many callsites want to fetch the values of system, user, user_nice, guest
or guest_nice kcpustat fields altogether or at least a pair of these.

In that case calling kcpustat_field() for each requested field brings
unecessary overhead when we could fetch all of them in a row.

So provide kcpustat_cpu_fetch() that fetches the whole kcpustat array
in a vtime safe way under the same RCU and seqcount block.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191121024430.19938-3-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21 07:33:24 +01:00
Frederic Weisbecker
5a1c95580f sched/cputime: Support other fields on kcpustat_field()
Provide support for user, nice, guest and guest_nice fields through
kcpustat_field().

Whether we account the delta to a nice or not nice field is decided on
top of the nice value snapshot taken at the time we call kcpustat_field().
If the nice value of the task has been changed since the last vtime
update, we may have inacurrate distribution of the nice VS unnice
cputime.

However this is considered as a minor issue compared to the proper fix
that would involve interrupting the target on nice updates, which is
undesired on nohz_full CPUs.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com>
Link: https://lkml.kernel.org/r/20191121024430.19938-2-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21 07:33:23 +01:00
David S. Miller
ee5a489fd9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-11-20

The following pull-request contains BPF updates for your *net-next* tree.

We've added 81 non-merge commits during the last 17 day(s) which contain
a total of 120 files changed, 4958 insertions(+), 1081 deletions(-).

There are 3 trivial conflicts, resolve it by always taking the chunk from
196e8ca748:

<<<<<<< HEAD
=======
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
>>>>>>> 196e8ca748

<<<<<<< HEAD
void *bpf_map_area_alloc(u64 size, int numa_node)
=======
static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
>>>>>>> 196e8ca748

<<<<<<< HEAD
        if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
=======
        /* kmalloc()'ed memory can't be mmap()'ed */
        if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>>>>> 196e8ca748

The main changes are:

1) Addition of BPF trampoline which works as a bridge between kernel functions,
   BPF programs and other BPF programs along with two new use cases: i) fentry/fexit
   BPF programs for tracing with practically zero overhead to call into BPF (as
   opposed to k[ret]probes) and ii) attachment of the former to networking related
   programs to see input/output of networking programs (covering xdpdump use case),
   from Alexei Starovoitov.

2) BPF array map mmap support and use in libbpf for global data maps; also a big
   batch of libbpf improvements, among others, support for reading bitfields in a
   relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko.

3) Extend s390x JIT with usage of relative long jumps and loads in order to lift
   the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich.

4) Add BPF audit support and emit messages upon successful prog load and unload in
   order to have a timeline of events, from Daniel Borkmann and Jiri Olsa.

5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode
   (XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson.

6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API
   call named bpf_get_link_xdp_info() for retrieving the full set of prog
   IDs attached to XDP, from Toke Høiland-Jørgensen.

7) Add BTF support for array of int, array of struct and multidimensional arrays
   and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau.

8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo.

9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid
   xdping to be run as standalone, from Jiri Benc.

10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song.

11) Fix a memory leak in BPF fentry test run data, from Colin Ian King.

12) Various smaller misc cleanups and improvements mostly all over BPF selftests and
    samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 18:11:23 -08:00
Daniel Borkmann
196e8ca748 bpf: Switch bpf_map_{area_alloc,area_mmapable_alloc}() to u64 size
Given we recently extended the original bpf_map_area_alloc() helper in
commit fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"),
we need to apply the same logic as in ff1c08e1f7 ("bpf: Change size
to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts,
extend it for bpf-next.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-11-20 23:18:58 +01:00
Daniel Borkmann
91e6015b08 bpf: Emit audit messages upon successful prog load and unload
Allow for audit messages to be emitted upon BPF program load and
unload for having a timeline of events. The load itself is in
syscall context, so additional info about the process initiating
the BPF prog creation can be logged and later directly correlated
to the unload event.

The only info really needed from BPF side is the globally unique
prog ID where then audit user space tooling can query / dump all
info needed about the specific BPF program right upon load event
and enrich the record, thus these changes needed here can be kept
small and non-intrusive to the core.

Raw example output:

  # auditctl -D
  # auditctl -a always,exit -F arch=x86_64 -S bpf
  # ausearch --start recent -m 1334
  [...]
  ----
  time->Wed Nov 20 12:45:51 2019
  type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier"
  type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
  type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD
  ----
  time->Wed Nov 20 12:45:51 2019
type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD
  ----
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org
2019-11-20 13:44:51 -08:00
Christoph Hellwig
68a33b1794 dma-direct: exclude dma_direct_map_resource from the min_low_pfn check
The valid memory address check in dma_capable only makes sense when mapping
normal memory, not when using dma_map_resource to map a device resource.
Add a new boolean argument to dma_capable to exclude that check for the
dma_map_resource case.

Fixes: b12d66278d ("dma-direct: check for overflows on 32 bit DMA addresses")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2019-11-20 20:31:41 +01:00
Christoph Hellwig
4268ac6ae5 dma-direct: don't check swiotlb=force in dma_direct_map_resource
When mapping resources we can't just use swiotlb ram for bounce
buffering.  Switch to a direct dma_capable check instead.

Fixes: cfced78696 ("dma-mapping: remove the default map_resource implementation")
Reported-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
2019-11-20 20:31:41 +01:00
Dan Carpenter
50f579a239 dma-debug: clean up put_hash_bucket()
The put_hash_bucket() is a bit cleaner if it takes an unsigned long
directly instead of a pointer to unsigned long.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-20 20:31:41 +01:00
Christoph Hellwig
56e35f9c5b dma-mapping: drop the dev argument to arch_sync_dma_for_*
These are pure cache maintainance routines, so drop the unused
struct device argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
2019-11-20 20:31:38 +01:00
Luc Van Oostenryck
9e77716a75 fork: fix pidfd_poll()'s return type
pidfd_poll() is defined as returning 'unsigned int' but the
.poll method is declared as returning '__poll_t', a bitwise type.

Fix this by using the proper return type and using the EPOLL
constants instead of the POLL ones, as required for __poll_t.

Fixes: b53b0b9d9a ("pidfd: add polling support")
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: stable@vger.kernel.org # 5.3
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20191120003320.31138-1-luc.vanoostenryck@gmail.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-20 11:48:50 +01:00
Daniel Lezcano
5aa9ba6312 cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()
Modify cpuidle_use_deepest_state() to take an additional exit latency
limit argument to be passed to find_deepest_idle_state() and make
cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for
forced idle.

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Rebase and rearrange code, subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 11:46:18 +01:00
Daniel Lezcano
c55b51a06b cpuidle: Allow idle injection to apply exit latency limit
In some cases it may be useful to specify an exit latency limit for
the idle state to be used during CPU idle time injection.

Instead of duplicating the information in struct cpuidle_device
or propagating the latency limit in the call stack, replace the
use_deepest_state field with forced_latency_limit_ns to represent
that limit, so that the deepest idle state with exit latency within
that limit is forced (i.e. no governors) when it is set.

A zero exit latency limit for forced idle means to use governors in
the usual way (analogous to use_deepest_state equal to "false" before
this change).

Additionally, add play_idle_precise() taking two arguments, the
duration of forced idle and the idle state exit latency limit, both
in nanoseconds, and redefine play_idle() as a wrapper around that
new function.

This change is preparatory, no functional impact is expected.

Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20 11:32:55 +01:00
Rafael J. Wysocki
05ff1ba412 PM: QoS: Invalidate frequency QoS requests after removal
Switching cpufreq drivers (or switching operation modes of the
intel_pstate driver from "active" to "passive" and vice versa)
does not work on some x86 systems with ACPI after commit
3000ce3c52 ("cpufreq: Use per-policy frequency QoS"), because
the ACPI _PPC and thermal code uses the same frequency QoS request
object for a given CPU every time a cpufreq driver is registered
and freq_qos_remove_request() does not invalidate the request after
removing it from its QoS list, so freq_qos_add_request() complains
and fails when that request is passed to it again.

Fix the issue by modifying freq_qos_remove_request() to clear the qos
and type fields of the frequency request pointed to by its argument
after removing it from its QoS list so as to invalidate it.

Fixes: 3000ce3c52 ("cpufreq: Use per-policy frequency QoS")
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-11-20 10:46:42 +01:00
Thomas Gleixner
3ef240eaff futex: Prevent exit livelock
Oleg provided the following test case:

int main(void)
{
	struct sched_param sp = {};

	sp.sched_priority = 2;
	assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

	int lock = vfork();
	if (!lock) {
		sp.sched_priority = 1;
		assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
		_exit(0);
	}

	syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0,0,0);
	return 0;
}

This creates an unkillable RT process spinning in futex_lock_pi() on a UP
machine or if the process is affine to a single CPU. The reason is:

 parent	    	    			child

  set FIFO prio 2

  vfork()			->	set FIFO prio 1
   implies wait_for_child()	 	sched_setscheduler(...)
 			   		exit()
					do_exit()
 					....
					mm_release()
					  tsk->futex_state = FUTEX_STATE_EXITING;
					  exit_futex(); (NOOP in this case)
					  complete() --> wakes parent
  sys_futex()
    loop infinite because
    tsk->futex_state == FUTEX_STATE_EXITING

The same problem can happen just by regular preemption as well:

  task holds futex
  ...
  do_exit()
    tsk->futex_state = FUTEX_STATE_EXITING;

  --> preemption (unrelated wakeup of some other higher prio task, e.g. timer)

  switch_to(other_task)

  return to user
  sys_futex()
	loop infinite as above

Just for the fun of it the futex exit cleanup could trigger the wakeup
itself before the task sets its futex state to DEAD.

To cure this, the handling of the exiting owner is changed so:

   - A refcount is held on the task

   - The task pointer is stored in a caller visible location

   - The caller drops all locks (hash bucket, mmap_sem) and blocks
     on task::futex_exit_mutex. When the mutex is acquired then
     the exiting task has completed the cleanup and the state
     is consistent and can be reevaluated.

This is not a pretty solution, but there is no choice other than returning
an error code to user space, which would break the state consistency
guarantee and open another can of problems including regressions.

For stable backports the preparatory commits ac31c7ff86 .. ba31c1a485
are required as well, but for anything older than 5.3.y the backports are
going to be provided when this hits mainline as the other dependencies for
those kernels are definitely not stable material.

Fixes: 778e9a9c3e ("pi-futex: fix exit races and locking problems")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stable Team <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
2019-11-20 09:40:38 +01:00
Thomas Gleixner
ac31c7ff86 futex: Provide distinct return value when owner is exiting
attach_to_pi_owner() returns -EAGAIN for various cases:

 - Owner task is exiting
 - Futex value has changed

The caller drops the held locks (hash bucket, mmap_sem) and retries the
operation. In case of the owner task exiting this can result in a live
lock.

As a preparatory step for seperating those cases, provide a distinct return
value (EBUSY) for the owner exiting case.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de
2019-11-20 09:40:10 +01:00
Thomas Gleixner
3f186d9748 futex: Add mutex around futex exit
The mutex will be used in subsequent changes to replace the busy looping of
a waiter when the futex owner is currently executing the exit cleanup to
prevent a potential live lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de
2019-11-20 09:40:10 +01:00
Thomas Gleixner
af8cbda2cf futex: Provide state handling for exec() as well
exec() attempts to handle potentially held futexes gracefully by running
the futex exit handling code like exit() does.

The current implementation has no protection against concurrent incoming
waiters. The reason is that the futex state cannot be set to
FUTEX_STATE_DEAD after the cleanup because the task struct is still active
and just about to execute the new binary.

While its arguably buggy when a task holds a futex over exec(), for
consistency sake the state handling can at least cover the actual futex
exit cleanup section. This provides state consistency protection accross
the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the
cleanup has been finished, this cannot prevent subsequent attempts to
attach to the task in case that the cleanup was not successfull in mopping
up all leftovers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de
2019-11-20 09:40:09 +01:00
Thomas Gleixner
4a8e991b91 futex: Sanitize exit state handling
Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock move
the state setting into to the lock section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de
2019-11-20 09:40:09 +01:00
Thomas Gleixner
18f694385c futex: Mark the begin of futex exit explicitly
Instead of relying on PF_EXITING use an explicit state for the futex exit
and set it in the futex exit function. This moves the smp barrier and the
lock/unlock serialization into the futex code.

As with the DEAD state this is restricted to the exit path as exec
continues to use the same task struct.

This allows to simplify that logic in a next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
2019-11-20 09:40:09 +01:00
Thomas Gleixner
f24f22435d futex: Set task::futex_state to DEAD right after handling futex exit
Setting task::futex_state in do_exit() is rather arbitrarily placed for no
reason. Move it into the futex code.

Note, this is only done for the exit cleanup as the exec cleanup cannot set
the state to FUTEX_STATE_DEAD because the task struct is still in active
use.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de
2019-11-20 09:40:08 +01:00
Thomas Gleixner
150d71584b futex: Split futex_mm_release() for exit/exec
To allow separate handling of the futex exit state in the futex exit code
for exit and exec, split futex_mm_release() into two functions and invoke
them from the corresponding exit/exec_mm_release() callsites.

Preparatory only, no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de
2019-11-20 09:40:08 +01:00
Thomas Gleixner
4610ba7ad8 exit/exec: Seperate mm_release()
mm_release() contains the futex exit handling. mm_release() is called from
do_exit()->exit_mm() and from exec()->exec_mm().

In the exit_mm() case PF_EXITING and the futex state is updated. In the
exec_mm() case these states are not touched.

As the futex exit code needs further protections against exit races, this
needs to be split into two functions.

Preparatory only, no functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
2019-11-20 09:40:08 +01:00
Thomas Gleixner
3d4775df0a futex: Replace PF_EXITPIDONE with a state
The futex exit handling relies on PF_ flags. That's suboptimal as it
requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in
the middle of do_exit() to enforce the observability of PF_EXITING in the
futex code.

Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
over to the new state. The PF_EXITING dependency will be cleaned up in a
later step.

This prepares for handling various futex exit issues later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
2019-11-20 09:40:07 +01:00
Thomas Gleixner
ba31c1a485 futex: Move futex exit handling into futex code
The futex exit handling is #ifdeffed into mm_release() which is not pretty
to begin with. But upcoming changes to address futex exit races need to add
more functionality to this exit code.

Split it out into a function, move it into futex code and make the various
futex exit functions static.

Preparatory only and no functional change.

Folded build fix from Borislav.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de
2019-11-20 09:40:07 +01:00
YueHaibing
b2e2f0e6a6 bpf: Make array_map_mmap static
Fix sparse warning:

kernel/bpf/arraymap.c:481:5: warning:
 symbol 'array_map_mmap' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191119142113.15388-1-yuehaibing@huawei.com
2019-11-19 16:57:32 -08:00
Steven Rostedt (VMware)
46f9469247 ftrace: Rename ftrace_graph_stub to ftrace_stub_graph
The ftrace_graph_stub was created and points to ftrace_stub as a way to
assign the functon graph tracer function pointer to a stub function with a
different prototype than what ftrace_stub has and not trigger the C
verifier. The ftrace_graph_stub was created via the linker script
vmlinux.lds.h. Unfortunately, powerpc already uses the name
ftrace_graph_stub for its internal implementation of the function graph
tracer, and even though powerpc would still build, the change via the linker
script broke function tracer on powerpc from working.

By using the name ftrace_stub_graph, which does not exist anywhere else in
the kernel, this should not be a problem.

Link: https://lore.kernel.org/r/1573849732.5937.136.camel@lca.pw

Fixes: b83b43ffc6 ("fgraph: Fix function type mismatches of ftrace_graph_return using ftrace_stub")
Reorted-by: Qian Cai <cai@lca.pw>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-18 11:42:16 -05:00
Steven Rostedt (VMware)
ea806eb3ea ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization
If a direct ftrace callback is at a location that does not have any other
ftrace helpers attached to it, it is possible to simply just change the
text to call the new caller (if the architecture supports it). But this
requires special architecture code. Currently, modify_ftrace_direct() uses a
trick to add a stub ftrace callback to the location forcing it to call the
ftrace iterator. Then it can change the direct helper to call the new
function in C, and then remove the stub. Removing the stub will have the
location now call the new location that the direct helper is using.

The new helper function does the registering the stub trick, but is a weak
function, allowing an architecture to override it to do something a bit more
direct.

Link: https://lore.kernel.org/r/20191115215125.mbqv7taqnx376yed@ast-mbp.dhcp.thefacebook.com

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-18 11:42:09 -05:00
Alexander Shishkin
36b3db03b4 perf/core: Fix the mlock accounting, again
Commit:

  5e6c3c7b1e ("perf/aux: Fix tracking of auxiliary trace buffer allocation")

tried to guess the correct combination of arithmetic operations that would
undo the AUX buffer's mlock accounting, and failed, leaking the bottom part
when an allocation needs to be charged partially to both user->locked_vm
and mm->pinned_vm, eventually leaving the user with no locked bonus:

  $ perf record -e intel_pt//u -m1,128 uname
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.061 MB perf.data ]

  $ perf record -e intel_pt//u -m1,128 uname
  Permission error mapping pages.
  Consider increasing /proc/sys/kernel/perf_event_mlock_kb,
  or try again with a smaller value of -m/--mmap_pages.
  (current value: 1,128)

Fix this by subtracting both locked and pinned counts when AUX buffer is
unmapped.

Reported-by: Thomas Richter <tmricht@linux.ibm.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 16:27:37 +01:00
Vincent Guittot
bef69dd878 sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has rate limitation.

When a task is attached on a CPU, we have this call path:

update_load_avg()
  update_cfs_rq_load_avg()
    cfs_rq_util_change -- > trig frequency update
  attach_entity_load_avg()
    cfs_rq_util_change -- > trig frequency update

The 1st frequency update will not take into account the utilization of the
newly attached task and the 2nd one might be discarded because of rate
limitation of the cpufreq driver.

update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg() so we can move the call to
cfs_rq_util_change/cpufreq_update_util() into these two functions.

It's also interesting to note that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.

This change will also ensure that cpufreq_update_util() is called even
when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
IRQ, RT or DL PELT signals.

[ mingo: Minor updates. ]

Reported-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@redhat.com
Cc: linux-pm@vger.kernel.org
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Cc: sargun@sargun.me
Cc: srinivas.pandruvada@linux.intel.com
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Fixes: 039ae8bcf7 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:42:26 +01:00
Ingo Molnar
b21feab0b8 Linux 5.4-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3RzgkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN18H/0JZbfIpy8/4Irol
 0va7Aj2fBi1a5oxfqYsMKN0u3GKbN3OV9tQ+7w1eBNGvL72TGadgVTzTY+Im7A9U
 UjboAc7jDPCG+YhIwXFufMiIAq5jDIj6h0LDas7ALsMfsnI/RhTwgNtLTAkyI3dH
 YV/6ljFULwueJHCxzmrYbd1x39PScj3kCNL2pOe6On7rXMKOemY/nbbYYISxY30E
 GMgKApSS+li7VuSqgrKoq5Qaox26LyR2wrXB1ij4pqEJ9xgbnKRLdHuvXZnE+/5p
 46EMirt+yeSkltW3d2/9MoCHaA76ESzWMMDijLx7tPgoTc3RB3/3ZLsm3rYVH+cR
 cRlNNSk=
 =0+Cg
 -----END PGP SIGNATURE-----

Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependencies

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:41:02 +01:00
Vincent Guittot
a9723389cc sched/fair: Add comments for group_type and balancing at SD_NUMA level
Add comments to describe each state of goup_type and to add some details
about the load balance at NUMA level.

[ Valentin Schneider: Updates to the comments. ]
[ mingo: Other updates to the comments. ]

Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1573570243-1903-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:33:12 +01:00
Vincent Guittot
3318544b72 sched/fair: Fix rework of find_idlest_group()
The task, for which the scheduler looks for the idlest group of CPUs, must
be discounted from all statistics in order to get a fair comparison
between groups. This includes utilization, load, nr_running and idle_cpus.

Such unfairness can be easily highlighted with the unixbench execl 1 task.
This test continuously call execve() and the scheduler looks for the idlest
group/CPU on which it should place the task. Because the task runs on the
local group/CPU, the latter seems already busy even if there is nothing
else running on it. As a result, the scheduler will always select another
group/CPU than the local one.

This recovers most of the performance regression on my system from the
recent load-balancer rewrite.

[ mingo: Minor cleanups. ]

Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: hdanton@sina.com
Cc: parth@linux.ibm.com
Cc: pauld@redhat.com
Cc: quentin.perret@arm.com
Cc: riel@surriel.com
Cc: srikar@linux.vnet.ibm.com
Cc: valentin.schneider@arm.com
Fixes: 57abff067a ("sched/fair: Rework find_idlest_group()")
Link: https://lkml.kernel.org/r/1571762798-25900-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18 14:11:56 +01:00
Andrii Nakryiko
fc9702273e bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY
Add ability to memory-map contents of BPF array map. This is extremely useful
for working with BPF global data from userspace programs. It allows to avoid
typical bpf_map_{lookup,update}_elem operations, improving both performance
and usability.

There had to be special considerations for map freezing, to avoid having
writable memory view into a frozen map. To solve this issue, map freezing and
mmap-ing is happening under mutex now:
  - if map is already frozen, no writable mapping is allowed;
  - if map has writable memory mappings active (accounted in map->writecnt),
    map freezing will keep failing with -EBUSY;
  - once number of writable memory mappings drops to zero, map freezing can be
    performed again.

Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
can't be memory mapped either.

For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
to be mmap()'able. We also need to make sure that array data memory is
page-sized and page-aligned, so we over-allocate memory in such a way that
struct bpf_array is at the end of a single page of memory with array->value
being aligned with the start of the second page. On deallocation we need to
accomodate this memory arrangement to free vmalloc()'ed memory correctly.

One important consideration regarding how memory-mapping subsystem functions.
Memory-mapping subsystem provides few optional callbacks, among them open()
and close().  close() is called for each memory region that is unmapped, so
that users can decrease their reference counters and free up resources, if
necessary. open() is *almost* symmetrical: it's called for each memory region
that is being mapped, **except** the very first one. So bpf_map_mmap does
initial refcnt bump, while open() will do any extra ones after that. Thus
number of close() calls is equal to number of open() calls plus one more.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
2019-11-18 11:41:59 +01:00
Andrii Nakryiko
85192dbf4d bpf: Convert bpf_prog refcnt to atomic64_t
Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
and remove artificial 32k limit. This allows to make bpf_prog's refcounting
non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.

Validated compilation by running allyesconfig kernel build.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
2019-11-18 11:41:59 +01:00
Andrii Nakryiko
1e0bd5a091 bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails
92117d8443 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
(32k). Due to using 32-bit counter, it's possible in practice to overflow
refcounter and make it wrap around to 0, causing erroneous map free, while
there are still references to it, causing use-after-free problems.

But having a failing refcounting operations are problematic in some cases. One
example is mmap() interface. After establishing initial memory-mapping, user
is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
splitting it into multiple non-contiguous regions. All this happening without
any control from the users of mmap subsystem. Rather mmap subsystem sends
notifications to original creator of memory mapping through open/close
callbacks, which are optionally specified during initial memory mapping
creation. These callbacks are used to maintain accurate refcount for bpf_map
(see next patch in this series). The problem is that open() callback is not
supposed to fail, because memory-mapped resource is set up and properly
referenced. This is posing a problem for using memory-mapping with BPF maps.

One solution to this is to maintain separate refcount for just memory-mappings
and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
There are similar use cases in current work on tcp-bpf, necessitating extra
counter as well. This seems like a rather unfortunate and ugly solution that
doesn't scale well to various new use cases.

Another approach to solve this is to use non-failing refcount_t type, which
uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
stays there. This utlimately causes memory leak, but prevents use after free.

But given refcounting is not the most performance-critical operation with BPF
maps (it's not used from running BPF program code), we can also just switch to
64-bit counter that can't overflow in practice, potentially disadvantaging
32-bit platforms a tiny bit. This simplifies semantics and allows above
described scenarios to not worry about failing refcount increment operation.

In terms of struct bpf_map size, we are still good and use the same amount of
space:

BEFORE (3 cache lines, 8 bytes of padding at the end):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
	atomic_t                   usercnt;              /*   132     4 */
	struct work_struct work;                         /*   136    32 */
	char                       name[16];             /*   168    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 146, holes: 1, sum holes: 38 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

AFTER (same 3 cache lines, no extra padding now):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
	atomic64_t                 usercnt;              /*   136     8 */
	struct work_struct work;                         /*   144    32 */
	char                       name[16];             /*   176    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 154, holes: 1, sum holes: 38 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

This patch, while modifying all users of bpf_map_inc, also cleans up its
interface to match bpf_map_put with separate operations for bpf_map_inc and
bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
respectively). Also, given there are no users of bpf_map_inc_not_zero
specifying uref=true, remove uref flag and default to uref=false internally.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
2019-11-18 11:41:59 +01:00
David S. Miller
949610ddd0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-11-15

The following pull-request contains BPF updates for your *net* tree.

We've added 1 non-merge commits during the last 9 day(s) which contain
a total of 1 file changed, 3 insertions(+), 1 deletion(-).

The main changes are:

1) Fix a missing unlock of bpf_devs_lock in bpf_offload_dev_create()'s
    error path, from Dan.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-17 10:23:49 -08:00
Linus Torvalds
cbb104f91d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc scheduler fixes from Ingo Molnar:

 - Fix potential deadlock under CONFIG_DEBUG_OBJECTS=y

 - PELT metrics update ordering fix

 - uclamp logic fix

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/uclamp: Fix incorrect condition
  sched/pelt: Fix update of blocked PELT ordering
  sched/core: Avoid spurious lock dependencies
2019-11-17 08:30:38 -08:00
Valentin Schneider
7763baace1 sched/uclamp: Fix overzealous type replacement
Some uclamp helpers had their return type changed from 'unsigned int' to
'enum uclamp_id' by commit

  0413d7f33e ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")

but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
range, which should really be unsigned int. The affected helpers are
uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.

Note that this doesn't lead to any obj diff using a relatively recent
aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
anything funny. I'm still marking this as fixing the above commit to be on
the safe side.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: patrick.bellasi@matbug.net
Cc: qperret@google.com
Cc: surenb@google.com
Cc: tj@kernel.org
Fixes: 0413d7f33e ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-17 10:46:05 +01:00
David S. Miller
19b7e21c55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Lots of overlapping changes and parallel additions, stuff
like that.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16 21:51:42 -08:00
Linus Torvalds
3278b3b678 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
 "Fix integer truncation bug in __do_adjtimex()"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ntp/y2038: Remove incorrect time_t truncation
2019-11-16 16:08:46 -08:00
Linus Torvalds
5ffaf037e7 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc fixes: a handful of AUX event handling related fixes, a Sparse
  fix and two ABI fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix missing static inline on perf_cgroup_switch()
  perf/core: Consistently fail fork on allocation failures
  perf/aux: Disallow aux_output for kernel events
  perf/core: Reattach a misplaced comment
  perf/aux: Fix the aux_output group inheritance fix
  perf/core: Disallow uncore-cgroup events
2019-11-16 15:56:01 -08:00
Adrian Reber
49cb2fc42c fork: extend clone3() to support setting a PID
The main motivation to add set_tid to clone3() is CRIU.

To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.

Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).

This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.

The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.

To create a process with the following PIDs:

      PID NS level         Requested PID
        0 (host)              31496
        1                        42
        2                         1

For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid[2] = 31496;
  set_tid_size = 3;

If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:

  set_tid[0] = 1;
  set_tid[1] = 42;
  set_tid_size = 2;

The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.

The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.

set_tid_size cannot be larger then the current PID namespace level.

Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-11-15 23:49:22 +01:00
Alexei Starovoitov
5b92a28aae bpf: Support attaching tracing BPF program to other BPF programs
Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
including their subprograms. This feature allows snooping on input and output
packets in XDP, TC programs including their return values. In order to do that
the verifier needs to track types not only of vmlinux, but types of other BPF
programs as well. The verifier also needs to translate uapi/linux/bpf.h types
used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
BPF programs. In some cases LLVM optimizations can remove arguments from BPF
subprograms without adjusting BTF info that LLVM backend knows. When BTF info
disagrees with actual types that the verifiers sees the BPF trampoline has to
fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT
program can still attach to such subprograms, but it won't be able to recognize
pointer types like 'struct sk_buff *' and it won't be able to pass them to
bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
would need to use bpf_probe_read_kernel() instead.

The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set
to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd
points to previously loaded BPF program the attach_btf_id is BTF type id of
main function or one of its subprograms.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
2019-11-15 23:45:24 +01:00
Alexei Starovoitov
8c1b6e69dc bpf: Compare BTF types of functions arguments with actual types
Make the verifier check that BTF types of function arguments match actual types
passed into top-level BPF program and into BPF-to-BPF calls. If types match
such BPF programs and sub-programs will have full support of BPF trampoline. If
types mismatch the trampoline has to be conservative. It has to save/restore
five program arguments and assume 64-bit scalars.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org
2019-11-15 23:45:02 +01:00
Alexei Starovoitov
91cc1a9974 bpf: Annotate context types
Annotate BPF program context types with program-side type and kernel-side type.
This type information is used by the verifier. btf_get_prog_ctx_type() is
used in the later patches to verify that BTF type of ctx in BPF program matches to
kernel expected ctx type. For example, the XDP program type is:
BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff)
That means that XDP program should be written as:
int xdp_prog(struct xdp_md *ctx) { ... }

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org
2019-11-15 23:44:48 +01:00
Alexei Starovoitov
9cc31b3a09 bpf: Fix race in btf_resolve_helper_id()
btf_resolve_helper_id() caching logic is a bit racy, since under root the
verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE.
Fix the type as well, since error is also recorded.

Fixes: a7658e1a41 ("bpf: Check types of arguments passed into helpers")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org
2019-11-15 23:44:20 +01:00
Alexei Starovoitov
fec56f5890 bpf: Introduce BPF trampoline
Introduce BPF trampoline concept to allow kernel code to call into BPF programs
with practically zero overhead.  The trampoline generation logic is
architecture dependent.  It's converting native calling convention into BPF
calling convention.  BPF ISA is 64-bit (even on 32-bit architectures). The
registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
program accepts only single argument "ctx" in R1. Whereas CPU native calling
convention is different. x86-64 is passing first 6 arguments in registers
and the rest on the stack. x86-32 is passing first 3 arguments in registers.
sparc64 is passing first 6 in registers. And so on.

The trampolines between BPF and kernel already exist.  BPF_CALL_x macros in
include/linux/filter.h statically compile trampolines from BPF into kernel
helpers. They convert up to five u64 arguments into kernel C pointers and
integers. On 64-bit architectures this BPF_to_kernel trampolines are nops. On
32-bit architecture they're meaningful.

The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and
__bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
kernel function arguments into array of u64s that BPF program consumes via
R1=ctx pointer.

This patch set is doing the same job as __bpf_trace_##call() static
trampolines, but dynamically for any kernel function. There are ~22k global
kernel functions that are attachable via nop at function entry. The function
arguments and types are described in BTF.  The job of btf_distill_func_proto()
function is to extract useful information from BTF into "function model" that
architecture dependent trampoline generators will use to generate assembly code
to cast kernel function arguments into array of u64s.  For example the kernel
function eth_type_trans has two pointers. They will be casted to u64 and stored
into stack of generated trampoline. The pointer to that stack space will be
passed into BPF program in R1. On x86-64 such generated trampoline will consume
16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will
make sure that only two u64 are accessed read-only by BPF program. The verifier
will also recognize the precise type of the pointers being accessed and will
not allow typecasting of the pointer to a different type within BPF program.

The tracing use case in the datacenter demonstrated that certain key kernel
functions have (like tcp_retransmit_skb) have 2 or more kprobes that are always
active.  Other functions have both kprobe and kretprobe.  So it is essential to
keep both kernel code and BPF programs executing at maximum speed. Hence
generated BPF trampoline is re-generated every time new program is attached or
detached to maintain maximum performance.

To avoid the high cost of retpoline the attached BPF programs are called
directly. __bpf_prog_enter/exit() are used to support per-program execution
stats.  In the future this logic will be optimized further by adding support
for bpf_stats_enabled_key inside generated assembly code. Introduction of
preemptible and sleepable BPF programs will completely remove the need to call
to __bpf_prog_enter/exit().

Detach of a BPF program from the trampoline should not fail. To avoid memory
allocation in detach path the half of the page is used as a reserve and flipped
after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
which is enough for BPF tracing use cases. This limit can be increased in the
future.

BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
kprobe BPF program remembers function arguments in a map while kretprobe
fetches arguments from a map and analyzes them together with return value.
BPF_TRACE_FEXIT accelerates this typical use case.

Recursion prevention for kprobe BPF programs is done via per-cpu
bpf_prog_active counter. In practice that turned out to be a mistake. It
caused programs to randomly skip execution. The tracing tools missed results
they were looking for. Hence BPF trampoline doesn't provide builtin recursion
prevention. It's a job of BPF program itself and will be addressed in the
follow up patches.

BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
in the future. For example to remove retpoline cost from XDP programs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
2019-11-15 23:41:51 +01:00
Alexei Starovoitov
5964b2000f bpf: Add bpf_arch_text_poke() helper
Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch
nops/calls in kernel text into calls into BPF trampoline and to patch
calls/nops inside BPF programs too.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org
2019-11-15 23:41:28 +01:00
Ilya Leoshkevich
b7b3fc8dd9 bpf: Support doubleword alignment in bpf_jit_binary_alloc
Currently passing alignment greater than 4 to bpf_jit_binary_alloc does
not work: in such cases it silently aligns only to 4 bytes.

On s390, in order to load a constant from memory in a large (>512k) BPF
program, one must use lgrl instruction, whose memory operand must be
aligned on an 8-byte boundary.

This patch makes it possible to request 8-byte alignment from
bpf_jit_binary_alloc, and also makes it issue a warning when an
unsupported alignment is requested.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191115123722.58462-1-iii@linux.ibm.com
2019-11-15 22:25:00 +01:00
Sebastian Andrzej Siewior
49e9d1a9fa workqueue: Add RCU annotation for pwq list walk
An additional check has been recently added to ensure that a RCU related lock
is held while the RCU list is iterated.
The `pwqs' are sometimes iterated without a RCU lock but with the &wq->mutex
acquired leading to a warning.

Teach list_for_each_entry_rcu() that the RCU usage is okay if &wq->mutex
is acquired during the list traversal.

Fixes: 28875945ba ("rcu: Add support for consolidated-RCU reader checking")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-15 11:53:35 -08:00
Steven Rostedt (VMware)
128161f47b ftrace: Add helper find_direct_entry() to consolidate code
Both unregister_ftrace_direct() and modify_ftrace_direct() needs to
normalize the ip passed in to match the rec->ip, as it is acceptable to have
the ip on the ftrace call site but not the start. There are also common
validity checks with the record found by the ip, these should be done for
both unregister_ftrace_direct() and modify_ftrace_direct().

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:25:08 -05:00
Steven Rostedt (VMware)
406acdd32d ftrace: Add another check for match in register_ftrace_direct()
As an instruction pointer passed into register_ftrace_direct() may just
exist on the ftrace call site, but may not be the start of the call site
itself, register_ftrace_direct() still needs to update test if a direct call
exists on the normalized site, as only one direct call is allowed at any one
time.

Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:24:46 -05:00
Steven Rostedt (VMware)
1c7f9b673d ftrace: Fix accounting bug with direct->count in register_ftrace_direct()
The direct->count wasn't being updated properly, where it only was updated
when the first entry was added, but should be updated every time.

Fixes: 013bf0da04 ("ftrace: Add ftrace_find_direct_func()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 14:23:55 -05:00
Yang Tao
ca16d5bee5 futex: Prevent robust futex exit race
Robust futexes utilize the robust_list mechanism to allow the kernel to
release futexes which are held when a task exits. The exit can be voluntary
or caused by a signal or fault. This prevents that waiters block forever.

The futex operations in user space store a pointer to the futex they are
either locking or unlocking in the op_pending member of the per task robust
list.

After a lock operation has succeeded the futex is queued in the robust list
linked list and the op_pending pointer is cleared.

After an unlock operation has succeeded the futex is removed from the
robust list linked list and the op_pending pointer is cleared.

The robust list exit code checks for the pending operation and any futex
which is queued in the linked list. It carefully checks whether the futex
value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
tries to wake up a potential waiter.

This is race free for the lock operation but unlock has two race scenarios
where waiters might not be woken up. These issues can be observed with
regular robust pthread mutexes. PI aware pthread mutexes are not affected.

(1) Unlocking task is killed after unlocking the futex value in user space
    before being able to wake a waiter.

        pthread_mutex_unlock()
                |
                V
        atomic_exchange_rel (&mutex->__data.__lock, 0)
                        <------------------------killed
            lll_futex_wake ()                   |
                                                |
                                                |(__lock = 0)
                                                |(enter kernel)
                                                |
                                                V
                                            do_exit()
                                            exit_mm()
                                          mm_release()
                                        exit_robust_list()
                                        handle_futex_death()
                                                |
                                                |(__lock = 0)
                                                |(uval = 0)
                                                |
                                                V
        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                return 0;

    The sanity check which ensures that the user space futex is owned by
    the exiting task prevents the wakeup of waiters which in consequence
    block infinitely.

(2) Waiting task is killed after a wakeup and before it can acquire the
    futex in user space.

        OWNER                         WAITER
				futex_wait()      		
   pthread_mutex_unlock()               |
                |                       |
                |(__lock = 0)           |
                |                       |
                V                       |
         futex_wake() ------------>  wakeup()
                                        |
                                        |(return to userspace)
                                        |(__lock = 0)
                                        |
                                        V
                        oldval = mutex->__data.__lock
                                          <-----------------killed
    atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,  |
                        id | assume_other_futex_waiters, 0)      |
                                                                 |
                                                                 |
                                                   (enter kernel)|
                                                                 |
                                                                 V
                                                         do_exit()
                                                        |
                                                        |
                                                        V
                                        handle_futex_death()
                                        |
                                        |(__lock = 0)
                                        |(uval = 0)
                                        |
                                        V
        if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
                return 0;

    The sanity check which ensures that the user space futex is owned
    by the exiting task prevents the wakeup of waiters, which seems to
    be correct as the exiting task does not own the futex value, but
    the consequence is that other waiters wont be woken up and block
    infinitely.

In both scenarios the following conditions are true:

   - task->robust_list->list_op_pending != NULL
   - user space futex value == 0
   - Regular futex (not PI)

If these conditions are met then it is reasonably safe to wake up a
potential waiter in order to prevent the above problems.

As this might be a false positive it can cause spurious wakeups, but the
waiter side has to handle other types of unrelated wakeups, e.g. signals
gracefully anyway. So such a spurious wakeup will not affect the
correctness of these operations.

This workaround must not touch the user space futex value and cannot set
the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
OWNER_DIED in this case would result in inconsistent state and subsequently
in malfunction of the owner died handling in user space.

The rest of the user space state is still consistent as no other task can
observe the list_op_pending entry in the exiting tasks robust list.

The eventually woken up waiter will observe the uncontended lock value and
take it over.

[ tglx: Massaged changelog and comment. Made the return explicit and not
  	depend on the subsequent check and added constants to hand into
  	handle_futex_death() instead of plain numbers. Fixed a few coding
	style issues. ]

Fixes: 0771dfefc9 ("[PATCH] lightweight robust futexes: core")
Signed-off-by: Yang Tao <yang.tao172@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
2019-11-15 19:10:49 +01:00
Linus Torvalds
b4c0800e42 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs fixes from Al Viro:
 "Assorted fixes all over the place; some of that is -stable fodder,
  some regressions from the last window"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
  ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
  ecryptfs: fix unlink and rmdir in face of underlying fs modifications
  audit_get_nd(): don't unlock parent too early
  exportfs_decode_fh(): negative pinned may become positive without the parent locked
  cgroup: don't put ERR_PTR() into fc->root
  autofs: fix a leak in autofs_expire_indirect()
  aio: Fix io_pgetevents() struct __compat_aio_sigset layout
  fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15 08:44:08 -08:00
Artem Bityutskiy
58a74a2925 tracing: Increase SYNTH_FIELDS_MAX for synthetic_events
Increase the maximum allowed count of synthetic event fields from 16 to 32
in order to allow for larger-than-usual events.

Link: http://lkml.kernel.org/r/20191115091730.9192-1-dedekind1@gmail.com

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-15 11:30:29 -05:00
Arnd Bergmann
942437c97f y2038: allow disabling time32 system calls
At the moment, the compilation of the old time32 system calls depends
purely on the architecture. As systems with new libc based on 64-bit
time_t are getting deployed, even architectures that previously supported
these (notably x86-32 and arm32 but also many others) no longer depend on
them, and removing them from a kernel image results in a smaller kernel
binary, the same way we can leave out many other optional system calls.

More importantly, on an embedded system that needs to keep working
beyond year 2038, any user space program calling these system calls
is likely a bug, so removing them from the kernel image does provide
an extra debugging help for finding broken applications.

I've gone back and forth on hiding this option unless CONFIG_EXPERT
is set. This version leaves it visible based on the logic that
eventually it will be turned off indefinitely.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
bd40a17576 y2038: itimer: change implementation to timespec64
There is no 64-bit version of getitimer/setitimer since that is not
actually needed. However, the implementation is built around the
deprecated 'struct timeval' type.

Change the code to use timespec64 internally to reduce the dependencies
on timeval and associated helper functions.

Minor adjustments in the code are needed to make the native and compat
version work the same way, and to keep the range check working after
the conversion.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
ddbc7d0657 y2038: move itimer reset into itimer.c
Preparing for a change to the itimer internals, stop using the
do_setitimer() symbol and instead use a new higher-level interface.

The do_getitimer()/do_setitimer functions can now be made static,
allowing the compiler to potentially produce better object code.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
4c22ea2b91 y2038: use compat_{get,set}_itimer on alpha
The itimer handling for the old alpha osf_setitimer/osf_getitimer
system calls is identical to the compat version of getitimer/setitimer,
so just use those directly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
c1745f84be y2038: itimer: compat handling to itimer.c
The structure is only used in one place, moving it there simplifies the
interface and helps with later changes to this code.

Rename it to match the other time32 structures in the process.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
5e0fb1b57b y2038: time: avoid timespec usage in settimeofday()
The compat_get_timeval() and timeval_valid() interfaces are deprecated
and getting removed along with the definition of struct timeval itself.

Change the two implementations of the settimeofday() system call to
open-code these helpers and completely avoid references to timeval.

The timeval_valid() call is not needed any more here, only a check to
avoid overflowing tv_nsec during the multiplication, as there is another
range check in do_sys_settimeofday64().

Tested-by: syzbot+dccce9b26ba09ca49966@syzkaller.appspotmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:30 +01:00
Arnd Bergmann
75d319c06e y2038: syscalls: change remaining timeval to __kernel_old_timeval
All of the remaining syscalls that pass a timeval (gettimeofday, utime,
futimesat) can trivially be changed to pass a __kernel_old_timeval
instead, which has a compatible layout, but avoids ambiguity with
the timeval type in user space.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
bdd565f817 y2038: rusage: use __kernel_old_timeval
There are two 'struct timeval' fields in 'struct rusage'.

Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.

While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.

In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.

Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
2a785996cc y2038: uapi: change __kernel_time_t to __kernel_old_time_t
This is mainly a patch for clarification, and to let us remove
the time_t definition from the kernel to prevent new users from
creeping in that might not be y2038-safe.

All remaining uses of 'time_t' or '__kernel_time_t' are part of
the user API that cannot be changed by that either have a
replacement or that do not suffer from the y2038 overflow.

Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:29 +01:00
Arnd Bergmann
3ca47e958a y2038: remove CONFIG_64BIT_TIME
The CONFIG_64BIT_TIME option is defined on all architectures, and can
be removed for simplicity now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-11-15 14:38:27 +01:00
Like Xu
52ba4b0b99 perf/core: Provide a kernel-internal interface to pause perf_event
Exporting perf_event_pause() as an external accessor for kernel users (such
as KVM) who may do both disable perf_event and read count with just one
time to hold perf_event_ctx_lock. Also the value could be reset optionally.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 11:44:07 +01:00
Like Xu
3ca270fc9e perf/core: Provide a kernel-internal interface to recalibrate event period
Currently, perf_event_period() is used by user tools via ioctl. Based on
naming convention, exporting perf_event_period() for kernel users (such
as KVM) who may recalibrate the event period for their assigned counter
according to their requirements.

The perf_event_period() is an external accessor, just like the
perf_event_{en,dis}able() and should thus use perf_event_ctx_lock().

Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-11-15 11:44:06 +01:00
Qais Yousef
6e1ff0773f sched/uclamp: Fix incorrect condition
uclamp_update_active() should perform the update when
p->uclamp[clamp_id].active is true. But when the logic was inverted in
[1], the if condition wasn't inverted correctly too.

[1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/

Reported-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Patrick Bellasi <patrick.bellasi@matbug.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: babbe170e0 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-15 11:02:18 +01:00
Steven Rostedt (VMware)
0567d68091 ftrace: Add modify_ftrace_direct()
Add a new function modify_ftrace_direct() that will allow a user to update
an existing direct caller to a new trampoline, without missing hits due to
unregistering one and then adding another.

Link: https://lore.kernel.org/r/20191109022907.6zzo6orhxpt5n2sv@ast-mbp.dhcp.thefacebook.com

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 22:45:47 -05:00
Steven Rostedt (VMware)
36b3615dc3 tracing: Add missing "inline" in stub function of latency_fsnotify()
The latency_fsnotify() stub when the function is not defined, was missing
the "inline".

Link: https://lore.kernel.org/r/20191115140213.74c5efe7@canb.auug.org.au

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 22:45:47 -05:00
Tejun Heo
d749534322 cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
743210386c ("cgroup: use cgrp->kn->id as the cgroup ID") added WARN
which triggers if cgroup_id(root_cgrp) is not 1.  This is fine on
64bit ino archs but on 32bit archs cgroup ID is ((gen << 32) | ino)
and gen starts at 1, so the root id is 0x1_0000_0001 instead of 1
always triggering the WARN.

What we wanna make sure is that the ino part is 1.  Fix it.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes: 743210386c ("cgroup: use cgrp->kn->id as the cgroup ID")
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-11-14 14:46:51 -08:00
Borislav Petkov
9b4712044d tracing: Remove stray tab in TRACE_EVAL_MAP_FILE's help text
There was a stray tab in the help text of the aforementioned config
option which showed like this:

  The "print fmt" of the trace events will show the enum/sizeof names
  instead        of their values. This can cause problems for user space tools
  ...

in menuconfig. Remove it and end a sentence with a fullstop.

No functional changes.

Link: http://lkml.kernel.org/r/20191112174219.10933-1-bp@alien8.de

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:12 -05:00
Piotr Maziarz
ef56e047b2 tracing: Use seq_buf_hex_dump() to dump buffers
Without this, buffers can be printed with __print_array macro that has
no formatting options and can be hard to read. The other way is to
mimic formatting capability with multiple calls of trace event with one
call per row which gives performance impact and different timestamp in
each row.

Link: http://lkml.kernel.org/r/1573130738-29390-2-git-send-email-piotrx.maziarz@linux.intel.com

Signed-off-by: Piotr Maziarz <piotrx.maziarz@linux.intel.com>
Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:12 -05:00
Masami Hiramatsu
c7411a1a12 tracing/kprobe: Check whether the non-suffixed symbol is notrace
Check whether the non-suffixed symbol is notrace, since suffixed
symbols are generated by the compilers for optimization. Based on
these suffixed symbols, notrace check might not work because
some of them are just a partial code of the original function.
(e.g. cold-cache (unlikely) code is separated from original
 function as FUNCTION.cold.XX)

For example, without this fix,
  # echo p device_add.cold.67 > /sys/kernel/debug/tracing/kprobe_events
  sh: write error: Invalid argument

  # cat /sys/kernel/debug/tracing/error_log
  [  135.491035] trace_kprobe: error: Failed to register probe event
    Command: p device_add.cold.67
               ^
  # dmesg | tail -n 1
  [  135.488599] trace_kprobe: Could not probe notrace function device_add.cold.67

With this,
  # echo p device_add.cold.66 > /sys/kernel/debug/tracing/kprobe_events
  # cat /sys/kernel/debug/kprobes/list
  ffffffff81599de9  k  device_add.cold.66+0x0    [DISABLED]

Actually, kprobe blacklist already did similar thing,
see within_kprobe_blacklist().

Link: http://lkml.kernel.org/r/157233790394.6706.18243942030937189679.stgit@devnote2

Fixes: 45408c4f92 ("tracing: kprobes: Prohibit probing on notrace function")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:12 -05:00
Yuming Han
6ee40511cb tracing: use kvcalloc for tgid_map array allocation
Fail to allocate memory for tgid_map, because it requires order-6 page.
detail as:

c3 sh: page allocation failure: order:6,
   mode:0x140c0c0(GFP_KERNEL), nodemask=(null)
c3 sh cpuset=/ mems_allowed=0
c3 CPU: 3 PID: 5632 Comm: sh Tainted: G        W  O    4.14.133+ #10
c3 Hardware name: Generic DT based system
c3 Backtrace:
c3 [<c010bdbc>] (dump_backtrace) from [<c010c08c>](show_stack+0x18/0x1c)
c3 [<c010c074>] (show_stack) from [<c0993c54>](dump_stack+0x84/0xa4)
c3 [<c0993bd0>] (dump_stack) from [<c0229858>](warn_alloc+0xc4/0x19c)
c3 [<c0229798>] (warn_alloc) from [<c022a6e4>](__alloc_pages_nodemask+0xd18/0xf28)
c3 [<c02299cc>] (__alloc_pages_nodemask) from [<c0248344>](kmalloc_order+0x20/0x38)
c3 [<c0248324>] (kmalloc_order) from [<c0248380>](kmalloc_order_trace+0x24/0x108)
c3 [<c024835c>] (kmalloc_order_trace) from [<c01e6078>](set_tracer_flag+0xb0/0x158)
c3 [<c01e5fc8>] (set_tracer_flag) from [<c01e6404>](trace_options_core_write+0x7c/0xcc)
c3 [<c01e6388>] (trace_options_core_write) from [<c0278b1c>](__vfs_write+0x40/0x14c)
c3 [<c0278adc>] (__vfs_write) from [<c0278e10>](vfs_write+0xc4/0x198)
c3 [<c0278d4c>] (vfs_write) from [<c027906c>](SyS_write+0x6c/0xd0)
c3 [<c0279000>] (SyS_write) from [<c01079a0>](ret_fast_syscall+0x0/0x54)

Switch to use kvcalloc to avoid unexpected allocation failures.

Link: http://lkml.kernel.org/r/1571888070-24425-1-git-send-email-chunyan.zhang@unisoc.com

Signed-off-by: Yuming Han <yuming.han@unisoc.com>
Signed-off-by: Chunyan Zhang <chunyan.zhang@unisoc.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:11 -05:00
Srivatsa S. Bhat (VMware)
0c3c86bdc6 tracing/hwlat: Fix a few trivial nits
Update the source file name in the comments, and fix a grammatical
error.

Link: http://lkml.kernel.org/r/157073346821.17189.8946944856026592247.stgit@srivatsa-ubuntu

Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:11 -05:00
Andy Shevchenko
80042c8f06 tracing: Use generic type for comparator function
Comparator function type, cmp_func_t, is defined in the types.h,
use it in the code.

Link: http://lkml.kernel.org/r/20191007135656.37734-3-andriy.shevchenko@linux.intel.com

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:15:11 -05:00
Steven Rostedt (VMware)
b83b43ffc6 fgraph: Fix function type mismatches of ftrace_graph_return using ftrace_stub
The C compiler is allowing more checks to make sure that function pointers
are assigned to the correct prototype function. Unfortunately, the function
graph tracer uses a special name with its assigned ftrace_graph_return
function pointer that maps to a stub function used by the function tracer
(ftrace_stub). The ftrace_graph_return variable is compared to the
ftrace_stub in some archs to know if the function graph tracer is enabled or
not. This means we can not just simply create a new function stub that
compares it without modifying all the archs.

Instead, have the linker script create a function_graph_stub that maps to
ftrace_stub, and this way we can define the prototype for it to match the
prototype of ftrace_graph_return, and make the compiler checks all happy!

Link: http://lkml.kernel.org/r/20191015090055.789a0aed@gandalf.local.home

Cc: linux-sh@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc:  Rich Felker <dalias@libc.org>
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-14 13:13:43 -05:00
Divya Indi
953ae45a0c tracing: Adding NULL checks for trace_array descriptor pointer
As part of commit f45d1225ad ("tracing: Kernel access to Ftrace
instances") we exported certain functions. Here, we are adding some additional
NULL checks to ensure safe usage by users of these APIs.

Link: http://lkml.kernel.org/r/1565805327-579-4-git-send-email-divya.indi@oracle.com

Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Divya Indi
e585e6469d tracing: Verify if trace array exists before destroying it.
A trace array can be destroyed from userspace or kernel. Verify if the
trace array exists before proceeding to destroy/remove it.

Link: http://lkml.kernel.org/r/1565805327-579-3-git-send-email-divya.indi@oracle.com

Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Divya Indi <divya.indi@oracle.com>
[ Removed unneeded braces ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Divya Indi
2d6425af61 tracing: Declare newly exported APIs in include/linux/trace.h
Declare the newly introduced and exported APIs in the header file -
include/linux/trace.h. Moving previous declarations from
kernel/trace/trace.h to include/linux/trace.h.

Link: http://lkml.kernel.org/r/1565805327-579-2-git-send-email-divya.indi@oracle.com

Signed-off-by: Divya Indi <divya.indi@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Ben Dooks
6dff4d7dd3 tracing: Make internal ftrace events static
The event_class_ftrace_##call and event_##call do not seem
to be used outside of trace_export.c so make them both static
to avoid a number of sparse warnings:

kernel/trace/trace_entries.h:59:1: warning: symbol 'event_class_ftrace_function' was not declared. Should it be static?
kernel/trace/trace_entries.h:59:1: warning: symbol '__event_function' was not declared. Should it be static?
kernel/trace/trace_entries.h:77:1: warning: symbol 'event_class_ftrace_funcgraph_entry' was not declared. Should it be static?
kernel/trace/trace_entries.h:77:1: warning: symbol '__event_funcgraph_entry' was not declared. Should it be static?
kernel/trace/trace_entries.h:93:1: warning: symbol 'event_class_ftrace_funcgraph_exit' was not declared. Should it be static?
kernel/trace/trace_entries.h:93:1: warning: symbol '__event_funcgraph_exit' was not declared. Should it be static?
kernel/trace/trace_entries.h:129:1: warning: symbol 'event_class_ftrace_context_switch' was not declared. Should it be static?
kernel/trace/trace_entries.h:129:1: warning: symbol '__event_context_switch' was not declared. Should it be static?
kernel/trace/trace_entries.h:149:1: warning: symbol 'event_class_ftrace_wakeup' was not declared. Should it be static?
kernel/trace/trace_entries.h:149:1: warning: symbol '__event_wakeup' was not declared. Should it be static?
kernel/trace/trace_entries.h:171:1: warning: symbol 'event_class_ftrace_kernel_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:171:1: warning: symbol '__event_kernel_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:191:1: warning: symbol 'event_class_ftrace_user_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:191:1: warning: symbol '__event_user_stack' was not declared. Should it be static?
kernel/trace/trace_entries.h:214:1: warning: symbol 'event_class_ftrace_bprint' was not declared. Should it be static?
kernel/trace/trace_entries.h:214:1: warning: symbol '__event_bprint' was not declared. Should it be static?
kernel/trace/trace_entries.h:230:1: warning: symbol 'event_class_ftrace_print' was not declared. Should it be static?
kernel/trace/trace_entries.h:230:1: warning: symbol '__event_print' was not declared. Should it be static?
kernel/trace/trace_entries.h:247:1: warning: symbol 'event_class_ftrace_raw_data' was not declared. Should it be static?
kernel/trace/trace_entries.h:247:1: warning: symbol '__event_raw_data' was not declared. Should it be static?
kernel/trace/trace_entries.h:262:1: warning: symbol 'event_class_ftrace_bputs' was not declared. Should it be static?
kernel/trace/trace_entries.h:262:1: warning: symbol '__event_bputs' was not declared. Should it be static?
kernel/trace/trace_entries.h:277:1: warning: symbol 'event_class_ftrace_mmiotrace_rw' was not declared. Should it be static?
kernel/trace/trace_entries.h:277:1: warning: symbol '__event_mmiotrace_rw' was not declared. Should it be static?
kernel/trace/trace_entries.h:298:1: warning: symbol 'event_class_ftrace_mmiotrace_map' was not declared. Should it be static?
kernel/trace/trace_entries.h:298:1: warning: symbol '__event_mmiotrace_map' was not declared. Should it be static?
kernel/trace/trace_entries.h:322:1: warning: symbol 'event_class_ftrace_branch' was not declared. Should it be static?
kernel/trace/trace_entries.h:322:1: warning: symbol '__event_branch' was not declared. Should it be static?
kernel/trace/trace_entries.h:343:1: warning: symbol 'event_class_ftrace_hwlat' was not declared. Should it be static?
kernel/trace/trace_entries.h:343:1: warning: symbol '__event_hwlat' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191015121012.18824-1-ben.dooks@codethink.co.uk

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:29 -05:00
Sebastian Andrzej Siewior
9c34fc4b7e tracing: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Add additional header output for PREEMPT_RT.
Link: http://lkml.kernel.org/r/20191015191821.11479-34-bigeasy@linutronix.de

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Viktor Rosendahl (BMW)
793937236d preemptirq_delay_test: Add the burst feature and a sysfs trigger
This burst feature enables the user to generate a burst of
preempt/irqsoff latencies. This makes it possible to test whether we
are able to detect latencies that systematically occur very close to
each other.

The maximum burst size is 10. We also create 10 identical test
functions, so that we get 10 different backtraces; this is useful
when we want to test whether we can detect all the latencies in a
burst. Otherwise, there would be no easy way of differentiating
between which latency in a burst was captured by the tracer.

In addition, there is a sysfs trigger, so that it's not necessary to
reload the module to repeat the test. The trigger will appear as
/sys/kernel/preemptirq_delay_test/trigger in sysfs.

Link: http://lkml.kernel.org/r/20191008220824.7911-3-viktor.rosendahl@gmail.com

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Viktor Rosendahl (BMW) <viktor.rosendahl@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Viktor Rosendahl (BMW)
91edde2e6a ftrace: Implement fs notification for tracing_max_latency
This patch implements the feature that the tracing_max_latency file,
e.g. /sys/kernel/debug/tracing/tracing_max_latency will receive
notifications through the fsnotify framework when a new latency is
available.

One particularly interesting use of this facility is when enabling
threshold tracing, through /sys/kernel/debug/tracing/tracing_thresh,
together with the preempt/irqsoff tracers. This makes it possible to
implement a user space program that can, with equal probability,
obtain traces of latencies that occur immediately after each other in
spite of the fact that the preempt/irqsoff tracers operate in overwrite
mode.

This facility works with the hwlat, preempt/irqsoff, and wakeup
tracers.

The tracers may call the latency_fsnotify() from places such as
__schedule() or do_idle(); this makes it impossible to call
queue_work() directly without risking a deadlock. The same would
happen with a softirq,  kernel thread or tasklet. For this reason we
use the irq_work mechanism to call queue_work().

This patch creates a new workqueue. The reason for doing this is that
I wanted to use the WQ_UNBOUND and WQ_HIGHPRI flags.  My thinking was
that WQ_UNBOUND might help with the latency in some important cases.

If we use:

queue_work(system_highpri_wq, &tr->fsnotify_work);

then the work will (almost) always execute on the same CPU but if we are
unlucky that CPU could be too busy while there could be another CPU in
the system that would be able to process the work soon enough.

queue_work_on() could be used to queue the work on another CPU but it
seems difficult to select the right CPU.

Link: http://lkml.kernel.org/r/20191008220824.7911-2-viktor.rosendahl@gmail.com

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Viktor Rosendahl (BMW) <viktor.rosendahl@gmail.com>
[ Added max() to have one compare for max latency ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Steven Rostedt (VMware)
da537f0aef ftrace: Add information on number of page groups allocated
Looking for ways to shrink the size of the dyn_ftrace structure, knowing the
information about how many pages and the number of groups of those pages, is
useful in working out the best ways to save on memory.

This adds one info print on how many groups of pages were used to allocate
the ftrace dyn_ftrace structures, and also shows the number of pages and
groups in the dyn_ftrace_total_info (which is used for debugging).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:37:28 -05:00
Steven Rostedt (VMware)
a3ad1a7e39 ftrace/x86: Add a counter to test function_graph with direct
As testing for direct calls from the function graph tracer adds a little
overhead (which is a lot when tracing every function), add a counter that
can be used to test if function_graph tracer needs to test for a direct
caller or not.

It would have been nicer if we could use a static branch, but the static
branch logic fails when used within the function graph tracer trampoline.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:36:49 -05:00
Steven Rostedt (VMware)
013bf0da04 ftrace: Add ftrace_find_direct_func()
As function_graph tracer modifies the return address to insert a trampoline
to trace the return of a function, it must be aware of a direct caller, as
when it gets called, the function's return address may not be at on the
stack where it expects. It may have to see if that return address points to
the a direct caller and adjust if it is.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:36:48 -05:00
Steven Rostedt (VMware)
763e34e74b ftrace: Add register_ftrace_direct()
Add the start of the functionality to allow other trampolines to use the
ftrace mcount/fentry/nop location. This adds two new functions:

 register_ftrace_direct() and unregister_ftrace_direct()

Both take two parameters: the first is the instruction address of where the
mcount/fentry/nop exists, and the second is the trampoline to have that
location called.

This will handle cases where ftrace is already used on that same location,
and will make it still work, where the registered direct called trampoline
will get called after all the registered ftrace callers are handled.

Currently, it will not allow for IP_MODIFY functions to be called at the
same locations, which include some kprobes and live kernel patching.

At this point, no architecture supports this. This is only the start of
implementing the framework.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-11-13 09:36:41 -05:00