Commit Graph

35322 Commits

Author SHA1 Message Date
Shakeel Butt
991e767385 mm: memcontrol: account kernel stack per node
Currently the kernel stack is being accounted per-zone.  There is no need
to do that.  In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB.  Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item.  In addition localize the kernel stack stats updates to
account_kernel_stack().

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:25 -07:00
Roman Gushchin
d42f3245c7 mm: memcg: convert vmstat slab counters to bytes
In order to prepare for per-object slab memory accounting, convert
NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally global and per-node counters are stored in pages, however memcg
and lruvec counters are stored in bytes.  This scheme may look weird, but
only for now.  As soon as slab pages will be shared between multiple
cgroups, global and node counters will reflect the total number of slab
pages.  However memcg and lruvec counters will be used for per-memcg slab
memory tracking, which will take separate kernel objects in the account.
Keeping global and node counters in pages helps to avoid additional
overhead.

The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it
will fit into atomic_long_t we use for vmstats.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:24 -07:00
Ilias Stamatis
4ca1085c95 kthread: remove incorrect comment in kthread_create_on_cpu()
Originally kthread_create_on_cpu() parked and woke up the new thread.
However, since commit a65d40961d ("kthread/smpboot: do not park in
kthread_create_on_cpu()") this is no longer the case.  This patch removes
the comment that has been left behind and is now incorrect / stale.

Fixes: a65d40961d ("kthread/smpboot: do not park in kthread_create_on_cpu()")
Signed-off-by: Ilias Stamatis <stamatis.iliass@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: http://lkml.kernel.org/r/20200611135920.240551-1-stamatis.iliass@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:21 -07:00
Peter Zijlstra
38cf307c1f mm: fix kthread_use_mm() vs TLB invalidate
For SMP systems using IPI based TLB invalidation, looking at
current->active_mm is entirely reasonable.  This then presents the
following race condition:

  CPU0			CPU1

  flush_tlb_mm(mm)	use_mm(mm)
    <send-IPI>
			  tsk->active_mm = mm;
			  <IPI>
			    if (tsk->active_mm == mm)
			      // flush TLBs
			  </IPI>
			  switch_mm(old_mm,mm,tsk);

Where it is possible the IPI flushed the TLBs for @old_mm, not @mm,
because the IPI lands before we actually switched.

Avoid this by disabling IRQs across changing ->active_mm and
switch_mm().

Of the (SMP) architectures that have IPI based TLB invalidate:

  Alpha    - checks active_mm
  ARC      - ASID specific
  IA64     - checks active_mm
  MIPS     - ASID specific flush
  OpenRISC - shoots down world
  PARISC   - shoots down world
  SH       - ASID specific
  SPARC    - ASID specific
  x86      - N/A
  xtensa   - checks active_mm

So at the very least Alpha, IA64 and Xtensa are suspect.

On top of this, for scheduler consistency we need at least preemption
disabled across changing tsk->mm and doing switch_mm(), which is
currently provided by task_lock(), but that's not sufficient for
PREEMPT_RT.

[akpm@linux-foundation.org: add comment]

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jann Horn <jannh@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200721154106.GE10769@hirez.programming.kicks-ass.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:21 -07:00
Kees Cook
11990a5bd7 module: Correctly truncate sysfs sections output
The only-root-readable /sys/module/$module/sections/$section files
did not truncate their output to the available buffer size. While most
paths into the kernfs read handlers end up using PAGE_SIZE buffers,
it's possible to get there through other paths (e.g. splice, sendfile).
Actually limit the output to the "count" passed into the read function,
and report it back correctly. *sigh*

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/20200805002015.GE23458@shao2-debian
Fixes: ed66f991bb ("module: Refactor section attr into bin attribute")
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-08-07 10:49:47 -07:00
Linus Torvalds
25d8d4eeca powerpc updates for 5.9
- Add support for (optionally) using queued spinlocks & rwlocks.
 
  - Support for a new faster system call ABI using the scv instruction on Power9
    or later.
 
  - Drop support for the PROT_SAO mmap/mprotect flag as it will be unsupported on
    Power10 and future processors, leaving us with no way to implement the
    functionality it requests. This risks breaking userspace, though we believe
    it is unused in practice.
 
  - A bug fix for, and then the removal of, our custom stack expansion checking.
    We now allow stack expansion up to the rlimit, like other architectures.
 
  - Remove the remnants of our (previously disabled) topology update code, which
    tried to react to NUMA layout changes on virtualised systems, but was prone
    to crashes and other problems.
 
  - Add PMU support for Power10 CPUs.
 
  - A change to our signal trampoline so that we don't unbalance the link stack
    (branch return predictor) in the signal delivery path.
 
  - Lots of other cleanups, refactorings, smaller features and so on as usual.
 
 Thanks to:
   Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey Kardashevskiy,
   Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton
   Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan S, Bharata B Rao, Bill
   Wendling, Bin Meng, Cédric Le Goater, Chris Packham, Christophe Leroy,
   Christoph Hellwig, Daniel Axtens, Dan Williams, David Lamparter, Desnes A.
   Nunes do Rosario, Erhard F., Finn Thain, Frederic Barrat, Ganesh Goudar,
   Gautham R. Shenoy, Geoff Levand, Greg Kurz, Gustavo A. R. Silva, Hari Bathini,
   Harish, Imre Kaloz, Joel Stanley, Joe Perches, John Crispin, Jordan Niethe,
   Kajol Jain, Kamalesh Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li
   RongQing, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal
   Suchanek, Milton Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan
   Chancellor, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver
   O'Halloran, Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe
   Bergheaud, Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
   Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
   Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar Dronamraju,
   Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza Cascardo, Thiago Jung
   Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov, Wei Yongjun, Wen Xiong,
   YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl8tOxATHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDQfEAClXHWf6hnxB84bEu39D51NkVotL1IG
 BRWFvyix+xHuUkHIouBPAAMl6ngY5X6wkYd+Z+CY9zHNtdSDoVlJE30YXdMQA/dE
 L/rYxR1884yGR/uU/3wusboO68ReXwcKQPmKOymUfh0zH7ujyJsSWLpXFK1YDC5d
 2TVVTi0Q+P5ucMHDh0L+AHirIxZvtZSp43+J7xLtywsj+XAxJWCTGo5WCJbdgbCA
 Qbv3aOkVyUa3EgsbdM/STPpv82ebqT+PHxeSIO4Jw6ZODtKRH0R5YsWCApuY9eZ+
 ebY9RLmgv9ZAhJqB2fv9A5NDcMoGpZNmjM7HrWpXwULKQpkBGHCzJ9FcSdHVMOx8
 nbVMFjt4uzLwV1w8lFYslQ2tNH/uH2o9BlryV1RLpiiKokDAJO/NOsWN9y0u/I4J
 EmAM5DSX2LgVvvas96IlGK8KX4xkOkf8FLX/H5UDvvAfloH8J4CZXk/CWCab/nqY
 KEHPnMmYvQZ1w9SzyZg9sO/1p6Bl1Gmm75Jv2F1lBiRW/42VcGBI/qLsJ4lC59Fc
 KbwufYNYYG38wbxDLW1HAPJhRonxIcaZj3EEqk7aTiLZ55nNbu8e2k32CpNXTGqt
 npOhzJHimcq7L6+878ZW+xpbZwogIEUdRSsmwb6aT8za3ShnYwSA2Q3LYxh9xyGH
 j3GifvPq6Efp3Q==
 =QMY1
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Add support for (optionally) using queued spinlocks & rwlocks.

 - Support for a new faster system call ABI using the scv instruction on
   Power9 or later.

 - Drop support for the PROT_SAO mmap/mprotect flag as it will be
   unsupported on Power10 and future processors, leaving us with no way
   to implement the functionality it requests. This risks breaking
   userspace, though we believe it is unused in practice.

 - A bug fix for, and then the removal of, our custom stack expansion
   checking. We now allow stack expansion up to the rlimit, like other
   architectures.

 - Remove the remnants of our (previously disabled) topology update
   code, which tried to react to NUMA layout changes on virtualised
   systems, but was prone to crashes and other problems.

 - Add PMU support for Power10 CPUs.

 - A change to our signal trampoline so that we don't unbalance the link
   stack (branch return predictor) in the signal delivery path.

 - Lots of other cleanups, refactorings, smaller features and so on as
   usual.

Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey
Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju
T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan
S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris
Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan
Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn
Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand,
Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel
Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh
Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan
Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton
Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran,
Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud,
Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar
Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza
Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov,
Wei Yongjun, Wen Xiong, YueHaibing.

* tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits)
  selftests/powerpc: Fix pkey syscall redefinitions
  powerpc: Fix circular dependency between percpu.h and mmu.h
  powerpc/powernv/sriov: Fix use of uninitialised variable
  selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs
  powerpc/40x: Fix assembler warning about r0
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  cpuidle: pseries: Fixup exit latency for CEDE(0)
  cpuidle: pseries: Add function to parse extended CEDE records
  cpuidle: pseries: Set the latency-hint before entering CEDE
  selftests/powerpc: Fix online CPU selection
  powerpc/perf: Consolidate perf_callchain_user_[64|32]()
  powerpc/pseries/hotplug-cpu: Remove double free in error path
  powerpc/pseries/mobility: Add pr_debug() for device tree changes
  powerpc/pseries/mobility: Set pr_fmt()
  powerpc/cacheinfo: Warn if cache object chain becomes unordered
  powerpc/cacheinfo: Improve diagnostics about malformed cache lists
  powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
  powerpc/cacheinfo: Set pr_fmt()
  powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
  ...
2020-08-07 10:33:50 -07:00
Randy Dunlap
b8c1a30907 bpf: Delete repeated words in comments
Drop repeated words in kernel/bpf/: {has, the}

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200807033141.10437-1-rdunlap@infradead.org
2020-08-07 18:57:24 +02:00
Linus Torvalds
19b39c38ab Merge branch 'work.regset' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ptrace regset updates from Al Viro:
 "Internal regset API changes:

   - regularize copy_regset_{to,from}_user() callers

   - switch to saner calling conventions for ->get()

   - kill user_regset_copyout()

  The ->put() side of things will have to wait for the next cycle,
  unfortunately.

  The balance is about -1KLoC and replacements for ->get() instances are
  a lot saner"

* 'work.regset' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits)
  regset: kill user_regset_copyout{,_zero}()
  regset(): kill ->get_size()
  regset: kill ->get()
  csky: switch to ->regset_get()
  xtensa: switch to ->regset_get()
  parisc: switch to ->regset_get()
  nds32: switch to ->regset_get()
  nios2: switch to ->regset_get()
  hexagon: switch to ->regset_get()
  h8300: switch to ->regset_get()
  openrisc: switch to ->regset_get()
  riscv: switch to ->regset_get()
  c6x: switch to ->regset_get()
  ia64: switch to ->regset_get()
  arc: switch to ->regset_get()
  arm: switch to ->regset_get()
  sh: convert to ->regset_get()
  arm64: switch to ->regset_get()
  mips: switch to ->regset_get()
  sparc: switch to ->regset_get()
  ...
2020-08-07 09:29:25 -07:00
Linus Torvalds
eb65405eb6 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl8qeCkACgkQnJ2qBz9k
 QNlAGQf/YVruyVLZ7kCv6EMCHauXm3K1lEGpbXsTW04HpStxGx7mtLGN/Au+EYJR
 VnRkCMt6TSMQGMBkNF83dUCwXHkeL1rd6frJBLVOErkg50nUuD4kjTVw9Lzw9itx
 CPhKnPPlsRkDkZPxkg3WEdqPgzJREWBZUaB38QUPjYN46q7HfPYDANTh5wI1GiGs
 27+PvzlttjhkQpQ14pYU/nu4xf/nmgmmHhgfsJArQP2EzYOrKxsWKhXS5uPdtNlf
 mXiZMaqW2AlyDGlw3myOEySrrSuaR77M2bzDo7mjqffI9wSVTytKEhtg0i8OMWmv
 pZ38OQobznnFoqzc1GL70IE0DEU48g==
 =d81d
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:

 - fanotify fix for softlockups when there are many queued events

 - performance improvement to reduce fsnotify overhead when not used

 - Amir's implementation of fanotify events with names. With these you
   can now efficiently monitor whole filesystem, eg to mirror changes to
   another machine.

* tag 'fsnotify_for_v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
  fanotify: compare fsid when merging name event
  fsnotify: create method handle_inode_event() in fsnotify_operations
  fanotify: report parent fid + child fid
  fanotify: report parent fid + name + child fid
  fanotify: add support for FAN_REPORT_NAME
  fanotify: report events with parent dir fid to sb/mount/non-dir marks
  fanotify: add basic support for FAN_REPORT_DIR_FID
  fsnotify: remove check that source dentry is positive
  fsnotify: send event with parent/name info to sb/mount/non-dir marks
  audit: do not set FS_EVENT_ON_CHILD in audit marks mask
  inotify: do not set FS_EVENT_ON_CHILD in non-dir mark mask
  fsnotify: pass dir and inode arguments to fsnotify()
  fsnotify: create helper fsnotify_inode()
  fsnotify: send event to parent and child with single callback
  inotify: report both events on parent and child with single callback
  dnotify: report both events on parent and child with single callback
  fanotify: no external fh buffer in fanotify_name_event
  fanotify: use struct fanotify_info to parcel the variable size buffer
  fsnotify: add object type "child" to object type iterator
  fanotify: use FAN_EVENT_ON_CHILD as implicit flag on sb/mount/non-dir marks
  ...
2020-08-06 19:29:51 -07:00
Stanislav Fomichev
0d360d64b0 bpf: Remove inline from bpf_do_trace_printk
I get the following error during compilation on my side:
kernel/trace/bpf_trace.c: In function 'bpf_do_trace_printk':
kernel/trace/bpf_trace.c:386:34: error: function 'bpf_do_trace_printk' can never be inlined because it uses variable argument lists
 static inline __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...)
                                  ^

Fixes: ac5a72ea5c ("bpf: Use dedicated bpf_trace_printk event instead of trace_printk()")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200806182612.1390883-1-sdf@google.com
2020-08-06 16:53:17 -07:00
Yonghong Song
5e7b30205c bpf: Change uapi for bpf iterator map elements
Commit a5cbe05a66 ("bpf: Implement bpf iterator for
map elements") added bpf iterator support for
map elements. The map element bpf iterator requires
info to identify a particular map. In the above
commit, the attr->link_create.target_fd is used
to carry map_fd and an enum bpf_iter_link_info
is added to uapi to specify the target_fd actually
representing a map_fd:
    enum bpf_iter_link_info {
	BPF_ITER_LINK_UNSPEC = 0,
	BPF_ITER_LINK_MAP_FD = 1,

	MAX_BPF_ITER_LINK_INFO,
    };

This is an extensible approach as we can grow
enumerator for pid, cgroup_id, etc. and we can
unionize target_fd for pid, cgroup_id, etc.
But in the future, there are chances that
more complex customization may happen, e.g.,
for tasks, it could be filtered based on
both cgroup_id and user_id.

This patch changed the uapi to have fields
	__aligned_u64	iter_info;
	__u32		iter_info_len;
for additional iter_info for link_create.
The iter_info is defined as
	union bpf_iter_link_info {
		struct {
			__u32   map_fd;
		} map;
	};

So future extension for additional customization
will be easier. The bpf_iter_link_info will be
passed to target callback to validate and generic
bpf_iter framework does not need to deal it any
more.

Note that map_fd = 0 will be considered invalid
and -EBADF will be returned to user space.

Fixes: a5cbe05a66 ("bpf: Implement bpf iterator for map elements")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200805055056.1457463-1-yhs@fb.com
2020-08-06 16:39:14 -07:00
Lianbo Jiang
475f63ae63 kexec_file: Correctly output debugging information for the PT_LOAD ELF header
Currently, when we enable the debugging switch to debug kexec_file,
we always get the following incorrect results:

  kexec_file: Crash PT_LOAD elf header. phdr=00000000c988639b vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=51 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=000000003cca69a0 vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=52 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000c584cb9f vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=53 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000cf85d57f vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=54 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000a4a8f847 vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=55 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000272ec49f vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=56 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000ea0b65de vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=57 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=000000001f5e490c vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=58 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000dfe4109e vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=59 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000480ed2b6 vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=60 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=0000000080b65151 vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=61 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=0000000024e31c5e vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=62 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000332e0385 vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=63 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=000000002754d5da vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=64 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=00000000783320dd vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=65 p_offset=0x0
  kexec_file: Crash PT_LOAD elf header. phdr=0000000076fe5b64 vaddr=0x0, paddr=0x0, sz=0x0 e_phnum=66 p_offset=0x0

The reason is that kernel always prints the values of the next PT_LOAD
instead of the current PT_LOAD. Change it to ensure that we can get the
correct debugging information.

[ mingo: Amended changelog, capitalized "ELF". ]

Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Young <dyoung@redhat.com>
Link: https://lore.kernel.org/r/20200804044933.1973-4-lijiang@redhat.com
2020-08-07 01:32:00 +02:00
Lianbo Jiang
a2e9a95d21 kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges
The crash_exclude_mem_range() function can only handle one memory region a time.

It will fail in the case in which the passed in area covers several memory
regions. In this case, it will only exclude the first region, then return,
but leave the later regions unsolved.

E.g in a NEC system with two usable RAM regions inside the low 1M:

  ...
  BIOS-e820: [mem 0x0000000000000000-0x000000000003efff] usable
  BIOS-e820: [mem 0x000000000003f000-0x000000000003ffff] reserved
  BIOS-e820: [mem 0x0000000000040000-0x000000000009ffff] usable

It will only exclude the memory region [0, 0x3efff], the memory region
[0x40000, 0x9ffff] will still be added into /proc/vmcore, which may cause
the following failure when dumping vmcore:

 ioremap on RAM at 0x0000000000040000 - 0x0000000000040fff
 WARNING: CPU: 0 PID: 665 at arch/x86/mm/ioremap.c:186 __ioremap_caller+0x2c7/0x2e0
 ...
 RIP: 0010:__ioremap_caller+0x2c7/0x2e0
 ...
 cp: error reading '/proc/vmcore': Cannot allocate memory
 kdump: saving vmcore failed

In order to fix this bug, let's extend the crash_exclude_mem_range()
to handle the overlapping ranges.

[ mingo: Amended the changelog. ]

Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Young <dyoung@redhat.com>
Link: https://lore.kernel.org/r/20200804044933.1973-3-lijiang@redhat.com
2020-08-07 01:32:00 +02:00
Linus Torvalds
921d2597ab s390: implement diag318
x86:
 * Report last CPU for debugging
 * Emulate smaller MAXPHYADDR in the guest than in the host
 * .noinstr and tracing fixes from Thomas
 * nested SVM page table switching optimization and fixes
 
 Generic:
 * Unify shadow MMU cache data structures across architectures
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl8pC+oUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcOwgAjomqtEqQNlp7DdZT7VyyklzbxX1/
 ud7v+oOJ8K4sFlf64lSthjPo3N9rzZCcw+yOXmuyuITngXOGc3tzIwXpCzpLtuQ1
 WO1Ql3B/2dCi3lP5OMmsO1UAZqy9pKLg1dfeYUPk48P5+p7d/NPmk+Em5kIYzKm5
 JsaHfCp2EEXomwmljNJ8PQ1vTjIQSSzlgYUBZxmCkaaX7zbEUMtxAQCStHmt8B84
 33LczwXBm3viSWrzsoBV37I70+tseugiSGsCfUyupXOvq55d6D9FCqtCb45Hn4Vh
 Ik8ggKdalsk/reiGEwNw1/3nr6mRMkHSbl+Mhc4waOIFf9dn0urgQgOaDg==
 =YVx0
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "s390:
   - implement diag318

  x86:
   - Report last CPU for debugging
   - Emulate smaller MAXPHYADDR in the guest than in the host
   - .noinstr and tracing fixes from Thomas
   - nested SVM page table switching optimization and fixes

  Generic:
   - Unify shadow MMU cache data structures across architectures"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  KVM: SVM: Fix sev_pin_memory() error handling
  KVM: LAPIC: Set the TDCR settable bits
  KVM: x86: Specify max TDP level via kvm_configure_mmu()
  KVM: x86/mmu: Rename max_page_level to max_huge_page_level
  KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
  KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
  KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
  KVM: VMX: Make vmx_load_mmu_pgd() static
  KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
  KVM: VMX: Drop a duplicate declaration of construct_eptp()
  KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
  KVM: Using macros instead of magic values
  MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
  KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
  KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
  KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
  KVM: VMX: Add guest physical address check in EPT violation and misconfig
  KVM: VMX: introduce vmx_need_pf_intercept
  KVM: x86: update exception bitmap on CPUID changes
  KVM: x86: rename update_bp_intercept to update_exception_bitmap
  ...
2020-08-06 12:59:31 -07:00
Linus Torvalds
6d2b84a4e5 This tree adds the sched_set_fifo*() encapsulation APIs to remove
static priority level knowledge from non-scheduler code.
 
 The three APIs for non-scheduler code to set SCHED_FIFO are:
 
  - sched_set_fifo()
  - sched_set_fifo_low()
  - sched_set_normal()
 
 These are two FIFO priority levels: default (high), and a 'low' priority level,
 plus sched_set_normal() to set the policy back to non-SCHED_FIFO.
 
 Since the changes affect a lot of non-scheduler code, we kept this in a separate
 tree.
 
 When merging to the latest upstream tree there's a conflict in drivers/spi/spi.c,
 which can be resolved via:
 
 	sched_set_fifo(ctlr->kworker_task);
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8pPQIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j0Jw/+LlSyX6gD2ATy3cizGL7DFPZogD5MVKTb
 IXbhXH/ACpuPQlBe1+haRLbJj6XfXqbOlAleVKt7eh+jZ1jYjC972RCSTO4566mJ
 0v8Iy9kkEeb2TDbYx1H3bnk78lf85t0CB+sCzyKUYFuTrXU04eRj7MtN3vAQyRQU
 xJg83x/sT5DGdDTP50sL7lpbwk3INWkD0aDCJEaO/a9yHElMsTZiZBKoXxN/s30o
 FsfzW56jqtng771H2bo8ERN7+abwJg10crQU5mIaLhacNMETuz0NZ/f8fY/fydCL
 Ju8HAdNKNXyphWkAOmixQuyYtWKe2/GfbHg8hld0jmpwxkOSTgZjY+pFcv7/w306
 g2l1TPOt8e1n5jbfnY3eig+9Kr8y0qHkXPfLfgRqKwMMaOqTTYixEzj+NdxEIRX9
 Kr7oFAv6VEFfXGSpb5L1qyjIGVgQ5/JE/p3OC3GHEsw5VKiy5yjhNLoSmSGzdS61
 1YurVvypSEUAn3DqTXgeGX76f0HH365fIKqmbFrUWxliF+YyflMhtrj2JFtejGzH
 Md3RgAzxusE9S6k3gw1ev4byh167bPBbY8jz0w3Gd7IBRKy9vo92h6ZRYIl6xeoC
 BU2To1IhCAydIr6hNsIiCSDTgiLbsYQzPuVVovUxNh+l1ZvKV2X+csEHhs8oW4pr
 4BRU7dKL2NE=
 =/7JH
 -----END PGP SIGNATURE-----

Merge tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull sched/fifo updates from Ingo Molnar:
 "This adds the sched_set_fifo*() encapsulation APIs to remove static
  priority level knowledge from non-scheduler code.

  The three APIs for non-scheduler code to set SCHED_FIFO are:

   - sched_set_fifo()
   - sched_set_fifo_low()
   - sched_set_normal()

  These are two FIFO priority levels: default (high), and a 'low'
  priority level, plus sched_set_normal() to set the policy back to
  non-SCHED_FIFO.

  Since the changes affect a lot of non-scheduler code, we kept this in
  a separate tree"

* tag 'sched-fifo-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  sched,tracing: Convert to sched_set_fifo()
  sched: Remove sched_set_*() return value
  sched: Remove sched_setscheduler*() EXPORTs
  sched,psi: Convert to sched_set_fifo_low()
  sched,rcutorture: Convert to sched_set_fifo_low()
  sched,rcuperf: Convert to sched_set_fifo_low()
  sched,locktorture: Convert to sched_set_fifo()
  sched,irq: Convert to sched_set_fifo()
  sched,watchdog: Convert to sched_set_fifo()
  sched,serial: Convert to sched_set_fifo()
  sched,powerclamp: Convert to sched_set_fifo()
  sched,ion: Convert to sched_set_normal()
  sched,powercap: Convert to sched_set_fifo*()
  sched,spi: Convert to sched_set_fifo*()
  sched,mmc: Convert to sched_set_fifo*()
  sched,ivtv: Convert to sched_set_fifo*()
  sched,drm/scheduler: Convert to sched_set_fifo*()
  sched,msm: Convert to sched_set_fifo*()
  sched,psci: Convert to sched_set_fifo*()
  sched,drbd: Convert to sched_set_fifo*()
  ...
2020-08-06 11:55:43 -07:00
Linus Torvalds
4cec929370 integrity-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEjSMCCC7+cjo3nszSa3kkZrA+cVoFAl8puJgUHHpvaGFyQGxp
 bnV4LmlibS5jb20ACgkQa3kkZrA+cVq47w//VDg2pTD+/fPadleRJkKVSPaKJu4k
 N/gAVPxhYpJVJ+BTZKMFzTjX3kjfQG7udjORzC+saEdii7W1EfJJqHabLEnihfxd
 VDUS0RQndMwOkioAAZOsy5dFE84wUOX8O1kq31Aw2G+QLCYhn1dNMg10j6SBM034
 cJbS59k3w+lyqFy/Fje8e7aO1xmc/83x9MfLgzZTscCZqzf1vIJY8onwfTxRVBpQ
 QS0AZJM+b0+9MlJxpzBYxZARwYb5cXBLh07W/vBFmJRh15n0e20uWM4YFkBixicX
 gi3LtXd/75hFIHgm6QqbwDJrrA45zOJs5YsOudCctWVAe5k5mV0H7ysJ6phcRI9E
 uQvBb7Z+0viQXis6Cjx4gYSYAcAJPcDrfcjR4itQSOj5anUFBvCju+Jr373S0Vn8
 3eXGyimRAc33vEFkI7RJNfExkGh7pkYWzcruk90bHD6dAKuki/tisIs7ZvhTuFOp
 eyWt7hbctqbt/gESop3zXjUDRJsX9GyAA4OvJwFGRfRJ4ziQ5w8LGc+VendSWald
 1zjkJxXAZLjDPQlYv2074PYeIguTbcDkjeRVxUD9mWvdi0tyXK+r2qC+PeX7Rs71
 y1aGIT/NX9qYI2H0xIm3ettztdIE8F1tnAn2ziNkQiXEzCrEqKtAAxxSErTQuB78
 LMgCDPF8y06ZjD8=
 =M/tq
 -----END PGP SIGNATURE-----

Merge tag 'integrity-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity updates from Mimi Zohar:
 "The nicest change is the IMA policy rule checking. The other changes
  include allowing the kexec boot cmdline line measure policy rules to
  be defined in terms of the inode associated with the kexec kernel
  image, making the IMA_APPRAISE_BOOTPARAM, which governs the IMA
  appraise mode (log, fix, enforce), a runtime decision based on the
  secure boot mode of the system, and including errno in the audit log"

* tag 'integrity-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: remove redundant initialization of variable ret
  ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime
  ima: AppArmor satisfies the audit rule requirements
  ima: Rename internal filter rule functions
  ima: Support additional conditionals in the KEXEC_CMDLINE hook function
  ima: Use the common function to detect LSM conditionals in a rule
  ima: Move comprehensive rule validation checks out of the token parser
  ima: Use correct type for the args_p member of ima_rule_entry.lsm elements
  ima: Shallow copy the args_p member of ima_rule_entry.lsm elements
  ima: Fail rule parsing when appraise_flag=blacklist is unsupportable
  ima: Fail rule parsing when the KEY_CHECK hook is combined with an invalid cond
  ima: Fail rule parsing when the KEXEC_CMDLINE hook is combined with an invalid cond
  ima: Fail rule parsing when buffer hook functions have an invalid action
  ima: Free the entire rule if it fails to parse
  ima: Free the entire rule when deleting a list of rules
  ima: Have the LSM free its audit rule
  IMA: Add audit log for failure conditions
  integrity: Add errno field in audit message
2020-08-06 11:35:57 -07:00
Thomas Gleixner
1fb497dd00 posix-cpu-timers: Provide mechanisms to defer timer handling to task_work
Running posix CPU timers in hard interrupt context has a few downsides:

 - For PREEMPT_RT it cannot work as the expiry code needs to take
   sighand lock, which is a 'sleeping spinlock' in RT. The original RT
   approach of offloading the posix CPU timer handling into a high
   priority thread was clumsy and provided no real benefit in general.

 - For fine grained accounting it's just wrong to run this in context of
   the timer interrupt because that way a process specific CPU time is
   accounted to the timer interrupt.

 - Long running timer interrupts caused by a large amount of expiring
   timers which can be created and armed by unpriviledged user space.

There is no hard requirement to expire them in interrupt context.

If the signal is targeted at the task itself then it won't be delivered
before the task returns to user space anyway. If the signal is targeted at
a supervisor process then it might be slightly delayed, but posix CPU
timers are inaccurate anyway due to the fact that they are tied to the
tick.

Provide infrastructure to schedule task work which allows splitting the
posix CPU timer code into a quick check in interrupt context and a thread
context expiry and signal delivery function. This has to be enabled by
architectures as it requires that the architecture specific KVM
implementation handles pending task work before exiting to guest mode.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
2020-08-06 16:50:59 +02:00
Thomas Gleixner
820903c784 posix-cpu-timers: Split run_posix_cpu_timers()
Split it up as a preparatory step to move the heavy lifting out of
interrupt context.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200730102337.677439437@linutronix.de
2020-08-06 16:50:59 +02:00
Muchun Song
10de795a5a kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
Fix compiler warning(as show below) for !CONFIG_KPROBES_ON_FTRACE.

kernel/kprobes.c: In function 'kill_kprobe':
kernel/kprobes.c:1116:33: warning: statement with no effect
[-Wunused-value]
 1116 | #define disarm_kprobe_ftrace(p) (-ENODEV)
      |                                 ^
kernel/kprobes.c:2154:3: note: in expansion of macro
'disarm_kprobe_ftrace'
 2154 |   disarm_kprobe_ftrace(p);

Link: https://lore.kernel.org/r/20200805142136.0331f7ea@canb.auug.org.au
Link: https://lkml.kernel.org/r/20200805172046.19066-1-songmuchun@bytedance.com

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 0cb2f1372b ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-06 09:16:27 -04:00
Alexey Budankov
45fd22da97 perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
Open access to per-process monitoring for CAP_PERFMON only
privileged processes [1]. Extend ptrace_may_access() check
in perf_events subsystem with perfmon_capable() to simplify
user experience and make monitoring more secure by reducing
attack surface.

[1] https://lore.kernel.org/lkml/7776fa40-6c65-2aa6-1322-eb3a01201000@linux.intel.com/

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/6e8392ff-4732-0012-2949-e1587709f0f6@linux.intel.com
2020-08-06 15:03:20 +02:00
Thomas Gleixner
19d0070a27 timekeeping/vsyscall: Provide vdso_update_begin/end()
Architectures can have the requirement to add additional architecture
specific data to the VDSO data page which needs to be updated independent
of the timekeeper updates.

To protect these updates vs. concurrent readers and a conflicting update
through timekeeping, provide helper functions to make such updates safe.

vdso_update_begin() takes the timekeeper_lock to protect against a
potential update from timekeeper code and increments the VDSO sequence
count to signal data inconsistency to concurrent readers. vdso_update_end()
makes the sequence count even again to signal data consistency and drops
the timekeeper lock.

[ Sven: Add interrupt disable handling to the functions ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200804150124.41692-3-svens@linux.ibm.com
2020-08-06 10:57:30 +02:00
Ingo Molnar
a703f3633f Merge branch 'WIP.locking/seqlocks' into locking/urgent
Pick up the full seqlock series PeterZ is working on.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-08-06 10:16:38 +02:00
Phil Auld
a1bd06853e sched: Fix use of count for nr_running tracepoint
The count field is meant to tell if an update to nr_running
is an add or a subtract. Make it do so by adding the missing
minus sign.

Fixes: 9d246053a6 ("sched: Add a tracepoint to track rq->nr_running")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200805203138.1411-1-pauld@redhat.com
2020-08-06 09:36:59 +02:00
Linus Torvalds
47ec5303d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

 2) Support UDP segmentation in code TSO code, from Eric Dumazet.

 3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

 4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

 6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

 8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

10) Switch qca8k driver over to phylink, from Jonathan McDowell.

11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

12) Several conversions over to generic power management, from Vaibhav
    Gupta.

13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

14) Various https url conversions, from Alexander A. Klimov.

15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

20) XDP support for xen-netfront, from Denis Kirjanov.

21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

22) Support EF100 chip in sfc driver, from Edward Cree.

23) Add XDP support to mvpp2 driver, from Matteo Croce.

24) Support MPTCP in sock_diag, from Paolo Abeni.

25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

30) Add rfc4884 support to icmp code, from Willem de Bruijn.

31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

34) Support TCP syncookies in MPTCP, from Flowian Westphal.

35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
    Brivio.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
  net: thunderx: initialize VF's mailbox mutex before first usage
  usb: hso: remove bogus check for EINPROGRESS
  usb: hso: no complaint about kmalloc failure
  hso: fix bailout in error case of probe
  ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
  selftests/net: relax cpu affinity requirement in msg_zerocopy test
  mptcp: be careful on subflow creation
  selftests: rtnetlink: make kci_test_encap() return sub-test result
  selftests: rtnetlink: correct the final return value for the test
  net: dsa: sja1105: use detected device id instead of DT one on mismatch
  tipc: set ub->ifindex for local ipv6 address
  ipv6: add ipv6_dev_find()
  net: openvswitch: silence suspicious RCU usage warning
  Revert "vxlan: fix tos value before xmit"
  ptp: only allow phase values lower than 1 period
  farsync: switch from 'pci_' to 'dma_' API
  wan: wanxl: switch from 'pci_' to 'dma_' API
  hv_netvsc: do not use VF device if link is down
  dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
  net: macb: Properly handle phylink on at91sam9x
  ...
2020-08-05 20:13:21 -07:00
Linus Torvalds
dd27111e32 Driver core changes for 5.9-rc1
Here is the "big" set of changes to the driver core, and some drivers
 using the changes, for 5.9-rc1.
 
 "Biggest" thing in here is the device link exposure in sysfs, to help
 to tame the madness that is SoC device tree representations and driver
 interactions with it.
 
 Other stuff in here that is interesting is:
 	- device probe log helper so that drivers can report problems in
 	  a unified way easier.
 	- devres functions added
 	- DEVICE_ATTR_ADMIN_* macro added to make it harder to write
 	  incorrect sysfs file permissions
 	- documentation cleanups
 	- ability for debugfs to be present in the kernel, yet not
 	  exposed to userspace.  Needed for systems that want it
 	  enabled, but do not trust users, so they can still use some
 	  kernel functions that were otherwise disabled.
 	- other minor fixes and cleanups
 
 The patches outside of drivers/base/ all have acks from the respective
 subsystem maintainers to go through this tree instead of theirs.
 
 All of these have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXylhOQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylGdACeKqxm8IIDZycj0QjLUlPiEwVIROgAnjpf5jAB
 mb4jMvgEGsB6/FwxypPG
 =RUss
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of changes to the driver core, and some drivers
  using the changes, for 5.9-rc1.

  "Biggest" thing in here is the device link exposure in sysfs, to help
  to tame the madness that is SoC device tree representations and driver
  interactions with it.

  Other stuff in here that is interesting is:

   - device probe log helper so that drivers can report problems in a
     unified way easier.

   - devres functions added

   - DEVICE_ATTR_ADMIN_* macro added to make it harder to write
     incorrect sysfs file permissions

   - documentation cleanups

   - ability for debugfs to be present in the kernel, yet not exposed to
     userspace. Needed for systems that want it enabled, but do not
     trust users, so they can still use some kernel functions that were
     otherwise disabled.

   - other minor fixes and cleanups

  The patches outside of drivers/base/ all have acks from the respective
  subsystem maintainers to go through this tree instead of theirs.

  All of these have been in linux-next with no reported issues"

* tag 'driver-core-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (39 commits)
  drm/bridge: lvds-codec: simplify error handling
  drm/bridge/sii8620: fix resource acquisition error handling
  driver core: add deferring probe reason to devices_deferred property
  driver core: add device probe log helper
  driver core: Avoid binding drivers to dead devices
  Revert "test_firmware: Test platform fw loading on non-EFI systems"
  firmware_loader: EFI firmware loader must handle pre-allocated buffer
  selftest/firmware: Add selftest timeout in settings
  test_firmware: Test platform fw loading on non-EFI systems
  driver core: Change delimiter in devlink device's name to "--"
  debugfs: Add access restriction option
  tracefs: Remove unnecessary debug_fs checks.
  driver core: Fix probe_count imbalance in really_probe()
  kobject: remove unused KOBJ_MAX action
  driver core: Fix sleeping in invalid context during device link deletion
  driver core: Add waiting_for_supplier sysfs file for devices
  driver core: Add state_synced sysfs file for devices that support it
  driver core: Expose device link details in sysfs
  driver core: Drop mention of obsolete bus rwsem from kernel-doc
  debugfs: file: Remove unnecessary cast in kfree()
  ...
2020-08-05 11:52:17 -07:00
Linus Torvalds
1785d11612 Char/Misc driver patches for 5.9-rc1
Here is the large set of char and misc and other driver subsystem
 patches for 5.9-rc1.  Lots of new driver submissions in here, and
 cleanups and features for existing drivers.
 
 Highlights are:
 	- habanalabs driver updates
 	- coresight driver updates
 	- nvmem driver updates
 	- huge number of "W=1" build warning cleanups from Lee Jones
 	- dyndbg updates
 	- virtbox driver fixes and updates
 	- soundwire driver updates
 	- mei driver updates
 	- phy driver updates
 	- fpga driver updates
 	- lots of smaller individual misc/char driver cleanups and fixes
 
 Full details are in the shortlog.
 
 All of these have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXylccQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymofgCfZ1CxNWd0ZVM0YIn8cY9gO6ON7MsAnRq48hvn
 Vjf4rKM73GC11bVF4Gyy
 =Xq1R
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the large set of char and misc and other driver subsystem
  patches for 5.9-rc1. Lots of new driver submissions in here, and
  cleanups and features for existing drivers.

  Highlights are:
   - habanalabs driver updates
   - coresight driver updates
   - nvmem driver updates
   - huge number of "W=1" build warning cleanups from Lee Jones
   - dyndbg updates
   - virtbox driver fixes and updates
   - soundwire driver updates
   - mei driver updates
   - phy driver updates
   - fpga driver updates
   - lots of smaller individual misc/char driver cleanups and fixes

  Full details are in the shortlog.

  All of these have been in linux-next with no reported issues"

* tag 'char-misc-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (322 commits)
  habanalabs: remove unused but set variable 'ctx_asid'
  nvmem: qcom-spmi-sdam: Enable multiple devices
  dt-bindings: nvmem: SID: add binding for A100's SID controller
  nvmem: update Kconfig description
  nvmem: qfprom: Add fuse blowing support
  dt-bindings: nvmem: Add properties needed for blowing fuses
  dt-bindings: nvmem: qfprom: Convert to yaml
  nvmem: qfprom: use NVMEM_DEVID_AUTO for multiple instances
  nvmem: core: add support to auto devid
  nvmem: core: Add nvmem_cell_read_u8()
  nvmem: core: Grammar fixes for help text
  nvmem: sc27xx: add sc2730 efuse support
  nvmem: Enforce nvmem stride in the sysfs interface
  MAINTAINERS: Add git tree for NVMEM FRAMEWORK
  nvmem: sprd: Fix return value of sprd_efuse_probe()
  drivers: android: Fix the SPDX comment style
  drivers: android: Fix a variable declaration coding style issue
  drivers: android: Remove braces for a single statement if-else block
  drivers: android: Remove the use of else after return
  drivers: android: Fix a variable declaration coding style issue
  ...
2020-08-05 11:43:47 -07:00
Christoph Hellwig
262e6ae708 modules: inherit TAINT_PROPRIETARY_MODULE
If a TAINT_PROPRIETARY_MODULE exports symbol, inherit the taint flag
for all modules importing these symbols, and don't allow loading
symbols from TAINT_PROPRIETARY_MODULE modules if the module previously
imported gplonly symbols.  Add a anti-circumvention devices so people
don't accidentally get themselves into trouble this way.

Comment from Greg:
  "Ah, the proven-to-be-illegal "GPL Condom" defense :)"

[jeyu: pr_info -> pr_err and pr_warn as per discussion]
Link: http://lore.kernel.org/r/20200730162957.GA22469@lst.de
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-05 10:31:28 +02:00
Linus Torvalds
2324d50d05 It's been a busy cycle for documentation - hopefully the busiest for a
while to come.  Changes include:
 
  - Some new Chinese translations
 
  - Progress on the battle against double words words and non-HTTPS URLs
 
  - Some block-mq documentation
 
  - More RST conversions from Mauro.  At this point, that task is
    essentially complete, so we shouldn't see this kind of churn again for a
    while.  Unless we decide to switch to asciidoc or something...:)
 
  - Lots of typo fixes, warning fixes, and more.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl8oVkwPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YoW8H/jJ/xnXFn7tkgVPQAlL3k5HCnK7A5nDP9RVR
 cg1pTx1cEFdjzxPlJyExU6/v+AImOvtweHXC+JDK7YcJ6XFUNYXJI3LxL5KwUXbY
 BL/xRFszDSXH2C7SJF5GECcFYp01e/FWSLN3yWAh+g+XwsKiTJ8q9+CoIDkHfPGO
 7oQsHKFu6s36Af0LfSgxk4sVB7EJbo8e4psuPsP5SUrl+oXRO43Put0rXkR4yJoH
 9oOaB51Do5fZp8I4JVAqGXvpXoExyLMO4yw0mASm6YSZ3KyjR8Fae+HD9Cq4ZuwY
 0uzb9K+9NEhqbfwtyBsi99S64/6Zo/MonwKwevZuhtsDTK4l4iU=
 =JQLZ
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.9' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It's been a busy cycle for documentation - hopefully the busiest for a
  while to come. Changes include:

   - Some new Chinese translations

   - Progress on the battle against double words words and non-HTTPS
     URLs

   - Some block-mq documentation

   - More RST conversions from Mauro. At this point, that task is
     essentially complete, so we shouldn't see this kind of churn again
     for a while. Unless we decide to switch to asciidoc or
     something...:)

   - Lots of typo fixes, warning fixes, and more"

* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
  scripts/kernel-doc: optionally treat warnings as errors
  docs: ia64: correct typo
  mailmap: add entry for <alobakin@marvell.com>
  doc/zh_CN: add cpu-load Chinese version
  Documentation/admin-guide: tainted-kernels: fix spelling mistake
  MAINTAINERS: adjust kprobes.rst entry to new location
  devices.txt: document rfkill allocation
  PCI: correct flag name
  docs: filesystems: vfs: correct flag name
  docs: filesystems: vfs: correct sync_mode flag names
  docs: path-lookup: markup fixes for emphasis
  docs: path-lookup: more markup fixes
  docs: path-lookup: fix HTML entity mojibake
  CREDITS: Replace HTTP links with HTTPS ones
  docs: process: Add an example for creating a fixes tag
  doc/zh_CN: add Chinese translation prefer section
  doc/zh_CN: add clearing-warn-once Chinese version
  doc/zh_CN: add admin-guide index
  doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
  futex: MAINTAINERS: Re-add selftests directory
  ...
2020-08-04 22:47:54 -07:00
Linus Torvalds
a754292348 Printk changes for 5.9
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl8pk84ACgkQUqAMR0iA
 lPIrTxAAhD6fosJx+7LCrDRABIw/ZybeS5MIxTuPsNtdMmGBemigew5Ao1wYY6Ww
 3BFiNC2LpDXPxSOCQpz0Zm5/oCLhShPJmS6ukjLbufDsiw0MezliKCAa2Bfw3W31
 6xntQtf7ps+bmTEQDyuznu8Kfg+I3lmdGUOEBBluHIP4gb7XKQE8ttyUHB6qdiXI
 3eAl53Q8dOMMjtk5eNBXA19JY43g4JmLZRBumrAUc1vsv15KTDmSyWKlV8+tLH9K
 JbQAHe0pNVec4sJUIYLvIwDZXvtsvxjdJyX3tTeZ7xJ/ARcvRLoixVGqWxKhqdth
 j5U/L+YQfCJifyqvEVo03yy4Ti+OraliRpGcRf/bM2HpmFBA2+dISr7/VEqRwkG7
 Sy8HuvBHHyUqdrPjB7izhv8iyRN+LxFfpdT5LMnzsvxMxAJ+QwNjxb13RA4kkeRU
 5SgOhfGWgTsLy71J6qdSeXYB2oPFw4Onp5yAtoUsOJVYqWkN9x0zdl+9HmqIHF7T
 dY+KNriEO6gmpsQrMR4FC/GVMtwYWf8AoqeZen5O5SQULmzuKQ5AkOo0IAMrU92i
 iAdFrSZj35HAQjIJRccPNGZ3FwTd1Z4r5GT7VRvrN+nq2wVopzbbz924/lmsGoAS
 YppAw31sKfXDc5uWE8jP8GP3OJqhORn2PPXq3D5Q3XSVbGgey0Q=
 =ZcMq
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Herbert Xu made printk header file self-contained.

 - Andy Shevchenko and Sergey Senozhatsky cleaned up console->setup()
   error handling.

 - Andy Shevchenko did some cleanups (e.g. sparse warning) in vsprintf
   code.

 - Minor documentation updates.

* tag 'printk-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  lib/vsprintf: Force type of flags value for gfp_t
  lib/vsprintf: Replace custom spec to print decimals with generic one
  lib/vsprintf: Replace hidden BUILD_BUG_ON() with static_assert()
  printk: Make linux/printk.h self-contained
  doc:kmsg: explicitly state the return value in case of SEEK_CUR
  Replace HTTP links with HTTPS ones: vsprintf
  hvc: unify console setup naming
  console: Fix trivia typo 'change' -> 'chance'
  console: Propagate error code from console ->setup()
  tty: hvc: Return proper error code from console ->setup() hook
  serial: sunzilog: Return proper error code from console ->setup() hook
  serial: sunsab: Return proper error code from console ->setup() hook
  mips: Return proper error code from console ->setup() hook
2020-08-04 22:22:25 -07:00
Linus Torvalds
3f0d6ecdf1 Generic implementation of common syscall, interrupt and exception
entry/exit functionality based on the recent X86 effort to ensure
 correctness of entry/exit vs. RCU and instrumentation.
 
 As this functionality and the required entry/exit sequences are not
 architecture specific, sharing them allows other architectures to benefit
 instead of copying the same code over and over again.
 
 This branch was kept standalone to allow others to work on it. The
 conversion of x86 comes in a seperate pull request which obviously is based
 on this branch.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8pCYsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoY1MD/9VNT5ehFZwDBxX8EUY7QcBAPiR1yql
 XgHVbfhUe9Zta4q6eXn1A6IGpperY+2TLdU1Gm0aVXGAZwt5WeM7mAMIGpOXqibK
 oRZcTGOdxovY/548H3EWmrPAeJRKtpGDOF9MqmDfSBI4PXPyu9oKTRbWtRztgZa2
 f8CALSXRCWRztZwI4xZKInC78p564Bz4x98wu/CbSZ7iTid/FIm4BcrH+eSbhLGt
 LUjKp74zDl4HncJUUCRv1RZmfiK4N0XwgfNLqHlkNu2ep1sJ92t4YuqyQC5acUUp
 L+fzlMdG1elFi5HlCmOTLrZIRerOyhqxfiWsfMiqapSvWdjW05HJ2AwyQbyhXMTt
 iLe8Rds0kcGGvCjt2X7S1mJFrPmV8QlrpQkOh9l/R5ekMsxG2jbzt7ZCbEASNtBp
 +riLLEQcl+IOej5zDAUUcdpWA8/ODlY9RZwv0vW9kR3v6SUtBdoS9YHSgbh5rgOt
 USEJwipyNLsD5tUWEIAZhw6moMzFFkO512O23bUgAwYKJx/KVYaBGWKq2nGLjqLc
 njqR3NX568/0ixPy3Vmhf3fde8Izp/CgK12gJxCj7sM77W8nvjD2IaqRsW2nK5Tk
 nD5yCLpolcl5vU8Bu0G9ln+jabKwbZHBOGFnqAUW0AKKv7jTkjILEoZbNVrd8MOG
 Sj/asNIIKw3LPg==
 =y2Ew
 -----END PGP SIGNATURE-----

Merge tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull generic kernel entry/exit code from Thomas Gleixner:
 "Generic implementation of common syscall, interrupt and exception
  entry/exit functionality based on the recent X86 effort to ensure
  correctness of entry/exit vs RCU and instrumentation.

  As this functionality and the required entry/exit sequences are not
  architecture specific, sharing them allows other architectures to
  benefit instead of copying the same code over and over again.

  This branch was kept standalone to allow others to work on it. The
  conversion of x86 comes in a seperate pull request which obviously is
  based on this branch"

* tag 'core-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Correct __secure_computing() stub
  entry: Correct 'noinstr' attributes
  entry: Provide infrastructure for work before transitioning to guest mode
  entry: Provide generic interrupt entry/exit code
  entry: Provide generic syscall exit function
  entry: Provide generic syscall entry functionality
  seccomp: Provide stub for __secure_computing()
2020-08-04 21:00:11 -07:00
Linus Torvalds
442489c219 Time, timers and related driver updates:
- Prevent unnecessary timer softirq invocations by extending the tracking
    of the next expiring timer in the timer wheel beyond the existing NOHZ
    functionality. The tracking overhead at enqueue time is within the
    noise, but on sensitive workloads the avoidance of the soft interrupt
    invocation is a measurable improvement.
 
  - The obligatory new clocksource driver for Ingenic X100 OST
 
  - The usual fixes, improvements, cleanups and extensions for newer chip
    variants all over the driver space.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8pD7ITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRIXD/9VRiGKHIP27O0aoPj9HGFiZyY+bXbC
 xv5HA9CTlJjG23JTZWg13Kk26l8+mzIJoH54nMnceVDdCwPb1e7iRFgefyHOgEW4
 oKpJnwqvGOA9cvAnu8Tl9oNNILUoS2k0dHDeGICMCOqqjycUoKGRPpiizsbXZ08x
 yOLUMktX0wtNnL6DOqOpvmfN+b3T8gO0fuNzgRcvcHZpamQxo7wN2P05mt9nmWLV
 zfEwyhn33Xy9toGPZfkbCYNzVSI3fkMXuMDIkLo5jOtt18i06AeUZov8Z0V7xk9B
 S1lu2HmP4PnX00/P7KB8LwtlhzhM/H7IxK4bxYJYlHmGcd2hJHjKdIfCg3bqo41d
 YmsIelukI3jLvnrB6YXyWx3mt1a8p/i3zf/+Fwqs81qV/60FXhp0zD2QnltJEEC3
 INXrb93CkC5vMqOs0otizL5cPnPhTS0fMe/GhnHlsteUXlqEeJ1HU5f+j0FFaIJA
 h+dEPT57eJwDyuh6iWNHjvAI/HtLSBTsHC0CPWa+DxHKxzItZWpiVl+EEw5ofepX
 zJyf8nxq1nOMDOROCiTxdbyp4yacDk3dak/trbRZCfX9fapSuzJFzDRCM0Ums2lH
 lh12jR9nRZgKb5atC31UUpw4HYZfvcbj2NGr27SAx9b3hh5q6SRW8yowL8tta1lK
 /Afs0OhmQS5Raw==
 =uJnp
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Time, timers and related driver updates:

   - Prevent unnecessary timer softirq invocations by extending the
     tracking of the next expiring timer in the timer wheel beyond the
     existing NOHZ functionality.

     The tracking overhead at enqueue time is within the noise, but on
     sensitive workloads the avoidance of the soft interrupt invocation
     is a measurable improvement.

   - The obligatory new clocksource driver for Ingenic X100 OST

   - The usual fixes, improvements, cleanups and extensions for newer
     chip variants all over the driver space"

* tag 'timers-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  timers: Recalculate next timer interrupt only when necessary
  clocksource/drivers/ingenic: Add support for the Ingenic X1000 OST.
  dt-bindings: timer: Add Ingenic X1000 OST bindings.
  clocksource/drivers: Replace HTTP links with HTTPS ones
  clocksource/drivers/nomadik-mtu: Handle 32kHz clock
  clocksource/drivers/sh_cmt: Use "kHz" for kilohertz
  clocksource/drivers/imx: Add support for i.MX TPM driver with ARM64
  clocksource/drivers/ingenic: Add high resolution timer support for SMP/SMT.
  timers: Lower base clock forwarding threshold
  timers: Remove must_forward_clk
  timers: Spare timer softirq until next expiry
  timers: Expand clk forward logic beyond nohz
  timers: Reuse next expiry cache after nohz exit
  timers: Always keep track of next expiry
  timers: Optimize _next_timer_interrupt() level iteration
  timers: Add comments about calc_index() ceiling work
  timers: Move trigger_dyntick_cpu() to enqueue_timer()
  timers: Use only bucket expiry for base->next_expiry value
  timers: Preserve higher bits of expiration on index calculation
  clocksource/drivers/timer-atmel-tcb: Add sama5d2 support
  ...
2020-08-04 18:17:37 -07:00
Linus Torvalds
f8b036a7fc The usual boring updates from the interrupt subsystem:
- Infrastructure to allow building irqchip drivers as modules
 
  - Consolidation of irqchip ACPI probing
 
  - Removal of the EOI-preflow interrupt handler which was required for
    SPARC support and became obsolete after SPARC was converted to
    use sparse interrupts.
 
  - Cleanups, fixes and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8pDL0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRTFEACYvH2LnSu1GlXB0XtL3+XyV8bWN3Yr
 Qfcp9JbIibx65YkJjcyvfBNA6GjXoogMr9vOHeRVnPtOwzl/7n/lnh/43d6+YPot
 7UvIjGtpH3E/lF0kJKfuEsM8CX8DcVhn6dV/T+dJ00m69dAVQHNRsVqAi1/iWEeT
 9vBBELoJL79BU2g83NQZ7V0UrqiA5QlPYLpbSffliE6UWjG6XTH2CPM5XucuySNQ
 es3szxQ55rtPEzqCHVL0YW75vV39bmKZPqoApA/XQDJrp3bgftjdldoTe7YPQfSG
 MXAvB+6axPD+mdeag7/XZFC1DcMx8CnistZSJKpdYZe7mQ7iunfeJRhkEzb+DrO1
 WdcDcYOm0rLHhPrUZItJdACjuPNmN9pMaK1PbabsivnHVWzMYYKmMwbW+AEsygGW
 nnlsZP1Nr61Mo7O8+EKmxDdox4Qjk3lmQl4SdQgUKNKsI5yFYjvt2CfCjWLQJNBa
 w7YiLnL9IChXwrvdGqMIoEueUi0pC3gGbZ/bjDbxI4NJxJgEEav49m/prxM2A2Pl
 gfNdwlM1xgNydIBgt/jij/a8Lmv555RuZmvDV7QV7fFwaIqt3Qb5cs0Roq+GlzZR
 e0wuikGl0r/Bdow62rle7EysbBBGosAYf6K/kaGhd8v/kx2ByDnPPWzOqtxc+K+i
 Iw/daEQRsSnWuw==
 =KA8b
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "The usual boring updates from the interrupt subsystem:

   - Infrastructure to allow building irqchip drivers as modules

   - Consolidation of irqchip ACPI probing

   - Removal of the EOI-preflow interrupt handler which was required for
     SPARC support and became obsolete after SPARC was converted to use
     sparse interrupts.

   - Cleanups, fixes and improvements all over the place"

* tag 'irq-core-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  irqchip/loongson-pch-pic: Fix the misused irq flow handler
  irqchip/loongson-htvec: Support 8 groups of HT vectors
  irqchip/loongson-liointc: Fix misuse of gc->mask_cache
  dt-bindings: interrupt-controller: Update Loongson HTVEC description
  irqchip/imx-intmux: Fix irqdata regs save in imx_intmux_runtime_suspend()
  irqchip/imx-intmux: Implement intmux runtime power management
  irqchip/gic-v4.1: Use GFP_ATOMIC flag in allocate_vpe_l1_table()
  irqchip: Fix IRQCHIP_PLATFORM_DRIVER_* compilation by including module.h
  irqchip/stm32-exti: Map direct event to irq parent
  irqchip/mtk-cirq: Convert to a platform driver
  irqchip/mtk-sysirq: Convert to a platform driver
  irqchip/qcom-pdc: Switch to using IRQCHIP_PLATFORM_DRIVER helper macros
  irqchip: Add IRQCHIP_PLATFORM_DRIVER_BEGIN/END and IRQCHIP_MATCH helper macros
  irqchip: irq-bcm2836.h: drop a duplicated word
  irqchip/gic-v4.1: Ensure accessing the correct RD when writing INVALLR
  irqchip/irq-bcm7038-l1: Guard uses of cpu_logical_map
  irqchip/gic-v3: Remove unused register definition
  irqchip/qcom-pdc: Allow QCOM_PDC to be loadable as a permanent module
  genirq: Export irq_chip_retrigger_hierarchy and irq_chip_set_vcpu_affinity_parent
  irqdomain: Export irq_domain_update_bus_token
  ...
2020-08-04 18:11:58 -07:00
Linus Torvalds
2ed90dbbf7 dma-mapping updates for 5.9
- make support for dma_ops optional
  - move more code out of line
  - add generic support for a dma_ops bypass mode
  - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl8oGscLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNfEhAAmFwd6BBHGwAhXUchoIue5vdNnuY3GiBFRzUdz67W
 zRYYgZYiPjl+MwflRmwPcoWEnGzmweRa2s6OnyDostiCRauioa8BuQfGqJasf1yZ
 D36dFNVHGW0o6pRDUQkd688k/4A6szwuwpq83qi4e8X2I9QzAITHtW8izjfPM923
 FlJzxEFggbB2TvwfUXOZhmpuG4Dog8S7VZ1Uz4QAg0Z/5FDqIKAAG2aZMqCXBbiX
 01E8tr0AqU/jn2xpc8O+DJGFiYIRhqhyNxQbH6qz1Q3xGFSokcLYm3YqkqVOgpn1
 DLs2UFDxWkly/F+wGnYtju7OD9VGPywzOcW125/LIsApYN5R/rYrtQzK41eq7Mp5
 HY3tqgNTIMdnl4so7QXeU4Vxj+lUdPlI26NZGszcM5AVftdTX8KjGdS+0+PBza6i
 i7trwG7J5/DnwiBCvEKoul7Ul1psUMTSvYwINTXRqsU4mZXhhx/mwyXbtruELnkj
 3agM98u6hoalLNjd2aueh+NjMZi1r+MchTrfRvTcxJ+yQ5BoR5kF+iz7eT/LtZ72
 AqWwimsPGNkLHUa0TrqWql5tv90cdDkBZzWXVbixwxRfgynWYLE6jugeIy8hwjFf
 GjO5XKbBwnWPjdSzFsVMPeuNpmr7ZjVHHewy2Q/jWQAIOyeof0VztEl23LN5yUkx
 pc8=
 =90UK
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - make support for dma_ops optional

 - move more code out of line

 - add generic support for a dma_ops bypass mode

 - misc cleanups

* tag 'dma-mapping-5.9' of git://git.infradead.org/users/hch/dma-mapping:
  dma-contiguous: cleanup dma_alloc_contiguous
  dma-debug: use named initializers for dir2name
  powerpc: use the generic dma_ops_bypass mode
  dma-mapping: add a dma_ops_bypass flag to struct device
  dma-mapping: make support for dma ops optional
  dma-mapping: inline the fast path dma-direct calls
  dma-mapping: move the remaining DMA API calls out of line
2020-08-04 17:29:57 -07:00
Steven Rostedt (VMware)
afcab63665 tracing: Use trace_sched_process_free() instead of exit() for pid tracing
On exit, if a process is preempted after the trace_sched_process_exit()
tracepoint but before the process is done exiting, then when it gets
scheduled in, the function tracers will not filter it properly against the
function tracing pid filters.

That is because the function tracing pid filters hooks to the
sched_process_exit() tracepoint to remove the exiting task's pid from the
filter list. Because the filtering happens at the sched_switch tracepoint,
when the exiting task schedules back in to finish up the exit, it will no
longer be in the function pid filtering tables.

This was noticeable in the notrace self tests on a preemptable kernel, as
the tests would fail as it exits and preempted after being taken off the
notrace filter table and on scheduling back in it would not be in the
notrace list, and then the ending of the exit function would trace. The test
detected this and would fail.

Cc: stable@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Fixes: 1e10486ffe ("ftrace: Add 'function-fork' trace option")
Fixes: c37775d578 ("tracing: Add infrastructure to allow set_event_pid to follow children"
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-04 20:15:07 -04:00
Linus Torvalds
4f30a60aa7 close-range-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygcpgAKCRCRxhvAZXjc
 ogPeAQDv1ncqtNroFAC4pJ4tQhH7JSjW0OltiMk/AocY/J2SdQD9GJ15luYJ0/om
 697q/Z68sndRynhdoZlMuf3oYuBlHQw=
 =3ZhE
 -----END PGP SIGNATURE-----

Merge tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull close_range() implementation from Christian Brauner:
 "This adds the close_range() syscall. It allows to efficiently close a
  range of file descriptors up to all file descriptors of a calling
  task.

  This is coordinated with the FreeBSD folks which have copied our
  version of this syscall and in the meantime have already merged it in
  April 2019:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

  The syscall originally came up in a discussion around the new mount
  API and making new file descriptor types cloexec by default. During
  this discussion, Al suggested the close_range() syscall.

  First, it helps to close all file descriptors of an exec()ing task.
  This can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

  The code snippet above is one way of working around the problem that
  file descriptors are not cloexec by default. This is aggravated by the
  fact that we can't just switch them over without massively regressing
  userspace. For a whole class of programs having an in-kernel method of
  closing all file descriptors is very helpful (e.g. demons, service
  managers, programming language standard libraries, container managers
  etc.).

  Second, it allows userspace to avoid implementing closing all file
  descriptors by parsing through /proc/<pid>/fd/* and calling close() on
  each file descriptor and other hacks. From looking at various
  large(ish) userspace code bases this or similar patterns are very
  common in service managers, container runtimes, and programming
  language runtimes/standard libraries such as Python or Rust.

  In addition, the syscall will also work for tasks that do not have
  procfs mounted and on kernels that do not have procfs support compiled
  in. In such situations the only way to make sure that all file
  descriptors are closed is to call close() on each file descriptor up
  to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.

  Based on Linus' suggestion close_range() also comes with a new flag
  CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
  right before exec. This would usually be expressed in the sequence:

        unshare(CLONE_FILES);
        close_range(3, ~0U);

  as pointed out by Linus it might be desirable to have this be a part
  of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
  gets especially handy when we're closing all file descriptors above a
  certain threshold.

  Test-suite as always included"

* tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add CLOSE_RANGE_UNSHARE tests
  close_range: add CLOSE_RANGE_UNSHARE
  tests: add close_range() tests
  arch: wire-up close_range()
  open: add close_range()
2020-08-04 15:12:02 -07:00
Linus Torvalds
74858abbb1 cap-checkpoint-restore-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygegQAKCRCRxhvAZXjc
 olWZAQCMPbhI/20LA3OYJ6s+BgBEnm89PymvlHcym6Z4AvTungD+KqZonIYuxWgi
 6Ttlv/fzgFFbXgJgbuass5mwFVoN5wM=
 =oK7d
 -----END PGP SIGNATURE-----

Merge tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull checkpoint-restore updates from Christian Brauner:
 "This enables unprivileged checkpoint/restore of processes.

  Given that this work has been going on for quite some time the first
  sentence in this summary is hopefully more exciting than the actual
  final code changes required. Unprivileged checkpoint/restore has seen
  a frequent increase in interest over the last two years and has thus
  been one of the main topics for the combined containers &
  checkpoint/restore microconference since at least 2018 (cf. [1]).

  Here are just the three most frequent use-cases that were brought forward:

   - The JVM developers are integrating checkpoint/restore into a Java
     VM to significantly decrease the startup time.

   - In high-performance computing environment a resource manager will
     typically be distributing jobs where users are always running as
     non-root. Long-running and "large" processes with significant
     startup times are supposed to be checkpointed and restored with
     CRIU.

   - Container migration as a non-root user.

  In all of these scenarios it is either desirable or required to run
  without CAP_SYS_ADMIN. The userspace implementation of
  checkpoint/restore CRIU already has the pull request for supporting
  unprivileged checkpoint/restore up (cf. [2]).

  To enable unprivileged checkpoint/restore a new dedicated capability
  CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
  discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
  "Update on Task Migration at Google Using CRIU") with Adrian and
  Nicolas providing the implementation now over the last months. In
  essence, this allows the CRIU binary to be installed with the
  CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
  unprivileged users to restore processes.

  To make this possible the following permissions are altered:

   - Selecting a specific PID via clone3() set_tid relaxed from userns
     CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
     from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.

   - Accessing /proc/pid/map_files relaxed from init userns
     CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.

   - Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
     CAP_CHECKPOINT_RESTORE.

  Of these four changes the /proc/self/exe change deserves a few words
  because the reasoning behind even restricting /proc/self/exe changes
  in the first place is just full of historical quirks and tracking this
  down was a questionable version of fun that I'd like to spare others.

  In short, it is trivial to change /proc/self/exe as an unprivileged
  user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
  or by simply intercepting the elf loader in userspace during exec.
  Nicolas was nice enough to even provide a POC for the latter (cf. [3])
  to illustrate this fact.

  The original patchset which introduced PR_SET_MM_MAP had no
  permissions around changing the exe link. They too argued that it is
  trivial to spoof the exe link already which is true. The argument
  brought up against this was that the Tomoyo LSM uses the exe link in
  tomoyo_manager() to detect whether the calling process is a policy
  manager. This caused changing the exe links to be guarded by userns
  CAP_SYS_ADMIN.

  All in all this rather seems like a "better guard it with something
  rather than nothing" argument which imho doesn't qualify as a great
  security policy. Again, because spoofing the exe link is possible for
  the calling process so even if this were security relevant it was
  broken back then and would be broken today. So technically, dropping
  all permissions around changing the exe link would probably be
  possible and would send a clearer message to any userspace that relies
  on /proc/self/exe for security reasons that they should stop doing
  this but for now we're only relaxing the exe link permissions from
  userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

  There's a final uapi change in here. Changing the exe link used to
  accidently return EINVAL when the caller lacked the necessary
  permissions instead of the more correct EPERM. This pr contains a
  commit fixing this. I assume that userspace won't notice or care and
  if they do I will revert this commit. But since we are changing the
  permissions anyway it seems like a good opportunity to try this fix.

  With these changes merged unprivileged checkpoint/restore will be
  possible and has already been tested by various users"

[1] LPC 2018
     1. "Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
     2. "Securely Migrating Untrusted Workloads with CRIU"
        https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
     LPC 2019
     1. "CRIU and the PID dance"
         https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
     2. "Update on Task Migration at Google Using CRIU"
        https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s

[2] https://github.com/checkpoint-restore/criu/pull/1155

[3] https://github.com/nviennot/run_as_exe

* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests: add clone3() CAP_CHECKPOINT_RESTORE test
  prctl: exe link permission error changed from -EINVAL to -EPERM
  prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
  proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
  pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
  pid: use checkpoint_restore_ns_capable() for set_tid
  capabilities: Introduce CAP_CHECKPOINT_RESTORE
2020-08-04 15:02:07 -07:00
Linus Torvalds
9ba27414f2 fork-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXyge/QAKCRCRxhvAZXjc
 oildAQCCWpnTeXm6hrIE3VZ36X5npFtbaEthdBVAUJM7mo0FYwEA8+Wbnubg6jCw
 mztkXCnTfU7tApUdhKtQzcpEws45/Qk=
 =REE/
 -----END PGP SIGNATURE-----

Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fork cleanups from Christian Brauner:
 "This is cleanup series from when we reworked a chunk of the process
  creation paths in the kernel and switched to struct
  {kernel_}clone_args.

  High-level this does two main things:

   - Remove the double export of both do_fork() and _do_fork() where
     do_fork() used the incosistent legacy clone calling convention.

     Now we only export _do_fork() which is based on struct
     kernel_clone_args.

   - Remove the copy_thread_tls()/copy_thread() split making the
     architecture specific HAVE_COYP_THREAD_TLS config option obsolete.

  This switches all remaining architectures to select
  HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
  convention. The current split makes the process creation codepaths
  more convoluted than they need to be. Each architecture has their own
  copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it
  has a copy_thread_tls() function.

  The split is not needed anymore nowadays, all architectures support
  CLONE_SETTLS but quite a few of them never bothered to select
  HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
  and use the old calling convention. Removing this split cleans up the
  process creation codepaths and paves the way for implementing clone3()
  on such architectures since it requires the copy_thread_tls() calling
  convention.

  After having made each architectures support copy_thread_tls() this
  series simply renames that function back to copy_thread(). It also
  switches all architectures that call do_fork() directly over to
  _do_fork() and the struct kernel_clone_args calling convention. This
  is a corollary of switching the architectures that did not yet support
  it over to copy_thread_tls() since do_fork() is conditional on not
  supporting copy_thread_tls() (Mostly because it lacks a separate
  argument for tls which is trivial to fix but there's no need for this
  function to exist.).

  The do_fork() removal is in itself already useful as it allows to to
  remove the export of both do_fork() and _do_fork() we currently have
  in favor of only _do_fork(). This has already been discussed back when
  we added clone3(). The legacy clone() calling convention is - as is
  probably well-known - somewhat odd:

    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3

  that is aggravated by the fact that some architectures such as sparc
  follow the CLONE_BACKWARDSx calling convention but don't really select
  the corresponding config option since they call do_fork() directly.

  So do_fork() enforces a somewhat arbitrary calling convention in the
  first place that doesn't really help the individual architectures that
  deviate from it. They can thus simply be switched to _do_fork()
  enforcing a single calling convention. (I really hope that any new
  architectures will __not__ try to implement their own calling
  conventions...)

  Most architectures already have made a similar switch (m68k comes to
  mind).

  Overall this removes more code than it adds even with a good portion
  of added comments. It simplifies a chunk of arch specific assembly
  either by moving the code into C or by simply rewriting the assembly.

  Architectures that have been touched in non-trivial ways have all been
  actually boot and stress tested: sparc and ia64 have been tested with
  Debian 9 images. They are the two architectures which have been
  touched the most. All non-trivial changes to architectures have seen
  acks from the relevant maintainers. nios2 with a custom built
  buildroot image. h8300 I couldn't get something bootable to test on
  but the changes have been fairly automatic and I'm sure we'll hear
  people yell if I broke something there.

  All other architectures that have been touched in trivial ways have
  been compile tested for each single patch of the series via git rebase
  -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
  even though they have just been trivially touched (removal of the
  HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
  basically "core architectures" and since it is trivial to get your
  hands on a useable image"

* tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  arch: rename copy_thread_tls() back to copy_thread()
  arch: remove HAVE_COPY_THREAD_TLS
  unicore: switch to copy_thread_tls()
  sh: switch to copy_thread_tls()
  nds32: switch to copy_thread_tls()
  microblaze: switch to copy_thread_tls()
  hexagon: switch to copy_thread_tls()
  c6x: switch to copy_thread_tls()
  alpha: switch to copy_thread_tls()
  fork: remove do_fork()
  h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
  sparc: unconditionally enable HAVE_COPY_THREAD_TLS
  sparc: share process creation helpers between sparc and sparc64
  sparc64: enable HAVE_COPY_THREAD_TLS
  fork: fold legacy_clone_args_valid() into _do_fork()
2020-08-04 14:47:45 -07:00
Linus Torvalds
0a72761b27 threads-v5.9
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXygcLwAKCRCRxhvAZXjc
 ohajAP4n5E3BmN0jpIviXT4eNhP62jzxJtxlVXtgGT3D8b1mpQEA5n8NSOlQLoAh
 yUGsjtwR9xDcHMcrhXD3yN6eYJSK0A8=
 =tn4R
 -----END PGP SIGNATURE-----

Merge tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread updates from Christian Brauner:
 "This contains the changes to add the missing support for attaching to
  time namespaces via pidfds.

  Last cycle setns() was changed to support attaching to multiple
  namespaces atomically. This requires all namespaces to have a point of
  no return where they can't fail anymore.

  Specifically, <namespace-type>_install() is allowed to perform
  permission checks and install the namespace into the new struct nsset
  that it has been given but it is not allowed to make visible changes
  to the affected task. Once <namespace-type>_install() returns,
  anything that the given namespace type additionally requires to be
  setup needs to ideally be done in a function that can't fail or if it
  fails the failure must be non-fatal.

  For time namespaces the relevant functions that fell into this
  category were timens_set_vvar_page() and vdso_join_timens(). The
  latter could still fail although it didn't need to. This function is
  only implemented for vdso_join_timens() in current mainline. As
  discussed on-list (cf. [1]), in order to make setns() support time
  namespaces when attaching to multiple namespaces at once properly we
  changed vdso_join_timens() to always succeed. So vdso_join_timens()
  replaces the mmap_write_lock_killable() with mmap_read_lock().

  Please note that arm is about to grow vdso support for time namespaces
  (possibly this merge window). We've synced on this change and arm64
  also uses mmap_read_lock(), i.e. makes vdso_join_timens() a function
  that can't fail. Once the changes here and the arm64 changes have
  landed, vdso_join_timens() should be turned into a void function so
  it's obvious to callers and implementers on other architectures that
  the expectation is that it can't fail.

  We didn't do this right away because it would've introduced
  unnecessary merge conflicts between the two trees for no major gain.

  As always, tests included"

[1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein

* tag 'threads-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: add CLONE_NEWTIME setns tests
  nsproxy: support CLONE_NEWTIME with setns()
  timens: add timens_commit() helper
  timens: make vdso_join_timens() always succeed
2020-08-04 14:40:07 -07:00
Linus Torvalds
3950e97543 Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "During the development of v5.7 I ran into bugs and quality of
  implementation issues related to exec that could not be easily fixed
  because of the way exec is implemented. So I have been diggin into
  exec and cleaning up what I can.

  This cycle I have been looking at different ideas and different
  implementations to see what is possible to improve exec, and cleaning
  the way exec interfaces with in kernel users. Only cleaning up the
  interfaces of exec with rest of the kernel has managed to stabalize
  and make it through review in time for v5.9-rc1 resulting in 2 sets of
  changes this cycle.

   - Implement kernel_execve

   - Make the user mode driver code a better citizen

  With kernel_execve the code size got a little larger as the copying of
  parameters from userspace and copying of parameters from userspace is
  now separate. The good news is kernel threads no longer need to play
  games with set_fs to use exec. Which when combined with the rest of
  Christophs set_fs changes should security bugs with set_fs much more
  difficult"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
  exec: Implement kernel_execve
  exec: Factor bprm_stack_limits out of prepare_arg_pages
  exec: Factor bprm_execve out of do_execve_common
  exec: Move bprm_mm_init into alloc_bprm
  exec: Move initialization of bprm->filename into alloc_bprm
  exec: Factor out alloc_bprm
  exec: Remove unnecessary spaces from binfmts.h
  umd: Stop using split_argv
  umd: Remove exit_umh
  bpfilter: Take advantage of the facilities of struct pid
  exit: Factor thread_group_exited out of pidfd_poll
  umd: Track user space drivers with struct pid
  bpfilter: Move bpfilter_umh back into init data
  exec: Remove do_execve_file
  umh: Stop calling do_execve_file
  umd: Transform fork_usermode_blob into fork_usermode_driver
  umd: Rename umd_info.cmdline umd_info.driver_name
  umd: For clarity rename umh_info umd_info
  umh: Separate the user mode driver and the user mode helper support
  umh: Remove call_usermodehelper_setup_file.
  ...
2020-08-04 14:27:25 -07:00
Linus Torvalds
fd76a74d94 audit/stable-5.9 PR 20200803
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl8okpIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNqOQ/8D+m9Ykcby3csEKsp8YtsaukEu62U
 lRVaxzRNO9wwB24aFwDFuJnIkmsSi/s/O4nBsy2mw+Apn+uDCvHQ9tBU07vlNn2f
 lu27YaTya7YGlqoe315xijd8tyoX99k8cpQeixvAVr9/jdR09yka7SJ8O7X9mjV7
 +SUVDiKCplPKpiwCCRS9cqD7F64T6y35XKzbrzYqdP0UOF2XelZo/Evt5rDRvWUf
 5qDN2tP+iM/Fvu5lCfczFwAeivfAdxjQ11n783hx8Ms2qyiaKQCzbEwjqAslmkbs
 1k/+ED0NjzXX1ne0JZaz/bk0wsMnmOoa8o+NDcyd7Za/cj5prUZi7kBy+xry4YV8
 qKJ40Lk0flCWgUpm6bkYVOByIYHk0gmfBNvjilqf25NR/eOC/9e9ir8PywvYUW/7
 kvVK37+N/a3LnFj80sZpIeqqnNU8z9PV1i7//5/kDuKvz94Bq83TJDO6pPKvqDtC
 njQfCFoHwdEeF8OalK793lIiYaoODqvbkWKChKMqziODJ4ZP8AW06gXpEbEWn7G3
 TTnJx7hqzR9t90vBQJeO3Fromfn+9TDlZVdX+EGO8gIqUiLGr0r7LPPep4VkDbNw
 LxMYKeC2cgRp8Z+XXPDxfXSDL2psTwg6CXcDrXcYnUyBo/yerpBvbJkeaR0h+UR0
 j6cvMX+T39X2JXM=
 =Xs3M
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Aside from some smaller bug fixes, here are the highlights:

   - add a new backlog wait metric to the audit status message, this is
     intended to help admins determine how long processes have been
     waiting for the audit backlog queue to clear

   - generate audit records for nftables configuration changes

   - generate CWD audit records for for the relevant LSM audit records"

* tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: report audit wait metric in audit status reply
  audit: purge audit_log_string from the intra-kernel audit API
  audit: issue CWD record to accompany LSM_AUDIT_DATA_* records
  audit: use the proper gfp flags in the audit_log_nfcfg() calls
  audit: remove unused !CONFIG_AUDITSYSCALL __audit_inode* stubs
  audit: add gfp parameter to audit_log_nfcfg
  audit: log nftables configuration change events
  audit: Use struct_size() helper in alloc_chunk
2020-08-04 14:20:26 -07:00
Linus Torvalds
9ecc6ea491 seccomp updates for v5.9-rc1
- Improved selftest coverage, timeouts, and reporting
 - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
 - Refactor __scm_install_fd() into __receive_fd() and fix buggy callers
 - Introduce "addfd" command for SECCOMP_RET_USER_NOTIF (Sargun Dhillon)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oZcQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJomDD/4x3j7eXREcXDsHOmlgEaHWGx4l
 JldHFQhV5GjmD7gOkPcoZSG7NfG7F6VpwAJg7ZoR3qUkem7K8DFucxqgo1RldCot
 nigleeLX6JeMS0Z+iwjAVZd+5t4xG4J/7GGDHIIMiG5qvwJ0Yf64o1bkjaB2Q/Bv
 tluBg0WF32kFMG/ZwyY/V2QDbbue97CFPflybOh1o2nWbVzmUlFEEum3UUvZsxc8
 smMsattJyuAV7kcEKzKrs8b010NdFZqwdbub5Np9W3XEXGBYMdIPoNsOQGmB9wby
 j2ui0lzboXRG997jM7TCd1l/XZAv8aAwvPplw3FJRybzkOGs9NDyLMoz87yJpR1T
 xp511vnMyMbyKIGdungkt7cIyzaictHwaYzznsmuNdCPEjTaIQJr1ctsa4GEgtqf
 pnkktZ9YbMCcHU0CtZ8GlOVqA9wE+FUm0/u0zgikzJQsB+HcNItiARTTTHRyco7p
 VJCqK8o4Zx4ELV7QNkSH4nhFkVgRopvrvBiPAGro/qwGOofBg8W8wM8O1+V/MDmp
 zSU22v4SncT1Xb7dtmdJqDEeHfDikhaCAb4Je2hsGQWzbdAqwHGlpa7vpk9x3Q5r
 L+XyP+Z+rPHlXYyypJwUvvOQhXOmP0zYxcEHxByqIBfXiwy+3dN4tDDfatWbccwl
 uTlTDM8kmQn6QzSztA==
 =yb55
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:
 "There are a bunch of clean ups and selftest improvements along with
  two major updates to the SECCOMP_RET_USER_NOTIF filter return:
  EPOLLHUP support to more easily detect the death of a monitored
  process, and being able to inject fds when intercepting syscalls that
  expect an fd-opening side-effect (needed by both container folks and
  Chrome). The latter continued the refactoring of __scm_install_fd()
  started by Christoph, and in the process found and fixed a handful of
  bugs in various callers.

   - Improved selftest coverage, timeouts, and reporting

   - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)

   - Refactor __scm_install_fd() into __receive_fd() and fix buggy
     callers

   - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
     Dhillon)"

* tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
  selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
  seccomp: Introduce addfd ioctl to seccomp user notifier
  fs: Expand __receive_fd() to accept existing fd
  pidfd: Replace open-coded receive_fd()
  fs: Add receive_fd() wrapper for __receive_fd()
  fs: Move __scm_install_fd() to __receive_fd()
  net/scm: Regularize compat handling of scm_detach_fds()
  pidfd: Add missing sock updates for pidfd_getfd()
  net/compat: Add missing sock updates for SCM_RIGHTS
  selftests/seccomp: Check ENOSYS under tracing
  selftests/seccomp: Refactor to use fixture variants
  selftests/harness: Clean up kern-doc for fixtures
  seccomp: Use -1 marker for end of mode 1 syscall list
  seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
  selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
  selftests/seccomp: Make kcmp() less required
  seccomp: Use pr_fmt
  selftests/seccomp: Improve calibration loop
  selftests/seccomp: use 90s as timeout
  selftests/seccomp: Expand benchmark to per-filter measurements
  ...
2020-08-04 14:11:08 -07:00
Linus Torvalds
99ea1521a0 Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var()
 - Update documentation and checkpatch for uninitialized_var() removal
 - Treewide removal of uninitialized_var()
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq
 il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o
 XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp
 UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2
 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8
 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI
 h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z
 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I
 AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG
 k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa
 dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+
 drhmmImUr1YylrtVOw==
 =xHmk
 -----END PGP SIGNATURE-----

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
Linus Torvalds
427714f258 tasklets API update for v5.9-rc1
- Prepare for tasklet API modernization (Romain Perier, Allen Pais, Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oXpMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJtJgEACVb88nzYwu5mC5ZcfvwSyXeQsR
 eDpCkX5HT6CsxlOn0/YJvxUtkkerQftbRuAXrzoUpQkpyBh82PviVZFKDS7NE9Lc
 6xPqloi2gbZ8EfgMraVynL+9lpLh0+qNCM7LPg4xT+JxMDLut/nWRdrp8d7uBYfQ
 AXV6CV4Tc4ijOMROV6AEVVdSTzkRCbiqUnRDBLETBfiJOdDn5MgJgxicWvN5FTpu
 PiUVF3CtWaKCRfQO/GEAXTG65hOtmql5IbX9n7uooNu/wCCnEFfVUus1uTcsrqxN
 ByrZ56NVPoO7z2jYLt8Lft3myo2e/mn88PKqrzS2p9GPn0VBv7rcO3ePmbbHL/RA
 mp+pg8wdpmKrHv4YGfsF+obT1v8f6VJoTLUt5S/WqZAzl1sVJgEJdAkjmDKythGG
 yYKKCemMceMMzLXxnFAYMzdXzdXZ3YEpiW4UkBb77EhUisDrLxCHSL5t4UzyWnuO
 Gtzw7N69iHPHLsxAk1hESAD8sdlk2EdN6vzJVelOsiW955x1hpR+msvNpwZwBqdq
 A2h8VnnrxLK2APl93T5VW9T6kvhzaTwLhoCH+oKklE+U0XJTAYZ4D/AcRVghBvMg
 bC1+1vDx+t/S+8P308evPQnEygLtL2I+zpPnBA1DZzHRAoY8inCLc5HQOfr6pi/f
 koNTtKkmSSKaFSYITw==
 =hb+e
 -----END PGP SIGNATURE-----

Merge tag 'tasklets-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull tasklets API update from Kees Cook:
 "These are the infrastructure updates needed to support converting the
  tasklet API to something more modern (and hopefully for removal
  further down the road).

  There is a 300-patch series waiting in the wings to get set out to
  subsystem maintainers, but these changes need to be present in the
  kernel first. Since this has some treewide changes, I carried this
  series for -next instead of paining Thomas with it in -tip, but it's
  got his Ack.

  This is similar to the timer_struct modernization from a while back,
  but not nearly as messy (I hope). :)

   - Prepare for tasklet API modernization (Romain Perier, Allen Pais,
     Kees Cook)"

* tag 'tasklets-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  tasklet: Introduce new initialization API
  treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()
  usb: gadget: udc: Avoid tasklet passing a global
2020-08-04 13:40:35 -07:00
Linus Torvalds
3e4a12a1ba GCC plugins updates for v5.9-rc1
- Update URLs for HTTPS scheme where available (Alexander A. Klimov)
 - Improve STACKLEAK code generation on x86 (Alexander Popov)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oXDwWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJk+oD/0VHjn3KWSMtJmBkABzbWnzi6m6
 O3J5IJ1qb7b7AriD04/YAx1YaIPknsircv5hJNAiB4c8f9QoVcnufQlp0lsSW/FR
 3bQ8B7zwuw19bq2nITndc9HvjVbNg5aie6I4umeIbkzWzaHfXPuQ/wF0arSDDB7I
 Kmq1gxsSj9wHl5rly06dPW536zTehRfrHiB4nFQnGk1HKBOlhosJ4bNpC9wkbrii
 0TKcOoGw9aAT1m/RYQdaLKDThuEZFdYK8xcNP1gUrH5gHuntpZprVRT4jCZuEMLx
 sEpcabjvfILBGn8/74g/ld1UOjti+5sNUPqHt8poViMlM06YReZlH3QcxJwa+mSY
 spWx54IJs7FXRw42Sj4HEmQQPcffdvFLkes26h3colAhFKJWwRs3vWZRW8ahyLE2
 U/TbkhAWeKpCaLUf6oPST76TdYKGxKxypVG9xaE31YVacjwbHIBE9uP6iNFR974R
 caWoSmMp6ImtxUNAwQGK4zJHJe1x/V5msh85y9TihwX6DNJJp12WuiN6OX5DL4do
 wYhVFDD71v8F6zzYAwI22yPd77P44fQZ40Aayw8Yaa7A6yuB0Pru/paiEttfIBqo
 knVAczXetZKWBogmXply4vqwLXx6wIAgslQLzxDBAaNjQ62DZ63ZbxKjaa317hL6
 mKucFRyn4LXA2i3Dsw==
 =X+DU
 -----END PGP SIGNATURE-----

Merge tag 'gcc-plugins-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull gcc plugin updates from Kees Cook:
 "Primarily improvements to STACKLEAK from Alexander Popov, along with
  some additional cleanups.

    - Update URLs for HTTPS scheme where available (Alexander A. Klimov)

   - Improve STACKLEAK code generation on x86 (Alexander Popov)"

* tag 'gcc-plugins-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  gcc-plugins: Replace HTTP links with HTTPS ones
  gcc-plugins/stackleak: Add 'verbose' plugin parameter
  gcc-plugins/stackleak: Use asm instrumentation to avoid useless register saving
  ARM: vdso: Don't use gcc plugins for building vgettimeofday.c
  gcc-plugins/stackleak: Don't instrument itself
2020-08-04 13:26:06 -07:00
Petr Mladek
57e60db3bc Merge branch 'for-5.9-console-return-codes' into for-linus 2020-08-04 16:27:43 +02:00
Linus Torvalds
0408497800 Power management updates for 5.9-rc1
- Make the Energy Model cover non-CPU devices (Lukasz Luba).
 
  - Add Ice Lake server idle states table to the intel_idle driver
    and eliminate a redundant static variable from it (Chen Yu,
    Rafael Wysocki).
 
  - Eliminate all W=1 build warnings from cpufreq (Lee Jones).
 
  - Add support for Sapphire Rapids and for Power Limit 4 to the
    Intel RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).
 
  - Fix function name in kerneldoc comments in the idle_inject power
    capping driver (Yangtao Li).
 
  - Fix locking issues with cpufreq governors and drop a redundant
    "weak" function definition from cpufreq (Viresh Kumar).
 
  - Rearrange cpufreq to register non-modular governors at the
    core_initcall level and allow the default cpufreq governor to
    be specified in the kernel command line (Quentin Perret).
 
  - Extend, fix and clean up the intel_pstate driver (Srinivas
    Pandruvada, Rafael Wysocki):
 
    * Add a new sysfs attribute for disabling/enabling CPU
      energy-efficiency optimizations in the processor.
 
    * Make the driver avoid enabling HWP if EPP is not supported.
 
    * Allow the driver to handle numeric EPP values in the sysfs
      interface and fix the setting of EPP via sysfs in the active
      mode.
 
    * Eliminate a static checker warning and clean up a kerneldoc
      comment.
 
  - Clean up some variable declarations in the powernv cpufreq
    driver (Wei Yongjun).
 
  - Fix up the ->enter_s2idle callback definition to cover the case
    when it points to the same function as ->idle correctly (Neal
    Liu).
 
  - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).
 
  - Make the PM core emit "changed" uevent when adding/removing the
    "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).
 
  - Add a helper macro for declaring PM callbacks and use it in the
    MMC jz4740 driver (Paul Cercueil).
 
  - Fix white space in some places in the hibernate code and make the
    system-wide PM code use "const char *" where appropriate (Xiang
    Chen, Alexey Dobriyan).
 
  - Add one more "unsafe" helper macro to the freezer to cover the NFS
    use case (He Zhe).
 
  - Change the language in the generic PM domains framework to use
    parent/child terminology and clean up a typo and some comment
    fromatting in that code (Kees Cook, Geert Uytterhoeven).
 
  - Update the operating performance points OPP framework (Lukasz
    Luba, Andrew-sh.Cheng, Valdis Kletnieks):
 
    * Refactor dev_pm_opp_of_register_em() and update related drivers.
 
    * Add a missing function export.
 
    * Allow disabled OPPs in dev_pm_opp_get_freq().
 
  - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
    Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):
 
    * Add support for delayed timers to the devfreq core and make the
      Samsung exynos5422-dmc driver use it.
 
    * Unify sysfs interface to use "df-" as a prefix in instance names
      consistently.
 
    * Fix devfreq_summary debugfs node indentation.
 
    * Add the rockchip,pmu phandle to the rk3399_dmc driver DT
      bindings.
 
    * List Dmitry Osipenko as the Tegra devfreq driver maintainer.
 
    * Fix typos in the core devfreq code.
 
  - Update the pm-graph utility to version 5.7 including a number of
    fixes related to suspend-to-idle (Todd Brandt).
 
  - Fix coccicheck errors and warnings in the cpupower utility (Shuah
    Khan).
 
  - Replace HTTP links with HTTPs ones in multiple places (Alexander
    A. Klimov).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl8oO24SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx7ZQP/0lQ0yABnASnwomdOH6+K/m7rvc+e9FE
 zx5pTDQswhU5tM7SQAIKqe0uSI+okF2UrBrT5onA16F+JUbnrbexJLazBPfVTTGF
 AKpKEQ7Wh69Wz+Y6cQZjm1dTuRL+dlBJuBrzR2tLSnONPMMHuFcO3xd7lgE9UAxC
 oGEf393taA6OqcUNRQIa2gqbq+k1qhKjeDucGkbOaoJ6CL0ZyWI+Tfw1WWaBBGv0
 /2wBd6V513OH8WtQCW6H3YpHmhYW6OwL8w19KyGcjPRGJaeaIP4W/Ng7mkvgL5ZB
 vZqg3XiufFV9uTe8W1NQaVv/NjlN256OteuK809aosTVjD0dhFkhBYg5TLu6HbQq
 C/NciZ+78oLedWLT73EUfw3NyS+V0jk6X2EIlBUwNi0Qw1B1pCifGOCKzWFFe5cr
 ci4xr4FG7dBkxScOxwFAU2s5TdPHLOkGkQtg4jZr0OYDrzkyLEdsnZEUjLPORo+0
 6EBXGfTOSy2CBHcYswRtzJr/1pUTzj7oejhTAMCCuYW2r3VyQtnYcVjlehtp20if
 6BfmGisk8nmtxlSm+/Y2FqKa4bNnSTMmr0UJQ+Rjp0tHs47QeucI0ORfZ5nPaBac
 +ptvIjWmn3xejT/+oAehpH9066Iuy66vzHdnj7x5+WAsmYS8n8OFtlBFkYELmLJB
 3xI5hIl7WtGo
 =8cUO
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "The most significant change here is the extension of the Energy Model
  to cover non-CPU devices (as well as CPUs) from Lukasz Luba.

  There is also some new hardware support (Ice Lake server idle states
  table for intel_idle, Sapphire Rapids and Power Limit 4 support in the
  RAPL driver), some new functionality in the existing drivers (eg. a
  new switch to disable/enable CPU energy-efficiency optimizations in
  intel_pstate, delayed timers in devfreq), some assorted fixes (cpufreq
  core, intel_pstate, intel_idle) and cleanups (eg. cpuidle-psci,
  devfreq), including the elimination of W=1 build warnings from cpufreq
  done by Lee Jones.

  Specifics:

   - Make the Energy Model cover non-CPU devices (Lukasz Luba).

   - Add Ice Lake server idle states table to the intel_idle driver and
     eliminate a redundant static variable from it (Chen Yu, Rafael
     Wysocki).

   - Eliminate all W=1 build warnings from cpufreq (Lee Jones).

   - Add support for Sapphire Rapids and for Power Limit 4 to the Intel
     RAPL power capping driver (Sumeet Pawnikar, Zhang Rui).

   - Fix function name in kerneldoc comments in the idle_inject power
     capping driver (Yangtao Li).

   - Fix locking issues with cpufreq governors and drop a redundant
     "weak" function definition from cpufreq (Viresh Kumar).

   - Rearrange cpufreq to register non-modular governors at the
     core_initcall level and allow the default cpufreq governor to be
     specified in the kernel command line (Quentin Perret).

   - Extend, fix and clean up the intel_pstate driver (Srinivas
     Pandruvada, Rafael Wysocki):

       * Add a new sysfs attribute for disabling/enabling CPU
         energy-efficiency optimizations in the processor.

       * Make the driver avoid enabling HWP if EPP is not supported.

       * Allow the driver to handle numeric EPP values in the sysfs
         interface and fix the setting of EPP via sysfs in the active
         mode.

       * Eliminate a static checker warning and clean up a kerneldoc
         comment.

   - Clean up some variable declarations in the powernv cpufreq driver
     (Wei Yongjun).

   - Fix up the ->enter_s2idle callback definition to cover the case
     when it points to the same function as ->idle correctly (Neal Liu).

   - Rearrange and clean up the PSCI cpuidle driver (Ulf Hansson).

   - Make the PM core emit "changed" uevent when adding/removing the
     "wakeup" sysfs attribute of devices (Abhishek Pandit-Subedi).

   - Add a helper macro for declaring PM callbacks and use it in the MMC
     jz4740 driver (Paul Cercueil).

   - Fix white space in some places in the hibernate code and make the
     system-wide PM code use "const char *" where appropriate (Xiang
     Chen, Alexey Dobriyan).

   - Add one more "unsafe" helper macro to the freezer to cover the NFS
     use case (He Zhe).

   - Change the language in the generic PM domains framework to use
     parent/child terminology and clean up a typo and some comment
     fromatting in that code (Kees Cook, Geert Uytterhoeven).

   - Update the operating performance points OPP framework (Lukasz Luba,
     Andrew-sh.Cheng, Valdis Kletnieks):

       * Refactor dev_pm_opp_of_register_em() and update related drivers.

       * Add a missing function export.

       * Allow disabled OPPs in dev_pm_opp_get_freq().

   - Update devfreq core and drivers (Chanwoo Choi, Lukasz Luba, Enric
     Balletbo i Serra, Dmitry Osipenko, Kieran Bingham, Marc Zyngier):

       * Add support for delayed timers to the devfreq core and make the
         Samsung exynos5422-dmc driver use it.

       * Unify sysfs interface to use "df-" as a prefix in instance
         names consistently.

       * Fix devfreq_summary debugfs node indentation.

       * Add the rockchip,pmu phandle to the rk3399_dmc driver DT
         bindings.

       * List Dmitry Osipenko as the Tegra devfreq driver maintainer.

       * Fix typos in the core devfreq code.

   - Update the pm-graph utility to version 5.7 including a number of
     fixes related to suspend-to-idle (Todd Brandt).

   - Fix coccicheck errors and warnings in the cpupower utility (Shuah
     Khan).

   - Replace HTTP links with HTTPs ones in multiple places (Alexander A.
     Klimov)"

* tag 'pm-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (71 commits)
  cpuidle: ACPI: fix 'return' with no value build warning
  cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
  cpufreq: intel_pstate: Rearrange the storing of new EPP values
  intel_idle: Customize IceLake server support
  PM / devfreq: Fix the wrong end with semicolon
  PM / devfreq: Fix indentaion of devfreq_summary debugfs node
  PM / devfreq: Clean up the devfreq instance name in sysfs attr
  memory: samsung: exynos5422-dmc: Add module param to control IRQ mode
  memory: samsung: exynos5422-dmc: Adjust polling interval and uptreshold
  memory: samsung: exynos5422-dmc: Use delayed timer as default
  PM / devfreq: Add support delayed timer for polling mode
  dt-bindings: devfreq: rk3399_dmc: Add rockchip,pmu phandle
  PM / devfreq: tegra: Add Dmitry as a maintainer
  PM / devfreq: event: Fix trivial spelling
  PM / devfreq: rk3399_dmc: Fix kernel oops when rockchip,pmu is absent
  cpuidle: change enter_s2idle() prototype
  cpuidle: psci: Prevent domain idlestates until consumers are ready
  cpuidle: psci: Convert PM domain to platform driver
  cpuidle: psci: Fix error path via converting to a platform driver
  cpuidle: psci: Fail cpuidle registration if set OSI mode failed
  ...
2020-08-03 20:28:08 -07:00
David S. Miller
2e7199bd77 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-08-04

The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).

The main changes are:

1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
   syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.

2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
   in-kernel inspection, from Yonghong Song and Alexei Starovoitov.

3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
   unwinder errors, from Song Liu.

4) Allow cgroup local storage map to be shared between programs on the same
   cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.

5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
   load instructions, from Jean-Philippe Brucker.

6) Follow-up fixes on BPF socket lookup in combination with reuseport group
   handling. Also add related BPF selftests, from Jakub Sitnicki.

7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
   socket create/release as well as bind functions, from Stanislav Fomichev.

8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
   xdp_statistics, from Peilin Ye.

9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.

10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
    fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.

11) Fix a bpftool segfault due to missing program type name and make it more robust
    to prevent them in future gaps, from Quentin Monnet.

12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
    resolver issue, from John Fastabend.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:27:40 -07:00
Linus Torvalds
e4cbce4d13 The main changes in this cycle were:
- Improve uclamp performance by using a static key for the fast path
 
  - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)
 
  - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error, which created larger and larger relative errors as the values
    become larger. This is now replaced with more precise arithmetics,
    using the new mul_u64_u64_div_u64() helper in math64.h.
 
  - Improve the deadline scheduler, such as making it capacity aware
 
  - Improve frequency-invariant scheduling
 
  - Misc cleanups in energy/power aware scheduling
 
  - Add sched_update_nr_running tracepoint to track changes to nr_running
 
  - Documentation additions and updates
 
  - Misc cleanups and smaller fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oJDURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ixLg//bqWzFlfWirvngTgDxDnplwUTyKXmMCcq
 R1IYhlyK2O5FxvhbRmdmW11W3yzyTPvgCs6Q/70negGaPNe2w1OxfxiK9NMKz5eu
 M1LoXas7pL5g7Pr/ZxxHk/8VqJLV4t9MkodiiInmV6lTaznT3sU6a/kpYQjJyFnG
 Tuu9jd6JhdRKmePDJnNmUBoGQ7JiOQDcX4HtkcQ3OA+An3624tmJzbW1yts+uj7J
 ZWo2EY60RfbA9MxQXGPOaR/nAjngWs4Q6tddAh10mftsPq1gR2iFUKju1d31MQt/
 RHLdiqJf+AyUC4popKG7a+7ilCKMBwPociSreTJNPyEUQ1X4AM3vUVk4yjUoiDph
 k2WdsCF8/JRdhXg0NnrpPUqOaAbQj53EeXnitEb92E7WyTZgLOvAtpV//xZo6utp
 2QHerfrQ9SoGQjz/ho78za5vQtV1x25yDhd+X4XV4QEhIy85G9/2JCpC/Kc/TXLf
 OO7A4X69XztKTEJhP60g8ldCPUe4N2vbh1vKY6oAD8AFQVVNZ6n7375/Qa//b0/k
 ++hcYkPc2EK97/aBFdvzDgqb7aUo7Mtn2ibke16sQU4szulaoRuAHQG4jdGKMwbD
 dk2VBoxyxeYFXWHsNneSe87+ha3sd0dSN0ul1EB/SlFrVELMvy634YXnMYGW8ima
 PzyPB0ezpuA=
 =PbO7
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - Improve uclamp performance by using a static key for the fast path

 - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
   better power efficiency of RT tasks on battery powered devices.
   (The default is to maximize performance & reduce RT latencies.)

 - Improve utime and stime tracking accuracy, which had a fixed boundary
   of error, which created larger and larger relative errors as the
   values become larger. This is now replaced with more precise
   arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.

 - Improve the deadline scheduler, such as making it capacity aware

 - Improve frequency-invariant scheduling

 - Misc cleanups in energy/power aware scheduling

 - Add sched_update_nr_running tracepoint to track changes to nr_running

 - Documentation additions and updates

 - Misc cleanups and smaller fixes

* tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
  sched/doc: Document capacity aware scheduling
  sched: Document arch_scale_*_capacity()
  arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
  Documentation/sysctl: Document uclamp sysctl knobs
  sched/uclamp: Add a new sysctl to control RT default boost value
  sched/uclamp: Fix a deadlock when enabling uclamp static key
  sched: Remove duplicated tick_nohz_full_enabled() check
  sched: Fix a typo in a comment
  sched/uclamp: Remove unnecessary mutex_init()
  arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
  sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
  arch_topology, sched/core: Cleanup thermal pressure definition
  trace/events/sched.h: fix duplicated word
  linux/sched/mm.h: drop duplicated words in comments
  smp: Fix a potential usage of stale nr_cpus
  sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
  sched: nohz: stop passing around unused "ticks" parameter.
  sched: Better document ttwu()
  sched: Add a tracepoint to track rq->nr_running
  ...
2020-08-03 14:58:38 -07:00
Linus Torvalds
b34133fec8 - HW support updates:
- Add uncore support for Intel Comet Lake
 
    - Add RAPL support for Hygon Fam18h
 
    - Add Intel "IIO stack to PMON mapping" support on Skylake-SP CPUs,
      which enumerates per device performance counters via sysfs and enables
      the perf stat --iiostat functionality
 
    - Add support for Intel "Architectural LBRs", which generalized the model
      specific LBR hardware tracing feature into a model-independent, architected
      performance monitoring feature. Usage is mostly seamless to tooling, as the
      pre-existing LBR features are kept, but there's a couple of advantages
      under the hood, such as faster context-switching, faster LBR reads,
      cleaner exposure of LBR features to guest kernels, etc.
 
      ( Since architectural LBRs are supported via XSAVE, there's related
        changes to the x86 FPU code as well. )
 
  - ftrace/perf updates: Add support to add a text poke event to record changes
                         to kernel text (i.e. self-modifying code) in order to
                         support tracers like Intel PT decoding through
                         jump labels, kprobes and ftrace trampolines.
 
  - Misc cleanups, smaller fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oAgcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gcSA/9EwKeLF03jkEXwzF/a/YhCxZXODH/klz/
 5D/Li+0HJy9TTVQWaSOxu31VcnWyAPER97aRjHohNMrAFKpAC4GwzxF2fjKUzzKJ
 eoWIgXvtlMM+nQb93UTB2+9Z3eHBEpKsqP8oc6qeXa74b2p3WfmvFRPBWFuzmOlH
 nb26F/Cu46HTEUfWvggU9flS0HpkdZ8X2Rt14sRwq5Gi2Wa/5+ygaksD+5nwRlGM
 r7jBrZBDTOGhy7HjrjpDPif056YU31giKmMQ/j17h1NaT3ciyXYSi0FuKEghDCII
 2OFyH0wZ1vsp63GISosIKFLFoBmOd4He4/sKjdtOtnosan250t3/ZDH/7tw6Rq2V
 tf1o/dMbDmV9v0lAVBZO76Z74ZQbk3+TvFxyDwtBSQYBe2eVfNz0VY4YjSRlRIp0
 1nIbJqiMLa7uquL2K4zZKapt7qsMaVqLO4YUVTzYPvv3luAqFLvC83a2+hapz4cs
 w4nET8lpWanUBK0hidQe1J6NPM4v1mnsvuZfM0p/QwKN9uvV5KoT6YJhRqfTy51g
 je+G80q0XqOH0H8x9iWuLiJe0G72UyhRqzSTxg+Cjj9cAhnsFPFLCNMWSVHqioLP
 JXGQiTp+6SQM6JDXkj5F8InsyT4KfzqizMSnAaH+6bsv9iQKDL4AbD7r92g6nbN9
 PP43QQh23Fg=
 =4pKU
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:
 "HW support updates:

   - Add uncore support for Intel Comet Lake

   - Add RAPL support for Hygon Fam18h

   - Add Intel "IIO stack to PMON mapping" support on Skylake-SP CPUs,
     which enumerates per device performance counters via sysfs and
     enables the perf stat --iiostat functionality

   - Add support for Intel "Architectural LBRs", which generalized the
     model specific LBR hardware tracing feature into a
     model-independent, architected performance monitoring feature.

     Usage is mostly seamless to tooling, as the pre-existing LBR
     features are kept, but there's a couple of advantages under the
     hood, such as faster context-switching, faster LBR reads, cleaner
     exposure of LBR features to guest kernels, etc.

     ( Since architectural LBRs are supported via XSAVE, there's related
       changes to the x86 FPU code as well. )

  ftrace/perf updates:

   - Add support to add a text poke event to record changes to kernel
     text (i.e. self-modifying code) in order to support tracers like
     Intel PT decoding through jump labels, kprobes and ftrace
     trampolines.

  Misc cleanups, smaller fixes..."

* tag 'perf-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
  perf/x86/rapl: Add Hygon Fam18h RAPL support
  kprobes: Remove unnecessary module_mutex locking from kprobe_optimizer()
  x86/perf: Fix a typo
  perf: <linux/perf_event.h>: drop a duplicated word
  perf/x86/intel/lbr: Support XSAVES for arch LBR read
  perf/x86/intel/lbr: Support XSAVES/XRSTORS for LBR context switch
  x86/fpu/xstate: Add helpers for LBR dynamic supervisor feature
  x86/fpu/xstate: Support dynamic supervisor feature for LBR
  x86/fpu: Use proper mask to replace full instruction mask
  perf/x86: Remove task_ctx_size
  perf/x86/intel/lbr: Create kmem_cache for the LBR context data
  perf/core: Use kmem_cache to allocate the PMU specific data
  perf/core: Factor out functions to allocate/free the task_ctx_data
  perf/x86/intel/lbr: Support Architectural LBR
  perf/x86/intel/lbr: Factor out intel_pmu_store_lbr
  perf/x86/intel/lbr: Factor out rdlbr_all() and wrlbr_all()
  perf/x86/intel/lbr: Mark the {rd,wr}lbr_{to,from} wrappers __always_inline
  perf/x86/intel/lbr: Unify the stored format of LBR information
  perf/x86/intel/lbr: Support LBR_CTL
  perf/x86: Expose CPUID enumeration bits for arch LBR
  ...
2020-08-03 14:51:09 -07:00
Linus Torvalds
9ba19ccd2d These were the main changes in this cycle:
- LKMM updates: mostly documentation changes, but also some new litmus tests for atomic ops.
 
  - KCSAN updates: the most important change is that GCC 11 now has all fixes in place
                   to support KCSAN, so GCC support can be enabled again. Also more annotations.
 
  - futex updates: minor cleanups and simplifications
 
  - seqlock updates: merge preparatory changes/cleanups for the 'associated locks' facilities.
 
  - lockdep updates:
     - simplify IRQ trace event handling
     - add various new debug checks
     - simplify header dependencies, split out <linux/lockdep_types.h>, decouple
       lockdep from other low level headers some more
     - fix NMI handling
 
  - misc cleanups and smaller fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8n9/wRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hZFQ//dD+AKw9Nym+WbylovmeD0qxWxPyeN/jG
 vBVDTOJIJLtZTkZf6YHcYOJlPwaMDYUQluqTPQhsaQZy/NoEb5NM2cFAj2R9gjyT
 O8665T1dvhW9Sh353mBpuwviqdrnvCeHTBEcglSlFY7hxToYAflUN0+DXGVtNys8
 PFNf3L9SHT0GLVC8+di/eJzQaRqxiB0Pq7kvh2RvPJM/dcQNA9Ho3CCNO5j6qGoY
 u7OnMT8xJXkgbdjjUO4RO0v9VjMuNthZ2JiONDgvgKtJfIL2wt5YXIv1EYX0GuWp
 WZgIzE4o1G7GJOOzKpFfZFyK8grHu2fWgK1plvodWjlLkBmltJZ1qyOM+wngd/m2
 TgtPo73/YFbxFUbbBpkb0eiIaH2t99kMvfCWd05+GiPCtzn9UL9GfFRWd42vonwc
 sQWjFrHKlnuzifUfNcLmKg7R2nUtF3Dm/SydiTJ+9NtH/QA17YJKWnlE1moulNtQ
 p7H7+8UdcvSQ7F38A74v2IYNIyDsv5qcE8ar4QHdaanBBX/LCyD0UlfgsgxEReXf
 GDKkpx7LFQlI6Y2YB+dZgkCwhNBl3/OQ3v6hC95B37fA67dAIQyPIWHiHbaM+029
 gghqU4GcUcbjSnHPzl9PPL+hi9MyXrMjpb7CBXytg4NI4EE1waHR+0kX14V8ndRj
 MkWQOKPUgB0=
 =3MTT
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:

 - LKMM updates: mostly documentation changes, but also some new litmus
   tests for atomic ops.

 - KCSAN updates: the most important change is that GCC 11 now has all
   fixes in place to support KCSAN, so GCC support can be enabled again.
   Also more annotations.

 - futex updates: minor cleanups and simplifications

 - seqlock updates: merge preparatory changes/cleanups for the
   'associated locks' facilities.

 - lockdep updates:
    - simplify IRQ trace event handling
    - add various new debug checks
    - simplify header dependencies, split out <linux/lockdep_types.h>,
      decouple lockdep from other low level headers some more
    - fix NMI handling

 - misc cleanups and smaller fixes

* tag 'locking-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  kcsan: Improve IRQ state trace reporting
  lockdep: Refactor IRQ trace events fields into struct
  seqlock: lockdep assert non-preemptibility on seqcount_t write
  lockdep: Add preemption enabled/disabled assertion APIs
  seqlock: Implement raw_seqcount_begin() in terms of raw_read_seqcount()
  seqlock: Add kernel-doc for seqcount_t and seqlock_t APIs
  seqlock: Reorder seqcount_t and seqlock_t API definitions
  seqlock: seqcount_t latch: End read sections with read_seqcount_retry()
  seqlock: Properly format kernel-doc code samples
  Documentation: locking: Describe seqlock design and usage
  locking/qspinlock: Do not include atomic.h from qspinlock_types.h
  locking/atomic: Move ATOMIC_INIT into linux/types.h
  lockdep: Move list.h inclusion into lockdep.h
  locking/lockdep: Fix TRACE_IRQFLAGS vs. NMIs
  futex: Remove unused or redundant includes
  futex: Consistently use fshared as boolean
  futex: Remove needless goto's
  futex: Remove put_futex_key()
  rwsem: fix commas in initialisation
  docs: locking: Replace HTTP links with HTTPS ones
  ...
2020-08-03 14:39:35 -07:00
Linus Torvalds
8f0cb6660a These are the latest RCU bits for v5.9:
- kfree_rcu updates
   - RCU tasks updates
   - Read-side scalability tests
   - SRCU updates
   - Torture-test updates
   - Documentation updates
   - Miscellaneous fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8n80ERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gauA/+NtuExW9V9cPDZ8AAp6x6QfoEIgqN4VEk
 pYuyP0+ZbmwH+h8z7qPqMrwxUHQnhef7gqtlWa7wj9MawbEbmqnA/3uivjX/3Aao
 bGMMXkqXppc6hgwktgLNk8vfq3LRVEH2P0i0I+Tymgxu3DCHSGRep4LWfdAS/q3z
 4pe5JXqdMx+Qnfy/bsVxJTaJAncMq1LQNAtWY1TIwK8L8RmpXrj5dvuLKUr7q+zl
 P+BfXyrdX+x05TpmHHnI/bR3w9yASL32E0S3IaQYRRqH8TsUIGHWe13Ib6hKXXG5
 j7W5KrsOgr0fQBxi+JW2fgGQkrua4o7yk4H2Ygj+Fi5RvP2uqNZdvXFAlP2cUMu/
 7Pg8+7kC6jKIrwpD03s9ZZzm0QN3jsCxFs2PEkkHMzjXbe1CI4tIkTH6ex1uvjR2
 v3OhCIp6ypxpEIJbFQucia0iQ4NF+evKjqCvRkbepqQ096jg+CNFh0VG0Tp8XR+y
 Gk9B9oXvLLPMd6ah5CI9nLJKiMWVRV8mvvqspoblGo//+39ksh4mzxm865tFXYg4
 C+DPJvKlY15Ib5eJ/xr8EZ/oS0K2sUF9sMYnK4P8QMhyTBMbpAZiljHYK+Wujt8I
 g/JCWxrEMv3LHPY9/guB5Nod/Qb4Jqqm9iE9qEX3MQxtt2O2nmmWd91pzFcUXlFU
 RDBWYJ63Okg=
 =rNhf
 -----END PGP SIGNATURE-----

Merge tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Ingo Molnar:

 - kfree_rcu updates

 - RCU tasks updates

 - Read-side scalability tests

 - SRCU updates

 - Torture-test updates

 - Documentation updates

 - Miscellaneous fixes

* tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits)
  torture: Remove obsolete "cd $KVM"
  torture: Avoid duplicate specification of qemu command
  torture: Dump ftrace at shutdown only if requested
  torture: Add kvm-tranform.sh script for qemu-cmd files
  torture: Add more tracing crib notes to kvm.sh
  torture: Improve diagnostic for KCSAN-incapable compilers
  torture: Correctly summarize build-only runs
  torture: Pass --kmake-arg to all make invocations
  rcutorture: Check for unwatched readers
  torture: Abstract out console-log error detection
  torture: Add a stop-run capability
  torture: Create qemu-cmd in --buildonly runs
  rcu/rcutorture: Replace 0 with false
  torture: Add --allcpus argument to the kvm.sh script
  torture: Remove whitespace from identify_qemu_vcpus output
  rcutorture: NULL rcu_torture_current earlier in cleanup code
  rcutorture: Handle non-statistic bang-string error messages
  torture: Set configfile variable to current scenario
  rcutorture: Add races with task-exit processing
  locktorture: Use true and false to assign to bool variables
  ...
2020-08-03 14:31:33 -07:00
Linus Torvalds
3b4b84b2ea Fix a recent IRQ affinities regression, add in a missing debugfs printout
that helps the debugging of IRQ affinity logic bugs, and fix a memory leak.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8nEn8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ipYA/+KOWjDuRp1YBZeZ4/55RjGzimsW5jkLIY
 0Na3WGjN/QBKCzmRJNnMyW1UjRgpHpBhOsphTcHVdhJo9jg5+DX+XdVTwKGTqAI+
 7DqzP4dzifSgUwdcxIbKwtZquBRzKk1K0Z25b6Jc0WJwkGRx3LWhhRDERPUEHtXg
 Sl07XxiuqFLcQZz9o3hisKzEfA2llB4bfXOjLCJlLK3HUZKccoBjWKbTrI3ymCiz
 f0iV9a7kNzo4fJNddKOBTtDWFEhpj6NgEVtLNdAaDti7MSSjPbB1BsiK64UInGMQ
 4881ItYAOHGuCHe8yYnjlWA5kmwX14KjN6c3RAXK3n4+wvf+17RJC+FLH1PbkFIx
 hZ8k9x2Y5Dpt8vD8fGkoqi2nr2JYbIiOm79AjrD+Li+wWKG3iw4AGEoBHBJzKHUb
 naEGiUDJpn7pdpPWMACoctAIhy7/gDA1pPyb5F7Bf/RwoskIyu4i/d/xz05zBg3H
 HZMC2Lqcgh7LTS91NnmCx8XELdgL14mN19LK5enH3QTIPtdxmZ5x4quKw6ajMAAQ
 jwRpExqy6E1TQkIG5T5hjT0EMuj4uA6OzaoeOroFzKuzo+jiEDl49WAx+9Im9oBb
 i7hT4PM/wR7BcfmTMVhmmns4Dp0LkW7dRxHIjo7Fzft5iF8UkO7o7A4VUoAIxrSm
 xDFlBO/mo3w=
 =oKU9
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2020-08-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Ingo Molnar:
 "Fix a recent IRQ affinities regression, add in a missing debugfs
  printout that helps the debugging of IRQ affinity logic bugs, and fix
  a memory leak"

* tag 'irq-urgent-2020-08-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/debugfs: Add missing irqchip flags
  genirq/affinity: Make affinity setting if activated opt-in
  irqdomain/treewide: Free firmware node after domain removal
2020-08-03 14:21:52 -07:00
Linus Torvalds
145ff1ec09 arm64 and cross-arch updates for 5.9:
- Removal of the tremendously unpopular read_barrier_depends() barrier,
   which is a NOP on all architectures apart from Alpha, in favour of
   allowing architectures to override READ_ONCE() and do whatever dance
   they need to do to ensure address dependencies provide LOAD ->
   LOAD/STORE ordering. This work also offers a potential solution if
   compilers are shown to convert LOAD -> LOAD address dependencies into
   control dependencies (e.g. under LTO), as weakly ordered architectures
   will effectively be able to upgrade READ_ONCE() to smp_load_acquire().
   The latter case is not used yet, but will be discussed further at LPC.
 
 - Make the MSI/IOMMU input/output ID translation PCI agnostic, augment
   the MSI/IOMMU ACPI/OF ID mapping APIs to accept an input ID
   bus-specific parameter and apply the resulting changes to the device
   ID space provided by the Freescale FSL bus.
 
 - arm64 support for TLBI range operations and translation table level
   hints (part of the ARMv8.4 architecture version).
 
 - Time namespace support for arm64.
 
 - Export the virtual and physical address sizes in vmcoreinfo for
   makedumpfile and crash utilities.
 
 - CPU feature handling cleanups and checks for programmer errors
   (overlapping bit-fields).
 
 - ACPI updates for arm64: disallow AML accesses to EFI code regions and
   kernel memory.
 
 - perf updates for arm64.
 
 - Miscellaneous fixes and cleanups, most notably PLT counting
   optimisation for module loading, recordmcount fix to ignore
   relocations other than R_AARCH64_CALL26, CMA areas reserved for
   gigantic pages on 16K and 64K configurations.
 
 - Trivial typos, duplicate words.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl8oTcsACgkQa9axLQDI
 XvEj6hAAkn39mO5xrR/Vhpg3DyFPk63ZlMSX9SsOeVyaLbovT6stTs1XAZXPpnkt
 rV3gwACyGSrqH6+uey9pHgHJuPF2TdrGEVK08yVKo9KGW/6yXSIncdKFE4jUJ/WJ
 wF5j7eMET2aGzcpm5AlzMmq6HOrKB8nZac9H8/x6H+Ox2WdgJkEjOkDvyqACUyum
 N3FsTZkWj2pIkTXHNgDZ8KjxVLO8HlFaB2hkxFDl9NPlX2UTCQJ8Tg1KiPLafKaK
 gUvH4usQDFdb5RU/UWogre37J4emO0ZTApZOyju+U+PMMWlWVHjZ4isUIS9zz/AE
 JNZ23dnKZX2HrYa5p8HZx175zwj/vXUqUHCZPLvQXaAudCEhF8BVljPiG0e80FV5
 GHFUgUbylKspp01I/9L+2JvsG96Mr0e+P3Sx7L2HTI42cmtoSa14+MpoSRj7zlft
 Qcl8hfrVOjCjUnFRHa/1y1cGvnD9GbgnKJR7zgVxl9bD/Jd48r1HUtwRORZCzWFr
 mRPVbPS72fWxMzMV9DZYJm02jJY9kLX2BMl49njbB8MhAhzOvrMVzoVVtMMeRFLR
 XHeJpmg36W09FiRGe7LRXlkXIhCQzQG2bJfiphuupCfhjRAitPoq8I925G6Pig60
 c8RWaXGU7PrEsdMNrL83vekvGKgqrkoFkRVtsCoQ2X6Hvu/XdYI=
 =mh79
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 and cross-arch updates from Catalin Marinas:
 "Here's a slightly wider-spread set of updates for 5.9.

  Going outside the usual arch/arm64/ area is the removal of
  read_barrier_depends() series from Will and the MSI/IOMMU ID
  translation series from Lorenzo.

  The notable arm64 updates include ARMv8.4 TLBI range operations and
  translation level hint, time namespace support, and perf.

  Summary:

   - Removal of the tremendously unpopular read_barrier_depends()
     barrier, which is a NOP on all architectures apart from Alpha, in
     favour of allowing architectures to override READ_ONCE() and do
     whatever dance they need to do to ensure address dependencies
     provide LOAD -> LOAD/STORE ordering.

     This work also offers a potential solution if compilers are shown
     to convert LOAD -> LOAD address dependencies into control
     dependencies (e.g. under LTO), as weakly ordered architectures will
     effectively be able to upgrade READ_ONCE() to smp_load_acquire().
     The latter case is not used yet, but will be discussed further at
     LPC.

   - Make the MSI/IOMMU input/output ID translation PCI agnostic,
     augment the MSI/IOMMU ACPI/OF ID mapping APIs to accept an input ID
     bus-specific parameter and apply the resulting changes to the
     device ID space provided by the Freescale FSL bus.

   - arm64 support for TLBI range operations and translation table level
     hints (part of the ARMv8.4 architecture version).

   - Time namespace support for arm64.

   - Export the virtual and physical address sizes in vmcoreinfo for
     makedumpfile and crash utilities.

   - CPU feature handling cleanups and checks for programmer errors
     (overlapping bit-fields).

   - ACPI updates for arm64: disallow AML accesses to EFI code regions
     and kernel memory.

   - perf updates for arm64.

   - Miscellaneous fixes and cleanups, most notably PLT counting
     optimisation for module loading, recordmcount fix to ignore
     relocations other than R_AARCH64_CALL26, CMA areas reserved for
     gigantic pages on 16K and 64K configurations.

   - Trivial typos, duplicate words"

Link: http://lkml.kernel.org/r/20200710165203.31284-1-will@kernel.org
Link: http://lkml.kernel.org/r/20200619082013.13661-1-lorenzo.pieralisi@arm.com

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (82 commits)
  arm64: use IRQ_STACK_SIZE instead of THREAD_SIZE for irq stack
  arm64/mm: save memory access in check_and_switch_context() fast switch path
  arm64: sigcontext.h: delete duplicated word
  arm64: ptrace.h: delete duplicated word
  arm64: pgtable-hwdef.h: delete duplicated words
  bus: fsl-mc: Add ACPI support for fsl-mc
  bus/fsl-mc: Refactor the MSI domain creation in the DPRC driver
  of/irq: Make of_msi_map_rid() PCI bus agnostic
  of/irq: make of_msi_map_get_device_domain() bus agnostic
  dt-bindings: arm: fsl: Add msi-map device-tree binding for fsl-mc bus
  of/device: Add input id to of_dma_configure()
  of/iommu: Make of_map_rid() PCI agnostic
  ACPI/IORT: Add an input ID to acpi_dma_configure()
  ACPI/IORT: Remove useless PCI bus walk
  ACPI/IORT: Make iort_msi_map_rid() PCI agnostic
  ACPI/IORT: Make iort_get_device_domain IRQ domain agnostic
  ACPI/IORT: Make iort_match_node_callback walk the ACPI namespace for NC
  arm64: enable time namespace support
  arm64/vdso: Restrict splitting VVAR VMA
  arm64/vdso: Handle faults on timens page
  ...
2020-08-03 14:11:08 -07:00
Linus Torvalds
05119217a9 Remove unicore32 support
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAl8eiUATHHJwcHRAbGlu
 dXguaWJtLmNvbQAKCRA5A4Ymyw79kWs8B/4wEWVJGTkjyrMX57/Ew8yRYAJE6JjA
 kSONPjElVrPR1pRLYyjyde+zqumkJFhk+41De09J2byL29p7tK8ISNrTwJrIN7n/
 dzT73CmuNEjI0rZJxPX+USKFph75FQVvAVOOWs+6fiBFxdUaIsBheVntH7/NsCTk
 HFrIjIn5wXFVs5Nh+2cHydvEpOVoUWzjvs+uJIEpHCVCBz6gaYq2dxEmeTquKuz1
 k7PZqCqVsyB9iWLqN65/Q+30N8znJwcUl8HAzs5nvPrXLjGxwuEjOxtYYhbdLAfP
 OBiIF9J77sZxBlms0WNomDW3Rr5Vlt5nF9oUWpi3AmHNWIuX0GkM4i0C
 =V+Kl
 -----END PGP SIGNATURE-----

Merge tag 'rm-unicore32' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux

Pull unicore32 removal from Mike Rapoport:
 "Remove unicore32 support.

  The unicore32 port do not seem maintained for a long time now, there
  is no upstream toolchain that can create unicore32 binaries and all
  the links to prebuilt toolchains for unicore32 are dead. Even
  compilers that were available are not supported by the kernel anymore.

  Guenter Roeck says:
    "I have stopped building unicore32 images since v4.19 since there is
     no available compiler that is still supported by the kernel. I am
     surprised that support for it has not been removed from the kernel"

  However, it's worth pointing out two things:

   - Guan Xuetao is still listed as maintainer and asked for the port to
     be kept around the last time Arnd suggested removing it two years
     ago. He promised that there would be compiler sources (presumably
     llvm), but has not made those available since.

   - https://github.com/gxt has patches to linux-4.9 and qemu-2.7, both
     released in 2016, with patches dated early 2019. These patches
     mainly restore a syscall ABI that was never part of mainline Linux
     but apparently used in production. qemu-2.8 removed support for
     that ABI and newer kernels (4.19+) can no longer be built with the
     old toolchain, so apparently there will not be any future updates
     to that git tree"

* tag 'rm-unicore32' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux:
  MAINTAINERS: remove "PKUNITY SOC DRIVERS" entry
  rtc: remove fb-puv3  driver
  video: fbdev: remove fb-puv3  driver
  pwm: remove pwm-puv3  driver
  input: i8042: remove support for 8042-unicore32io
  i2c/buses: remove i2c-puv3  driver
  cpufreq: remove unicore32 driver
  arch: remove unicore32 port
2020-08-03 14:00:16 -07:00
Peng Fan
231621d0c5 tracing/uprobe: Remove dead code in trace_uprobe_register()
In the function trace_uprobe_register(), the statement "return 0;"
out of switch case is dead code, remove it.

Link: https://lkml.kernel.org/r/1595561064-29186-1-git-send-email-fanpeng@loongson.cn

Signed-off-by: Peng Fan <fanpeng@loongson.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 16:16:46 -04:00
Muchun Song
0cb2f1372b kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler
We found a case of kernel panic on our server. The stack trace is as
follows(omit some irrelevant information):

  BUG: kernel NULL pointer dereference, address: 0000000000000080
  RIP: 0010:kprobe_ftrace_handler+0x5e/0xe0
  RSP: 0018:ffffb512c6550998 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: ffff8e9d16eea018 RCX: 0000000000000000
  RDX: ffffffffbe1179c0 RSI: ffffffffc0535564 RDI: ffffffffc0534ec0
  RBP: ffffffffc0534ec1 R08: ffff8e9d1bbb0f00 R09: 0000000000000004
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff8e9d1f797060 R14: 000000000000bacc R15: ffff8e9ce13eca00
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000080 CR3: 00000008453d0005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   ftrace_ops_assist_func+0x56/0xe0
   ftrace_call+0x5/0x34
   tcpa_statistic_send+0x5/0x130 [ttcp_engine]

The tcpa_statistic_send is the function being kprobed. After analysis,
the root cause is that the fourth parameter regs of kprobe_ftrace_handler
is NULL. Why regs is NULL? We use the crash tool to analyze the kdump.

  crash> dis tcpa_statistic_send -r
         <tcpa_statistic_send>: callq 0xffffffffbd8018c0 <ftrace_caller>

The tcpa_statistic_send calls ftrace_caller instead of ftrace_regs_caller.
So it is reasonable that the fourth parameter regs of kprobe_ftrace_handler
is NULL. In theory, we should call the ftrace_regs_caller instead of the
ftrace_caller. After in-depth analysis, we found a reproducible path.

  Writing a simple kernel module which starts a periodic timer. The
  timer's handler is named 'kprobe_test_timer_handler'. The module
  name is kprobe_test.ko.

  1) insmod kprobe_test.ko
  2) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'
  3) echo 0 > /proc/sys/kernel/ftrace_enabled
  4) rmmod kprobe_test
  5) stop step 2) kprobe
  6) insmod kprobe_test.ko
  7) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'

We mark the kprobe as GONE but not disarm the kprobe in the step 4).
The step 5) also do not disarm the kprobe when unregister kprobe. So
we do not remove the ip from the filter. In this case, when the module
loads again in the step 6), we will replace the code to ftrace_caller
via the ftrace_module_enable(). When we register kprobe again, we will
not replace ftrace_caller to ftrace_regs_caller because the ftrace is
disabled in the step 3). So the step 7) will trigger kernel panic. Fix
this problem by disarming the kprobe when the module is going away.

Link: https://lkml.kernel.org/r/20200728064536.24405-1-songmuchun@bytedance.com

Cc: stable@vger.kernel.org
Fixes: ae6aa16fdc ("kprobes: introduce ftrace based optimization")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 16:14:54 -04:00
Josef Bacik
c58b6b0372 ftrace: Fix ftrace_trace_task return value
I was attempting to use pid filtering with function_graph, but it wasn't
allowing anything to make it through.  Turns out ftrace_trace_task
returns false if ftrace_ignore_pid is not-empty, which isn't correct
anymore.  We're now setting it to FTRACE_PID_IGNORE if we need to ignore
that pid, otherwise it's set to the pid (which is weird considering the
name) or to FTRACE_PID_TRACE.  Fix the check to check for !=
FTRACE_PID_IGNORE.  With this we can now use function_graph with pid
filtering.

Link: https://lkml.kernel.org/r/20200725005048.1790-1-josef@toxicpanda.com

Fixes: 717e3f5ebc ("ftrace: Make function trace pid filtering a bit more exact")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 16:12:31 -04:00
Linus Torvalds
382625d0d4 for-5.9/block-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
 yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
 FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
 x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
 b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
 SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
 yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
 GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
 ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
 +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
 RzAUdZFbWw==
 =abJG
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Good amount of cleanups and tech debt removals in here, and as a
  result, the diffstat shows a nice net reduction in code.

   - Softirq completion cleanups (Christoph)

   - Stop using ->queuedata (Christoph)

   - Cleanup bd claiming (Christoph)

   - Use check_events, moving away from the legacy media change
     (Christoph)

   - Use inode i_blkbits consistently (Christoph)

   - Remove old unused writeback congestion bits (Christoph)

   - Cleanup/unify submission path (Christoph)

   - Use bio_uninit consistently, instead of bio_disassociate_blkg
     (Christoph)

   - sbitmap cleared bits handling (John)

   - Request merging blktrace event addition (Jan)

   - sysfs add/remove race fixes (Luis)

   - blk-mq tag fixes/optimizations (Ming)

   - Duplicate words in comments (Randy)

   - Flush deferral cleanup (Yufen)

   - IO context locking/retry fixes (John)

   - struct_size() usage (Gustavo)

   - blk-iocost fixes (Chengming)

   - blk-cgroup IO stats fixes (Boris)

   - Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
  block: blk-timeout: delete duplicated word
  block: blk-mq-sched: delete duplicated word
  block: blk-mq: delete duplicated word
  block: genhd: delete duplicated words
  block: elevator: delete duplicated word and fix typos
  block: bio: delete duplicated words
  block: bfq-iosched: fix duplicated word
  iocost_monitor: start from the oldest usage index
  iocost: Fix check condition of iocg abs_vdebt
  block: Remove callback typedefs for blk_mq_ops
  block: Use non _rcu version of list functions for tag_set_list
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  ...
2020-08-03 11:57:03 -07:00
Linus Torvalds
ab5c60b79a Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add support for allocating transforms on a specific NUMA Node
   - Introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY for storage users

  Algorithms:
   - Drop PMULL based ghash on arm64
   - Fixes for building with clang on x86
   - Add sha256 helper that does the digest in one go
   - Add SP800-56A rev 3 validation checks to dh

  Drivers:
   - Permit users to specify NUMA node in hisilicon/zip
   - Add support for i.MX6 in imx-rngc
   - Add sa2ul crypto driver
   - Add BA431 hwrng driver
   - Add Ingenic JZ4780 and X1000 hwrng driver
   - Spread IRQ affinity in inside-secure and marvell/cesa"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (157 commits)
  crypto: sa2ul - Fix inconsistent IS_ERR and PTR_ERR
  hwrng: core - remove redundant initialization of variable ret
  crypto: x86/curve25519 - Remove unused carry variables
  crypto: ingenic - Add hardware RNG for Ingenic JZ4780 and X1000
  dt-bindings: RNG: Add Ingenic RNG bindings.
  crypto: caam/qi2 - add module alias
  crypto: caam - add more RNG hw error codes
  crypto: caam/jr - remove incorrect reference to caam_jr_register()
  crypto: caam - silence .setkey in case of bad key length
  crypto: caam/qi2 - create ahash shared descriptors only once
  crypto: caam/qi2 - fix error reporting for caam_hash_alloc
  crypto: caam - remove deadcode on 32-bit platforms
  crypto: ccp - use generic power management
  crypto: xts - Replace memcpy() invocation with simple assignment
  crypto: marvell/cesa - irq balance
  crypto: inside-secure - irq balance
  crypto: ecc - SP800-56A rev 3 local public key validation
  crypto: dh - SP800-56A rev 3 local public key validation
  crypto: dh - check validity of Z before export
  lib/mpi: Add mpi_sub_ui()
  ...
2020-08-03 10:40:14 -07:00
Zhaoyang Huang
0f69dae4d1 trace : Have tracing buffer info use kvzalloc instead of kzalloc
High order memory stuff within trace could introduce OOM, use kvzalloc instead.

Please find the bellowing for the call stack we run across in an android system.
The scenario happens when traced_probes is woken up to get a large quantity of
trace even if free memory is even higher than watermark_low. 

traced_probes invoked oom-killer: gfp_mask=0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),  order=2, oom_score_adj=-1

traced_probes cpuset=system-background mems_allowed=0
CPU: 3 PID: 588 Comm: traced_probes Tainted: G        W  O    4.14.181 #1
Hardware name: Generic DT based system
(unwind_backtrace) from [<c010d824>] (show_stack+0x20/0x24)
(show_stack) from [<c0b2e174>] (dump_stack+0xa8/0xec)
(dump_stack) from [<c027d584>] (dump_header+0x9c/0x220)
(dump_header) from [<c027cfe4>] (oom_kill_process+0xc0/0x5c4)
(oom_kill_process) from [<c027cb94>] (out_of_memory+0x220/0x310)
(out_of_memory) from [<c02816bc>] (__alloc_pages_nodemask+0xff8/0x13a4)
(__alloc_pages_nodemask) from [<c02a6a1c>] (kmalloc_order+0x30/0x48)
(kmalloc_order) from [<c02a6a64>] (kmalloc_order_trace+0x30/0x118)
(kmalloc_order_trace) from [<c0223d7c>] (tracing_buffers_open+0x50/0xfc)
(tracing_buffers_open) from [<c02e6f58>] (do_dentry_open+0x278/0x34c)
(do_dentry_open) from [<c02e70d0>] (vfs_open+0x50/0x70)
(vfs_open) from [<c02f7c24>] (path_openat+0x5fc/0x169c)
(path_openat) from [<c02f75c4>] (do_filp_open+0x94/0xf8)
(do_filp_open) from [<c02e7650>] (do_sys_open+0x168/0x26c)
(do_sys_open) from [<c02e77bc>] (SyS_openat+0x34/0x38)
(SyS_openat) from [<c0108bc0>] (ret_fast_syscall+0x0/0x28)

Link: https://lkml.kernel.org/r/1596155265-32365-1-git-send-email-zhaoyang.huang@unisoc.com

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-08-03 11:52:20 -04:00
Thomas Gleixner
3d5128c1de irqchip updates for Linux 5.9
- Add infrastructure to allow DT irqchip platform drivers to
   be built as modules
 - Allow qcom-pdc, mtk-cirq and mtk-sysirq to be built as module
 - Fix ACPI probing to avoid abusing function pointer casting
 - Allow bcm7120-l2 and brcmstb-l2 to be used as wake-up sources
 - Teach NXP's IMX INTMUX some power management
 - Allow stm32-exti to be used as a hierarchical irqchip
 - Let stm32-exti use the hw spinlock API in its full glory
 - A couple of GICv4.1 fixes
 - Tons of cleanups (mtk-sysirq, aic5, bcm7038-l1, imx-intmux,
   brcmstb-l2, ativic32, ti-sci-inta, lonsoon, MIPS GIC, GICv3)
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl8n5hEPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDcE8P/1qNZD3riUrljI/LubsT13ernJ8jeSv658Xp
 YYZ1ItJ9I5Bwcwi/mqrQCULmHWXBVtXIGU7mzaFAXskfVR09tjmmMHbVyB+AT9OR
 C4zH2+G0Hl8axYtQwDrUP/klCLy9GDPvTPTFhmX3eiOwfEGXfBD5bw0Za9lQJ2OL
 SttVxYp/4xJQli7LvOFJ8RrvF9egW5O0mbGTKGhwi+yBEuFanJw5xwn3PYHaApLk
 gpxdcESZskZo6CaKUVFCVr+/t/P6hO2aGv+y4QQMzC3g/wr6evkxYrFZuc3lWtku
 UieGwxfTS1PA16h9ndwXdH6JIlbaynsHkeCY+xKNqwTE+wf4pDdP2zsUjsf8NPBy
 BupyajOpQ1T3m4G4Y6DymoEb+7LyJUddSL0kuFSRd33Y0pf9BskYlHycAkXhCzLZ
 8kZp09SLh6ujRCjjgtHyfOw0/0ZuVmNlt6v/DdoLOAN228smH5KIdwXb46wbox1o
 hFyvPOg1BuGIpDLET+qja+ajZHkPbPBQKsfbG0xWfGOhlYNnMyd8L3RL/IkEuunQ
 RVKpHQTXYOfWpV2apklGzZP6XiYyEYF5cIiP7ECAqbcOTTX1JDghbsXNHdt1/L+Y
 NEwJYk2C7XFOqaOx6ZGffxrA2dkr9jE47aRr5WarYcOHOBBksoL4qZs3HHSvFb94
 2FjSVo+U
 =hgPS
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

 - Add infrastructure to allow DT irqchip platform drivers to
   be built as modules
 - Allow qcom-pdc, mtk-cirq and mtk-sysirq to be built as module
 - Fix ACPI probing to avoid abusing function pointer casting
 - Allow bcm7120-l2 and brcmstb-l2 to be used as wake-up sources
 - Teach NXP's IMX INTMUX some power management
 - Allow stm32-exti to be used as a hierarchical irqchip
 - Let stm32-exti use the hw spinlock API in its full glory
 - A couple of GICv4.1 fixes
 - Tons of cleanups (mtk-sysirq, aic5, bcm7038-l1, imx-intmux,
   brcmstb-l2, ativic32, ti-sci-inta, lonsoon, MIPS GIC, GICv3)
2020-08-03 14:33:23 +02:00
Rafael J. Wysocki
86ba54fb08 Merge branches 'pm-sleep', 'pm-domains', 'powercap' and 'pm-tools'
* pm-sleep:
  PM: sleep: spread "const char *" correctness
  PM: hibernate: fix white space in a few places
  freezer: Add unsafe version of freezable_schedule_timeout_interruptible() for NFS
  PM: sleep: core: Emit changed uevent on wakeup_sysfs_add/remove

* pm-domains:
  PM: domains: Restore comment indentation for generic_pm_domain.child_links
  PM: domains: Fix up terminology with parent/child

* powercap:
  powercap: Add Power Limit4 support
  powercap: idle_inject: Replace play_idle() with play_idle_precise() in comments
  powercap: intel_rapl: add support for Sapphire Rapids

* pm-tools:
  pm-graph v5.7 - important s2idle fixes
  cpupower: Replace HTTP links with HTTPS ones
  cpupower: Fix NULL but dereferenced coccicheck errors
  cpupower: Fix comparing pointer to 0 coccicheck warns
2020-08-03 13:12:44 +02:00
Rafael J. Wysocki
c81b30c895 Merge branch 'pm-cpufreq'
* pm-cpufreq: (24 commits)
  cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode
  cpufreq: intel_pstate: Rearrange the storing of new EPP values
  cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
  cpufreq: intel_pstate: Clean up aperf_mperf_shift description
  cpufreq: powernv: Make some symbols static
  cpufreq: amd_freq_sensitivity: Mark sometimes used ID structs as __maybe_unused
  cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift()
  cpufreq: pcc-cpufreq: Mark sometimes used ID structs as __maybe_unused
  cpufreq: powernow-k8: Mark 'hi' and 'lo' dummy variables as __always_unused
  cpufreq: acpi-cpufreq: Mark sometimes used ID structs as __maybe_unused
  cpufreq: acpi-cpufreq: Mark 'dummy' variable as __always_unused
  cpufreq: powernv-cpufreq: Fix a bunch of kerneldoc related issues
  cpufreq: pasemi: Include header file for {check,restore}_astate prototypes
  cpufreq: cpufreq_governor: Demote store_sampling_rate() header to standard comment block
  cpufreq: cpufreq: Demote lots of function headers unworthy of kerneldoc status
  cpufreq: freq_table: Demote obvious misuse of kerneldoc to standard comment blocks
  cpufreq: Replace HTTP links with HTTPS ones
  cpufreq: intel_pstate: Fix static checker warning for epp variable
  cpufreq: Remove the weakly defined cpufreq_default_governor()
  cpufreq: Specify default governor on command line
  ...
2020-08-03 13:12:36 +02:00
Rafael J. Wysocki
5b5642075c Merge branches 'pm-em' and 'pm-core'
* pm-em:
  OPP: refactor dev_pm_opp_of_register_em() and update related drivers
  Documentation: power: update Energy Model description
  PM / EM: change name of em_pd_energy to em_cpu_energy
  PM / EM: remove em_register_perf_domain
  PM / EM: add support for other devices than CPUs in Energy Model
  PM / EM: update callback structure and add device pointer
  PM / EM: introduce em_dev_register_perf_domain function
  PM / EM: change naming convention from 'capacity' to 'performance'

* pm-core:
  mmc: jz4740: Use pm_ptr() macro
  PM: Make *_DEV_PM_OPS macros use __maybe_unused
  PM: core: introduce pm_ptr() macro
2020-08-03 13:11:39 +02:00
Ingo Molnar
992414a18c Merge branch 'locking/nmi' into locking/core, to pick up completed topic branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-08-03 13:00:27 +02:00
Linus Torvalds
c6fe44d96f list: add "list_del_init_careful()" to go with "list_empty_careful()"
That gives us ordering guarantees around the pair.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-02 20:39:44 -07:00
David S. Miller
bd0b33b248 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Resolved kernel/bpf/btf.c using instructions from merge commit
69138b34a7

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-02 01:02:12 -07:00
Andrii Nakryiko
73b11c2ab0 bpf: Add support for forced LINK_DETACH command
Add LINK_DETACH command to force-detach bpf_link without destroying it. It has
the same behavior as auto-detaching of bpf_link due to cgroup dying for
bpf_cgroup_link or net_device being destroyed for bpf_xdp_link. In such case,
bpf_link is still a valid kernel object, but is defuncts and doesn't hold BPF
program attached to corresponding BPF hook. This functionality allows users
with enough access rights to manually force-detach attached bpf_link without
killing respective owner process.

This patch implements LINK_DETACH for cgroup, xdp, and netns links, mostly
re-using existing link release handling code.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200731182830.286260-2-andriin@fb.com
2020-08-01 20:38:28 -07:00
Linus Torvalds
ac3a0c8472 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Encap offset calculation is incorrect in esp6, from Sabrina Dubroca.

 2) Better parameter validation in pfkey_dump(), from Mark Salyzyn.

 3) Fix several clang issues on powerpc in selftests, from Tanner Love.

 4) cmsghdr_from_user_compat_to_kern() uses the wrong length, from Al
    Viro.

 5) Out of bounds access in mlx5e driver, from Raed Salem.

 6) Fix transfer buffer memleak in lan78xx, from Johan Havold.

 7) RCU fixups in rhashtable, from Herbert Xu.

 8) Fix ipv6 nexthop refcnt leak, from Xiyu Yang.

 9) vxlan FDB dump must be done under RCU, from Ido Schimmel.

10) Fix use after free in mlxsw, from Ido Schimmel.

11) Fix map leak in HASH_OF_MAPS bpf code, from Andrii Nakryiko.

12) Fix bug in mac80211 Tx ack status reporting, from Vasanthakumar
    Thiagarajan.

13) Fix memory leaks in IPV6_ADDRFORM code, from Cong Wang.

14) Fix bpf program reference count leaks in mlx5 during
    mlx5e_alloc_rq(), from Xin Xiong.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits)
  vxlan: fix memleak of fdb
  rds: Prevent kernel-infoleak in rds_notify_queue_get()
  net/sched: The error lable position is corrected in ct_init_module
  net/mlx5e: fix bpf_prog reference count leaks in mlx5e_alloc_rq
  net/mlx5e: E-Switch, Specify flow_source for rule with no in_port
  net/mlx5e: E-Switch, Add misc bit when misc fields changed for mirroring
  net/mlx5e: CT: Support restore ipv6 tunnel
  net: gemini: Fix missing clk_disable_unprepare() in error path of gemini_ethernet_port_probe()
  ionic: unlock queue mutex in error path
  atm: fix atm_dev refcnt leaks in atmtcp_remove_persistent
  net: ethernet: mtk_eth_soc: fix MTU warnings
  net: nixge: fix potential memory leak in nixge_probe()
  devlink: ignore -EOPNOTSUPP errors on dumpit
  rxrpc: Fix race between recvmsg and sendmsg on immediate call failure
  MAINTAINERS: Replace Thor Thayer as Altera Triple Speed Ethernet maintainer
  selftests/bpf: fix netdevsim trap_flow_action_cookie read
  ipv6: fix memory leaks on IPV6_ADDRFORM path
  net/bpfilter: Initialize pos in __bpfilter_process_sockopt
  igb: reinit_locked() should be called with rtnl_lock
  e1000e: continue to init PHY even when failed to disable ULP
  ...
2020-08-01 16:47:24 -07:00
Linus Torvalds
0ae3495b65 for-linus-2020-08-01
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXyXDTQAKCRCRxhvAZXjc
 olxlAQDCiyWstd8pmtyX4vuaoyDZ6re6P/TCr3mzr6tQyux/zgD/chlfAvJdyzk8
 2Tw44odp3gF5EfzF+5wx2whZZPfVrQY=
 =Hv2c
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fix from Christian Brauner:
 "A simple spelling fix for dequeue_synchronous_signal()"

* tag 'for-linus-2020-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  signal: fix typo in dequeue_synchronous_signal()
2020-08-01 16:40:59 -07:00
Christoph Hellwig
ef1dac6021 modules: return licensing information from find_symbol
Report the GPLONLY status through a new argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-01 16:05:02 +02:00
Christoph Hellwig
cd8732cdcc modules: rename the licence field in struct symsearch to license
Use the same spelling variant as the rest of the file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-01 16:05:02 +02:00
Christoph Hellwig
34e64705ad modules: unexport __module_address
__module_address is only used by built-in code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-01 16:05:01 +02:00
Christoph Hellwig
3fe1e56d0e modules: unexport __module_text_address
__module_text_address is only used by built-in code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-01 16:05:00 +02:00
Christoph Hellwig
a54e04914c modules: mark each_symbol_section static
each_symbol_section is only used inside of module.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-01 16:05:00 +02:00
Christoph Hellwig
773110470e modules: mark find_symbol static
find_symbol is only used in module.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-01 16:04:59 +02:00
Christoph Hellwig
7ef5264de7 modules: mark ref_module static
ref_module isn't used anywhere outside of module.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-08-01 16:04:55 +02:00
Ingo Molnar
63722bbca6 Merge branch 'kcsan' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core
Pull v5.9 KCSAN bits from Paul E. McKenney.

Perhaps the most important change is that GCC 11 now has all fixes in place
to support KCSAN, so GCC support can be enabled again.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-08-01 09:26:27 +02:00
Valentin Schneider
f4470cdf10 sched: Document arch_scale_*_capacity()
Rather that hide their purpose in some dark, damp corner of Documentation/,
add some documentation to the default implementations.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200731192016.7484-2-valentin.schneider@arm.com
2020-08-01 09:19:43 +02:00
David S. Miller
69138b34a7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-07-31

The following pull-request contains BPF updates for your *net* tree.

We've added 5 non-merge commits during the last 21 day(s) which contain
a total of 5 files changed, 126 insertions(+), 18 deletions(-).

The main changes are:

1) Fix a map element leak in HASH_OF_MAPS map type, from Andrii Nakryiko.

2) Fix a NULL pointer dereference in __btf_resolve_helper_id() when no
   btf_vmlinux is available, from Peilin Ye.

3) Init pos variable in __bpfilter_process_sockopt(), from Christoph Hellwig.

4) Fix a cgroup sockopt verifier test by specifying expected attach type,
   from Jean-Philippe Brucker.

Note that when net gets merged into net-next later on, there is a small
merge conflict in kernel/bpf/btf.c between commit 5b801dfb7f ("bpf: Fix
NULL pointer dereference in __btf_resolve_helper_id()") from the bpf tree
and commit 138b9a0511 ("bpf: Remove btf_id helpers resolving") from the
net-next tree.

Resolve as follows: remove the old hunk with the __btf_resolve_helper_id()
function. Change the btf_resolve_helper_id() so it actually tests for a
NULL btf_vmlinux and bails out:

int btf_resolve_helper_id(struct bpf_verifier_log *log,
                          const struct bpf_func_proto *fn, int arg)
{
        int id;

        if (fn->arg_type[arg] != ARG_PTR_TO_BTF_ID || !btf_vmlinux)
                return -EINVAL;
        id = fn->btf_id[arg];
        if (!id || id > btf_vmlinux->nr_types)
                return -EINVAL;
        return id;
}

Let me know if you run into any others issues (CC'ing Jiri Olsa so he's in
the loop with regards to merge conflict resolution).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 17:19:47 -07:00
Catalin Marinas
4557062da7 Merge branches 'for-next/misc', 'for-next/vmcoreinfo', 'for-next/cpufeature', 'for-next/acpi', 'for-next/perf', 'for-next/timens', 'for-next/msi-iommu' and 'for-next/trivial' into for-next/core
* for-next/misc:
  : Miscellaneous fixes and cleanups
  arm64: use IRQ_STACK_SIZE instead of THREAD_SIZE for irq stack
  arm64/mm: save memory access in check_and_switch_context() fast switch path
  recordmcount: only record relocation of type R_AARCH64_CALL26 on arm64.
  arm64: Reserve HWCAP2_MTE as (1 << 18)
  arm64/entry: deduplicate SW PAN entry/exit routines
  arm64: s/AMEVTYPE/AMEVTYPER
  arm64/hugetlb: Reserve CMA areas for gigantic pages on 16K and 64K configs
  arm64: stacktrace: Move export for save_stack_trace_tsk()
  smccc: Make constants available to assembly
  arm64/mm: Redefine CONT_{PTE, PMD}_SHIFT
  arm64/defconfig: Enable CONFIG_KEXEC_FILE
  arm64: Document sysctls for emulated deprecated instructions
  arm64/panic: Unify all three existing notifier blocks
  arm64/module: Optimize module load time by optimizing PLT counting

* for-next/vmcoreinfo:
  : Export the virtual and physical address sizes in vmcoreinfo
  arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo
  crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo

* for-next/cpufeature:
  : CPU feature handling cleanups
  arm64/cpufeature: Validate feature bits spacing in arm64_ftr_regs[]
  arm64/cpufeature: Replace all open bits shift encodings with macros
  arm64/cpufeature: Add remaining feature bits in ID_AA64MMFR2 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64MMFR1 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64MMFR0 register

* for-next/acpi:
  : ACPI updates for arm64
  arm64/acpi: disallow writeable AML opregion mapping for EFI code regions
  arm64/acpi: disallow AML memory opregions to access kernel memory

* for-next/perf:
  : perf updates for arm64
  arm64: perf: Expose some new events via sysfs
  tools headers UAPI: Update tools's copy of linux/perf_event.h
  arm64: perf: Add cap_user_time_short
  perf: Add perf_event_mmap_page::cap_user_time_short ABI
  arm64: perf: Only advertise cap_user_time for arch_timer
  arm64: perf: Implement correct cap_user_time
  time/sched_clock: Use raw_read_seqcount_latch()
  sched_clock: Expose struct clock_read_data
  arm64: perf: Correct the event index in sysfs
  perf/smmuv3: To simplify code for ioremap page in pmcg

* for-next/timens:
  : Time namespace support for arm64
  arm64: enable time namespace support
  arm64/vdso: Restrict splitting VVAR VMA
  arm64/vdso: Handle faults on timens page
  arm64/vdso: Add time namespace page
  arm64/vdso: Zap vvar pages when switching to a time namespace
  arm64/vdso: use the fault callback to map vvar pages

* for-next/msi-iommu:
  : Make the MSI/IOMMU input/output ID translation PCI agnostic, augment the
  : MSI/IOMMU ACPI/OF ID mapping APIs to accept an input ID bus-specific parameter
  : and apply the resulting changes to the device ID space provided by the
  : Freescale FSL bus
  bus: fsl-mc: Add ACPI support for fsl-mc
  bus/fsl-mc: Refactor the MSI domain creation in the DPRC driver
  of/irq: Make of_msi_map_rid() PCI bus agnostic
  of/irq: make of_msi_map_get_device_domain() bus agnostic
  dt-bindings: arm: fsl: Add msi-map device-tree binding for fsl-mc bus
  of/device: Add input id to of_dma_configure()
  of/iommu: Make of_map_rid() PCI agnostic
  ACPI/IORT: Add an input ID to acpi_dma_configure()
  ACPI/IORT: Remove useless PCI bus walk
  ACPI/IORT: Make iort_msi_map_rid() PCI agnostic
  ACPI/IORT: Make iort_get_device_domain IRQ domain agnostic
  ACPI/IORT: Make iort_match_node_callback walk the ACPI namespace for NC

* for-next/trivial:
  : Trivial fixes
  arm64: sigcontext.h: delete duplicated word
  arm64: ptrace.h: delete duplicated word
  arm64: pgtable-hwdef.h: delete duplicated words
2020-07-31 18:09:39 +01:00
Ingo Molnar
28cff52eae Merge branch 'linus' into locking/core, to resolve conflict
Conflicts:
	arch/arm/include/asm/percpu.h

As Stephen Rothwell noted, there's a conflict between this commit
in locking/core:

  a21ee6055c ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables")

and this fresh upstream commit:

  aa54ea903a ("ARM: percpu.h: fix build error")

a21ee6055c is a simpler solution to the dependency problem and doesn't
further increase header hell - so this conflict resolution effectively
reverts aa54ea903a and uses the a21ee6055c solution.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-31 12:16:09 +02:00
Marco Elver
92c209ac6d kcsan: Improve IRQ state trace reporting
To improve the general usefulness of the IRQ state trace events with
KCSAN enabled, save and restore the trace information when entering and
exiting the KCSAN runtime as well as when generating a KCSAN report.

Without this, reporting the IRQ trace events (whether via a KCSAN report
or outside of KCSAN via a lockdep report) is rather useless due to
continuously being touched by KCSAN. This is because if KCSAN is
enabled, every instrumented memory access causes changes to IRQ trace
events (either by KCSAN disabling/enabling interrupts or taking
report_lock when generating a report).

Before "lockdep: Prepare for NMI IRQ state tracking", KCSAN avoided
touching the IRQ trace events via raw_local_irq_save/restore() and
lockdep_off/on().

Fixes: 248591f5d2 ("kcsan: Make KCSAN compatible with new IRQ state tracking")
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200729110916.3920464-2-elver@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-31 12:12:03 +02:00
Marco Elver
0584df9c12 lockdep: Refactor IRQ trace events fields into struct
Refactor the IRQ trace events fields, used for printing information
about the IRQ trace events, into a separate struct 'irqtrace_events'.

This improves readability by separating the information only used in
reporting, as well as enables (simplified) storing/restoring of
irqtrace_events snapshots.

No functional change intended.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200729110916.3920464-1-elver@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-31 12:11:58 +02:00
Vincent Whitchurch
ee896ee805 tracing: Remove outdated comment in stack handling
This comment describes the behaviour before commit 2a820bf749
("tracing: Use percpu stack trace buffer more intelligently").  Since
that commit, interrupts and NMIs do use the per-cpu stacks so the
comment is no longer correct.  Remove it.

(Note that the FTRACE_STACK_SIZE mentioned in the comment has never
existed, it probably should have said FTRACE_STACK_ENTRIES.)

Link: https://lkml.kernel.org/r/20200727092840.18659-1-vincent.whitchurch@axis.com

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 22:54:50 -04:00
Chengming Zhou
c5f51572a7 ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines
When inserting a module, we find all ftrace_ops referencing it on the
ftrace_ops_list. But FTRACE_OPS_FL_DIRECT and FTRACE_OPS_FL_IPMODIFY
flags are special, and should not be set automatically. So warn and
skip ftrace_ops that have these two flags set and adding new code.
Also check if only one ftrace_ops references the module, in which case
we can use a trampoline as an optimization.

Link: https://lkml.kernel.org/r/20200728180554.65203-2-zhouchengming@bytedance.com

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 22:45:31 -04:00
Chengming Zhou
8a224ffb3f ftrace: Setup correct FTRACE_FL_REGS flags for module
When module loaded and enabled, we will use __ftrace_replace_code
for module if any ftrace_ops referenced it found. But we will get
wrong ftrace_addr for module rec in ftrace_get_addr_new, because
rec->flags has not been setup correctly. It can cause the callback
function of a ftrace_ops has FTRACE_OPS_FL_SAVE_REGS to be called
with pt_regs set to NULL.
So setup correct FTRACE_FL_REGS flags for rec when we call
referenced_filters to find ftrace_ops references it.

Link: https://lkml.kernel.org/r/20200728180554.65203-1-zhouchengming@bytedance.com

Cc: stable@vger.kernel.org
Fixes: 8c4f3c3fa9 ("ftrace: Check module functions being traced on reload")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 19:35:19 -04:00
Kevin Hao
96b4833b68 tracing/hwlat: Honor the tracing_cpumask
In calculation of the cpu mask for the hwlat kernel thread, the wrong
cpu mask is used instead of the tracing_cpumask, this causes the
tracing/tracing_cpumask useless for hwlat tracer. Fixes it.

Link: https://lkml.kernel.org/r/20200730082318.42584-2-haokexin@gmail.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 19:35:04 -04:00
Kevin Hao
a9d0ba6772 tracing/hwlat: Drop the duplicate assignment in start_kthread()
We have set 'current_mask' to '&save_cpumask' in its declaration,
so there is no need to assign again.

Link: https://lkml.kernel.org/r/20200730082318.42584-1-haokexin@gmail.com

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-30 19:35:04 -04:00
Yonghong Song
4fc00b79b8 bpf: Add missing newline characters in verifier error messages
Newline characters are added in two verifier error messages,
refactored in Commit afbf21dce6 ("bpf: Support readonly/readwrite
buffers in verifier"). This way, they do not mix with
messages afterwards.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200728221801.1090349-1-yhs@fb.com
2020-07-31 00:43:49 +02:00
Ingo Molnar
c1cc4784ce Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v5.9 RCU bits from Paul E. McKenney:

 - Documentation updates
 - Miscellaneous fixes
 - kfree_rcu updates
 - RCU tasks updates
 - Read-side scalability tests
 - SRCU updates
 - Torture-test updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-31 00:15:53 +02:00
Romain Perier
12cc923f1c tasklet: Introduce new initialization API
Nowadays, modern kernel subsystems that use callbacks pass the data
structure associated with a given callback as argument to the callback.
The tasklet subsystem remains one which passes an arbitrary unsigned
long to the callback function. This has several problems:

- This keeps an extra field for storing the argument in each tasklet
  data structure, it bloats the tasklet_struct structure with a redundant
  .data field

- No type checking can be performed on this argument. Instead of
  using container_of() like other callback subsystems, it forces callbacks
  to do explicit type cast of the unsigned long argument into the required
  object type.

- Buffer overflows can overwrite the .func and the .data field, so
  an attacker can easily overwrite the function and its first argument
  to whatever it wants.

Add a new tasklet initialization API, via DECLARE_TASKLET() and
tasklet_setup(), which will replace the existing ones.

This work is greatly inspired by the timer_struct conversion series,
see commit e99e88a9d2 ("treewide: setup_timer() -> timer_setup()")

To avoid problems with both -Wcast-function-type (which is enabled in
the kernel via -Wextra is several subsystems), and with mismatched
function prototypes when build with Control Flow Integrity enabled,
this adds the "use_callback" member to let the tasklet caller choose
which union member to call through. Once all old API uses are removed,
this and the .data member will be removed as well. (On 64-bit this does
not grow the struct size as the new member fills the hole after atomic_t,
which is also "int" sized.)

Signed-off-by: Romain Perier <romain.perier@gmail.com>
Co-developed-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-30 11:16:01 -07:00
Kees Cook
b13fecb1c3 treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()
This converts all the existing DECLARE_TASKLET() (and ...DISABLED)
macros with DECLARE_TASKLET_OLD() in preparation for refactoring the
tasklet callback type. All existing DECLARE_TASKLET() users had a "0"
data argument, it has been removed here as well.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-30 11:15:58 -07:00
Andrii Nakryiko
1d4e1eab45 bpf: Fix map leak in HASH_OF_MAPS map
Fix HASH_OF_MAPS bug of not putting inner map pointer on bpf_map_elem_update()
operation. This is due to per-cpu extra_elems optimization, which bypassed
free_htab_elem() logic doing proper clean ups. Make sure that inner map is put
properly in optimized case as well.

Fixes: 8c290e60fa ("bpf: fix hashmap extra_elems logic")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200729040913.2815687-1-andriin@fb.com
2020-07-30 01:30:22 +02:00
Linus Torvalds
d3590ebf6f audit/stable-5.8 PR 20200729
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl8hgm0UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPc4xAAxWSkLThFbdC+dWA8cFQvyJhXdcl6
 C3ALyBnx2hyr/MxJ9OcfYDl8TMafKFkXzq4+2vLiZPl/UBSpnr47ralUHl+aAh+I
 cZdV9bF3aSlsb4mIEg3H03xkPBCWfTR+UMzdrYAgqxyeYoZ/VteR1O3yWi80caQK
 vh2UlbuPyiEsz1A21ems88dDw28RkzETNFmBARSh7cPrvGorQNJKYGkMNqsVpUbb
 elx+DCSh4J+QYqByeQUY64L1n7jHGQkTpdZaVA7FhBeAilelL6PIa4qpyHU28VGg
 ZzOWJBkZwYz1lVEhHu1h3Jzv9dwTzzyopJ/YpPZUsvZ+GPuIfYmY+C1InkMvGd4S
 Ytj9WO+rNpvJR8EWUhl1O7J/0HN+dy3MGst9MkJOMea0gsgf9cTgnIEohFawYZRt
 t1pKB2VximglOx2IRVK/2//8u/s8d7c5/5uVY4akS++tbrk5j8uPcO+4wIf/njMM
 WqfUT58M6oY9mQkErewNrZEi2CHBg71GT4hJQ+1qnyrTSe9WfrmA01m/pIUNHzu3
 j1hhZH2KCT5IKF4b5dA2DmssorfVgC1VnAoa0UM9jC+awqSYI83S20d8EF48msIW
 XqEUSURh/bfn3T9Y75YVsNJ6EOvrhsf9TSCb43oNhAXBv0+XgO3bKOpBB6W+UIZ7
 86vGfemi82Rt+Sk=
 =zLU9
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200729' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fixes from Paul Moore:
 "One small audit fix that you can hopefully merge before v5.8 is
  released. Unfortunately it is a revert of a patch that went in during
  the v5.7 window and we just recently started to see some bug reports
  relating to that commit.

  We are working on a proper fix, but I'm not yet clear on when that
  will be ready and we need to fix the v5.7 kernels anyway, so in the
  interest of time a revert seemed like the best solution right now"

* tag 'audit-pr-20200729' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  revert: 1320a4052e ("audit: trigger accompanying records when no rules present")
2020-07-29 12:35:36 -07:00
Willy Tarreau
f227e3ec3b random32: update the net random state on interrupt and activity
This modifies the first 32 bits out of the 128 bits of a random CPU's
net_rand_state on interrupt or CPU activity to complicate remote
observations that could lead to guessing the network RNG's internal
state.

Note that depending on some network devices' interrupt rate moderation
or binding, this re-seeding might happen on every packet or even almost
never.

In addition, with NOHZ some CPUs might not even get timer interrupts,
leaving their local state rarely updated, while they are running
networked processes making use of the random state.  For this reason, we
also perform this update in update_process_times() in order to at least
update the state when there is user or system activity, since it's the
only case we care about.

Reported-by: Amit Klein <aksecurity@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-29 10:35:37 -07:00
Ahmed S. Darwish
af5a06b582 hrtimer: Use sequence counter with associated raw spinlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_raw_spinlock_t data type, which allows to associate
a raw spinlock with the sequence counter. This enables lockdep to verify
that the raw spinlock used for writer serialization is held when the
write side critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-25-a.darwish@linutronix.de
2020-07-29 16:14:29 +02:00
Ahmed S. Darwish
025e82bcbc timekeeping: Use sequence counter with associated raw spinlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_raw_spinlock_t data type, which allows to associate
a raw spinlock with the sequence counter. This enables lockdep to verify
that the raw spinlock used for writer serialization is held when the
write side critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-18-a.darwish@linutronix.de
2020-07-29 16:14:27 +02:00
Ahmed S. Darwish
b75058614f sched: tasks: Use sequence counter with associated spinlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de
2020-07-29 16:14:26 +02:00
Paul Moore
8ac68dc455 revert: 1320a4052e ("audit: trigger accompanying records when no rules present")
Unfortunately the commit listed in the subject line above failed
to ensure that the task's audit_context was properly initialized/set
before enabling the "accompanying records".  Depending on the
situation, the resulting audit_context could have invalid values in
some of it's fields which could cause a kernel panic/oops when the
task/syscall exists and the audit records are generated.

We will revisit the original patch, with the necessary fixes, in a
future kernel but right now we just want to fix the kernel panic
with the least amount of added risk.

Cc: stable@vger.kernel.org
Fixes: 1320a4052e ("audit: trigger accompanying records when no rules present")
Reported-by: j2468h@googlemail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-07-29 10:00:36 -04:00
Hari Bathini
f891f19736 kexec_file: Allow archs to handle special regions while locating memory hole
Some architectures may have special memory regions, within the given
memory range, which can't be used for the buffer in a kexec segment.
Implement weak arch_kexec_locate_mem_hole() definition which arch code
may override, to take care of special regions, while trying to locate
a memory hole.

Also, add the missing declarations for arch overridable functions and
and drop the __weak descriptors in the declarations to avoid non-weak
definitions from becoming weak.

Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Pingfan Liu <piliu@redhat.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Acked-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/159602273603.575379.17665852963340380839.stgit@hbathini
2020-07-29 23:47:53 +10:00
Qais Yousef
13685c4a08 sched/uclamp: Add a new sysctl to control RT default boost value
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.

This is also referred to as 'the default boost value of RT tasks'.

See commit 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks").

On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.

For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.

Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.

They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.

The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.

Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.

To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.

I anticipate this to be mostly in the form of modifying the init script
of a particular device.

To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.

Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.

With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

[1] 804d402fb6: ("sched/rt: Make RT capacity-aware")

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
2020-07-29 13:51:47 +02:00
Qais Yousef
e65855a52b sched/uclamp: Fix a deadlock when enabling uclamp static key
The following splat was caught when setting uclamp value of a task:

  BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

   cpus_read_lock+0x68/0x130
   static_key_enable+0x1c/0x38
   __sched_setscheduler+0x900/0xad8

Fix by ensuring we enable the key outside of the critical section in
__sched_setscheduler()

Fixes: 46609ce227 ("sched/uclamp: Protect uclamp fast path code with static key")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
2020-07-29 13:51:47 +02:00
Peter Zijlstra
4fd5750af0 sched,tracing: Convert to sched_set_fifo()
One module user of sched_setscheduler() was overlooked and is
obviously causing build failures.

Convert ring_buffer_benchmark to use sched_set_fifo_low() when fifo==1
and sched_set_fifo() when fifo==2. This is a bit of an abuse, but it
makes the thing 'work' again.

Specifically, it enables all combinations that were previously
possible:

  producer higher than consumer
  consumer higher than producer

Fixes: 616d91b68c ("sched: Remove sched_setscheduler*() EXPORTs")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20200720214918.GM5523@worktop.programming.kicks-ass.net
2020-07-29 11:43:53 +02:00
Dan Williams
48001ea50d PM, libnvdimm: Add runtime firmware activation support
Abstract platform specific mechanics for nvdimm firmware activation
behind a handful of generic ops. At the bus level ->activate_state()
indicates the unified state (idle, busy, armed) of all DIMMs on the bus,
and ->capability() indicates the system state expectations for activate.
At the DIMM level ->activate_state() indicates the per-DIMM state,
->activate_result() indicates the outcome of the last activation
attempt, and ->arm() attempts to transition the DIMM from 'idle' to
'armed'.

A new hibernate_quiet_exec() facility is added to support firmware
activation in an OS defined system quiesce state. It leverages the fact
that the hibernate-freeze state wants to assert that a memory
hibernation snapshot can be taken. This is in contrast to a platform
firmware defined quiesce state that may forcefully quiet the memory
controller independent of whether an individual device-driver properly
supports hibernate-freeze.

The libnvdimm sysfs interface is extended to support detection of a
firmware activate capability. The mechanism supports enumeration and
triggering of firmware activate, optionally in the
hibernate_quiet_exec() context.

[rafael: hibernate_quiet_exec() proposal]
[vishal: fix up sparse warning, grammar in Documentation/]

Cc: Pavel Machek <pavel@ucw.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Co-developed-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2020-07-28 19:28:32 -06:00
Andrii Nakryiko
310ad7970a bpf: Fix build without CONFIG_NET when using BPF XDP link
Entire net/core subsystem is not built without CONFIG_NET. linux/netdevice.h
just assumes that it's always there, so the easiest way to fix this is to
conditionally compile out bpf_xdp_link_attach() use in bpf/syscall.c.

Fixes: aa8d3a716b ("bpf, xdp: Add bpf_link-based XDP attachment API")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200728190527.110830-1-andriin@fb.com
2020-07-29 00:29:00 +02:00
Paul Menzel
7d8365771f moduleparams: Add hexint type parameter
For bitmasks printing values in hex is more convenient.

Prefix with `0x` to make it clear, that it’s a hex value, and pad it
out.

Using the helper for `amdgpu.ppfeaturemask`, it will look like below.

Before:

    $ more /sys/module/amdgpu/parameters/ppfeaturemask
    4294950911

After:

    $ more /sys/module/amdgpu/parameters/ppfeaturemask
    0xffffbfff

Cc: linux-kernel@vger.kernel.org
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/374726/
2020-07-28 13:44:53 +02:00
Paul Menzel
31ed1b5dff kernel/params.c: Align last argument with a tab
The second and third arguments are aligned with tabs, so do the same for
the fourth.

Cc: linux-kernel@vger.kernel.org
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/patchwork/patch/1267600/
2020-07-28 13:43:56 +02:00
Christoph Hellwig
274b3f7bf3 dma-contiguous: cleanup dma_alloc_contiguous
Split out a cma_alloc_aligned helper to deal with the "interesting"
calling conventions for cma_alloc, which then allows to the main
function to be written straight forward.  This also takes advantage
of the fact that NULL dev arguments have been gone from the DMA API
for a while.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nicolin Chen <nicoleotsuka@gmail.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
2020-07-28 13:42:15 +02:00
Miaohe Lin
21a6ee14a8 sched: Remove duplicated tick_nohz_full_enabled() check
In sched_update_tick_dependency() there's two calls that check
whether nohz_full is enabled: tick_nohz_full_cpu() does it
implicitly, while there's also an explicit call to tick_nohz_full_enabled().

Remove the duplicated, open coded check.

[ mingo: Amended the changelog. ]

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/1595935075-14223-1-git-send-email-linmiaohe@huawei.com
2020-07-28 13:27:54 +02:00
Masami Hiramatsu
112a0e4171 kprobes: Remove unnecessary module_mutex locking from kprobe_optimizer()
Since we already lock both kprobe_mutex and text_mutex in the optimizer,
text will not be changed and the module unloading will be stopped
inside kprobes_module_callback().

The mutex_lock() has originally been introduced to avoid conflict with text modification,
at that point we didn't hold text_mutex.

But after:

  f1c6ece237 ("kprobes: Fix potential deadlock in kprobe_optimizer()")

We started holding the text_mutex and don't need the modules mutex anyway.

So remove the module_mutex locking.

[ mingo: Amended the changelog. ]

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Link: https://lore.kernel.org/r/20200728163400.e00b09c594763349f99ce6cb@kernel.org
2020-07-28 13:19:05 +02:00
Ingo Molnar
e89d4ca1df Linux 5.8-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8d8h4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGd0sH/2iktYhMwPxzzpnb
 eI3OuTX/mRn4vUFOfpx9dmGVleMfKkpbvnn3IY7wA62Qfv7J7lkFRa1Bd1DlqXfW
 yyGTGDSKG5chiRCOU3s9ni92M4xIzFlrojyt/dIK2lUGMzUPI9FGlZRGQLKqqwLh
 2syOXRWbcQ7e52IHtDSy3YBNveKRsP4NyqV+GxGiex18SMB/M3Pw9EMH614eDPsE
 QAGQi5uGv4hPJtFHgXgUyBPLFHIyFAiVxhFRIj7u2DSEKY79+wO1CGWFiFvdTY4B
 CbqKXLffY3iQdFsLJkj9Dl8cnOQnoY44V0EBzhhORxeOp71StUVaRwQMFa5tp48G
 171s5Hs=
 =BQIl
 -----END PGP SIGNATURE-----

Merge tag 'v5.8-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-28 13:18:01 +02:00
Colin Ian King
f6dfbe31e8 bpf: Fix swapped arguments in calls to check_buffer_access
There are a couple of arguments of the boolean flag zero_size_allowed and
the char pointer buf_info when calling to function check_buffer_access that
are swapped by mistake. Fix these by swapping them to correct the argument
ordering.

Fixes: afbf21dce6 ("bpf: Support readonly/readwrite buffers in verifier")
Addresses-Coverity: ("Array compared to 0")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200727175411.155179-1-colin.king@canonical.com
2020-07-28 12:40:10 +02:00
Amir Goldstein
b9a1b97725 fsnotify: create method handle_inode_event() in fsnotify_operations
The method handle_event() grew a lot of complexity due to the design of
fanotify and merging of ignore masks.

Most backends do not care about this complex functionality, so we can hide
this complexity from them.

Introduce a method handle_inode_event() that serves those backends and
passes a single inode mark and less arguments.

This change converts all backends except fanotify and inotify to use the
simplified handle_inode_event() method.  In pricipal, inotify could have
also used the new method, but that would require passing more arguments
on the simple helper (data, data_type, cookie), so we leave it with the
handle_event() method.

Link: https://lore.kernel.org/r/20200722125849.17418-9-amir73il@gmail.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:25:50 +02:00
Amir Goldstein
7dbe608016 audit: do not set FS_EVENT_ON_CHILD in audit marks mask
The audit group marks mask does not contain any events possible on
a child so setting the flag FS_EVENT_ON_CHILD in the mask is counter
productive.

It may lead to the undesired outcome of setting the dentry flag
DCACHE_FSNOTIFY_PARENT_WATCHED on a directory inode even though it is
not watching children, because the audit mark contribute the flag
FS_EVENT_ON_CHILD to the inode's fsnotify_mask and another mark could
be contributing an event that is possible on child to the inode's mask.

Furthermore in the following patches we want to use FS_EVENT_ON_CHILD
for non-dir inodes for other purposes so stop using the flag.

Link: https://lore.kernel.org/r/20200722125849.17418-4-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:16:33 +02:00
Amir Goldstein
82ace1efb3 fsnotify: create helper fsnotify_inode()
Simple helper to consolidate biolerplate code.

Link: https://lore.kernel.org/r/20200722125849.17418-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 23:13:51 +02:00
Al Viro
1e6986c9db regset: kill ->get()
no instances left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:31:12 -04:00
Al Viro
7717cb9bdd regset: new method and helpers for it
->regset_get() takes task+regset+buffer, returns the amount of free space
left in the buffer on success and -E... on error.

buffer is represented as struct membuf - a pair of (kernel) pointer
and amount of space left

Primitives for writing to such:
	* membuf_write(buf, data, size)
	* membuf_zero(buf, size)
	* membuf_store(buf, value)

These are implemented as inlines (in case of membuf_store - a macro).
All writes are sequential; they become no-ops when there's no space
left.  Return value of all primitives is the amount of space left
after the operation, so they can be used as return values of ->regset_get().

Example of use:

// stores pt_regs of task + 64 bytes worth of zeroes + 32bit PID of task
int foo_get(struct task_struct *task, const struct regset *regset,
	    struct membuf to)
{
	membuf_write(&to, task_pt_regs(task), sizeof(struct pt_regs));
	membuf_zero(&to, 64);
	return membuf_store(&to, (u32)task_tgid_vnr(task));
}

regset_get()/regset_get_alloc() taught to use that thing if present.
By the end of the series all users of ->get() will be converted;
then ->get() and ->get_size() can go.

Note that unlike ->get() this thing always starts at offset 0 and,
since it only writes to kernel buffer, can't fail on copyout.
It can, of course, fail for other reasons, but those tend to
be less numerous.

The caller guarantees that the buffer size won't be bigger than
regset->n * regset->size.  That simplifies life for quite a few
instances.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:31:07 -04:00
Al Viro
dc12d7968f copy_regset_to_user(): do all copyout at once.
Turn copy_regset_to_user() into regset_get_alloc() + copy_to_user().
Now all ->get() calls have a kernel buffer as destination.

Note that we'd already eliminated the callers of copy_regset_to_user()
with non-zero offset; now that argument is simply unused.

Uninlined, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:31:07 -04:00
Al Viro
b4e9c9549f introduction of regset ->get() wrappers, switching ELF coredumps to those
Two new helpers: given a process and regset, dump into a buffer.
regset_get() takes a buffer and size, regset_get_alloc() takes size
and allocates a buffer.

Return value in both cases is the amount of data actually dumped in
case of success or -E...  on error.

In both cases the size is capped by regset->n * regset->size, so
->get() is called with offset 0 and size no more than what regset
expects.

binfmt_elf.c callers of ->get() are switched to using those; the other
caller (copy_regset_to_user()) will need some preparations to switch.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-07-27 14:24:50 -04:00
Amir Goldstein
b54cecf5e2 fsnotify: pass dir argument to handle_event() callback
The 'inode' argument to handle_event(), sometimes referred to as
'to_tell' is somewhat obsolete.
It is a remnant from the times when a group could only have an inode mark
associated with an event.

We now pass an iter_info array to the callback, with all marks associated
with an event.

Most backends ignore this argument, with two exceptions:
1. dnotify uses it for sanity check that event is on directory
2. fanotify uses it to report fid of directory on directory entry
   modification events

Remove the 'inode' argument and add a 'dir' argument.
The callback function signature is deliberately changed, because
the meaning of the argument has changed and the arguments have
been documented.

The 'dir' argument is set to when 'file_name' is specified and it is
referring to the directory that the 'file_name' entry belongs to.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-07-27 18:32:47 +02:00
Marc Zyngier
aa251fc5b9 genirq/debugfs: Add missing irqchip flags
Recently introduced irqchip flags lack the corresponding printouts in
debugfs. Add them.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/874kpvydxc.wl-maz@kernel.org
2020-07-27 16:20:40 +02:00
Thomas Gleixner
f0c7baca18 genirq/affinity: Make affinity setting if activated opt-in
John reported that on a RK3288 system the perf per CPU interrupts are all
affine to CPU0 and provided the analysis:

 "It looks like what happens is that because the interrupts are not per-CPU
  in the hardware, armpmu_request_irq() calls irq_force_affinity() while
  the interrupt is deactivated and then request_irq() with IRQF_PERCPU |
  IRQF_NOBALANCING.  

  Now when irq_startup() runs with IRQ_STARTUP_NORMAL, it calls
  irq_setup_affinity() which returns early because IRQF_PERCPU and
  IRQF_NOBALANCING are set, leaving the interrupt on its original CPU."

This was broken by the recent commit which blocked interrupt affinity
setting in hardware before activation of the interrupt. While this works in
general, it does not work for this particular case. As contrary to the
initial analysis not all interrupt chip drivers implement an activate
callback, the safe cure is to make the deferred interrupt affinity setting
at activation time opt-in.

Implement the necessary core logic and make the two irqchip implementations
for which this is required opt-in. In hindsight this would have been the
right thing to do, but ...

Fixes: baedb87d1b ("genirq/affinity: Handle affinity setting on inactive interrupts correctly")
Reported-by: John Keeping <john@metanate.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87blk4tzgm.fsf@nanos.tec.linutronix.de
2020-07-27 16:20:40 +02:00
peterz@infradead.org
ed00495333 locking/lockdep: Fix TRACE_IRQFLAGS vs. NMIs
Prior to commit:

  859d069ee1 ("lockdep: Prepare for NMI IRQ state tracking")

IRQ state tracking was disabled in NMIs due to nmi_enter()
doing lockdep_off() -- with the obvious requirement that NMI entry
call nmi_enter() before trace_hardirqs_off().

[ AFAICT, PowerPC and SH violate this order on their NMI entry ]

However, that commit explicitly changed lockdep_hardirqs_*() to ignore
lockdep_off() and breaks every architecture that has irq-tracing in
it's NMI entry that hasn't been fixed up (x86 being the only fixed one
at this point).

The reason for this change is that by ignoring lockdep_off() we can:

  - get rid of 'current->lockdep_recursion' in lockdep_assert_irqs*()
    which was going to to give header-recursion issues with the
    seqlock rework.

  - allow these lockdep_assert_*() macros to function in NMI context.

Restore the previous state of things and allow an architecture to
opt-in to the NMI IRQ tracking support, however instead of relying on
lockdep_off(), rely on in_nmi(), both are part of nmi_enter() and so
over-all entry ordering doesn't need to change.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200727124852.GK119549@hirez.programming.kicks-ass.net
2020-07-27 15:13:29 +02:00
Greg Kroah-Hartman
eea2c51f81 Linux 5.8-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8d8h4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGd0sH/2iktYhMwPxzzpnb
 eI3OuTX/mRn4vUFOfpx9dmGVleMfKkpbvnn3IY7wA62Qfv7J7lkFRa1Bd1DlqXfW
 yyGTGDSKG5chiRCOU3s9ni92M4xIzFlrojyt/dIK2lUGMzUPI9FGlZRGQLKqqwLh
 2syOXRWbcQ7e52IHtDSy3YBNveKRsP4NyqV+GxGiex18SMB/M3Pw9EMH614eDPsE
 QAGQi5uGv4hPJtFHgXgUyBPLFHIyFAiVxhFRIj7u2DSEKY79+wO1CGWFiFvdTY4B
 CbqKXLffY3iQdFsLJkj9Dl8cnOQnoY44V0EBzhhORxeOp71StUVaRwQMFa5tp48G
 171s5Hs=
 =BQIl
 -----END PGP SIGNATURE-----

Merge 5.8-rc7 into driver-core-next

We want the driver core fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-27 12:39:54 +02:00
Rafael J. Wysocki
80e3036866 Merge back cpufreq material for v5.9. 2020-07-27 12:34:55 +02:00
Greg Kroah-Hartman
65a9bde6ed Linux 5.8-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8d8h4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGd0sH/2iktYhMwPxzzpnb
 eI3OuTX/mRn4vUFOfpx9dmGVleMfKkpbvnn3IY7wA62Qfv7J7lkFRa1Bd1DlqXfW
 yyGTGDSKG5chiRCOU3s9ni92M4xIzFlrojyt/dIK2lUGMzUPI9FGlZRGQLKqqwLh
 2syOXRWbcQ7e52IHtDSy3YBNveKRsP4NyqV+GxGiex18SMB/M3Pw9EMH614eDPsE
 QAGQi5uGv4hPJtFHgXgUyBPLFHIyFAiVxhFRIj7u2DSEKY79+wO1CGWFiFvdTY4B
 CbqKXLffY3iQdFsLJkj9Dl8cnOQnoY44V0EBzhhORxeOp71StUVaRwQMFa5tp48G
 171s5Hs=
 =BQIl
 -----END PGP SIGNATURE-----

Merge 5.8-rc7 into char-misc-next

This should resolve the merge/build issues reported when trying to
create linux-next.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-27 11:49:37 +02:00
John Stultz
8d16f5b979 genirq: Export irq_chip_retrigger_hierarchy and irq_chip_set_vcpu_affinity_parent
Add EXPORT_SYMBOL_GPL entries for irq_chip_retrigger_hierarchy()
and irq_chip_set_vcpu_affinity_parent() so that we can allow
drivers like the qcom-pdc driver to be loadable as a module.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Cc: Andy Gross <agross@kernel.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Maulik Shah <mkshah@codeaurora.org>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-arm-msm@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-gpio@vger.kernel.org
Link: https://lore.kernel.org/r/20200710231824.60699-3-john.stultz@linaro.org
2020-07-27 08:55:03 +01:00
John Stultz
8a667928cf irqdomain: Export irq_domain_update_bus_token
Add export for irq_domain_update_bus_token() so that
we can allow drivers like the qcom-pdc driver to be
loadable as a module.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Cc: Andy Gross <agross@kernel.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Maulik Shah <mkshah@codeaurora.org>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-arm-msm@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Cc: linux-gpio@vger.kernel.org
Link: https://lore.kernel.org/r/20200710231824.60699-2-john.stultz@linaro.org
2020-07-27 08:55:02 +01:00
Zenghui Yu
45e9504f10 genirq/irqdomain: Remove redundant NULL pointer check on fwnode
The is_fwnode_irqchip() helper will check if the fwnode_handle is empty.
There is no need to perform a redundant check outside of it.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200716083905.287-1-yuzenghui@huawei.com
2020-07-27 08:55:02 +01:00
Pavel Machek
7665a47f70
signal: fix typo in dequeue_synchronous_signal()
s/postive/positive/

Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Link: https://lore.kernel.org/r/20200724090531.GA14409@amd
[christian.brauner@ubuntu.com: tweak commit message]
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-26 23:57:52 +02:00
Ingo Molnar
aadfc2f957 entry: Correct 'noinstr' attributes
The noinstr attribute is to be specified before the return type in the
same way 'inline' is used.

Similar cases were recently fixed for x86 in commit 7f6fa101df ("x86:
Correct noinstr qualifiers"), but the generic entry code was based on the
the original version and did not carry the fix over.

Fixes: a5497bab5f ("entry: Provide generic interrupt entry/exit code")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200725091951.744848-3-mingo@kernel.org
2020-07-26 15:42:20 +02:00
Andrii Nakryiko
aa8d3a716b bpf, xdp: Add bpf_link-based XDP attachment API
Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through
BPF_LINK_CREATE command.

bpf_xdp_link is mutually exclusive with direct BPF program attachment,
previous BPF program should be detached prior to attempting to create a new
bpf_xdp_link attachment (for a given XDP mode). Once BPF link is attached, it
can't be replaced by other BPF program attachment or link attachment. It will
be detached only when the last BPF link FD is closed.

bpf_xdp_link will be auto-detached when net_device is shutdown, similarly to
how other BPF links behave (cgroup, flow_dissector). At that point bpf_link
will become defunct, but won't be destroyed until last FD is closed.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-5-andriin@fb.com
2020-07-25 20:37:02 -07:00
Song Liu
2b9b305fcd bpf: Fix build on architectures with special bpf_user_pt_regs_t
Architectures like s390, powerpc, arm64, riscv have speical definition of
bpf_user_pt_regs_t. So we need to cast the pointer before passing it to
bpf_get_stack(). This is similar to bpf_get_stack_tp().

Fixes: 03d42fd2d83f ("bpf: Separate bpf_get_[stack|stackid] for perf events BPF")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200724200503.3629591-1-songliubraving@fb.com
2020-07-25 20:16:36 -07:00
YiFei Zhu
dfcdf0e9ad bpf/local_storage: Fix build without CONFIG_CGROUP
local_storage.o has its compile guard as CONFIG_BPF_SYSCALL, which
does not imply that CONFIG_CGROUP is on. Including cgroup-internal.h
when CONFIG_CGROUP is off cause a compilation failure.

Fixes: f67cfc233706 ("bpf: Make cgroup storages shared between programs on the same cgroup")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200724211753.902969-1-zhuyifei1999@gmail.com
2020-07-25 20:16:36 -07:00
YiFei Zhu
7d9c342789 bpf: Make cgroup storages shared between programs on the same cgroup
This change comes in several parts:

One, the restriction that the CGROUP_STORAGE map can only be used
by one program is removed. This results in the removal of the field
'aux' in struct bpf_cgroup_storage_map, and removal of relevant
code associated with the field, and removal of now-noop functions
bpf_free_cgroup_storage and bpf_cgroup_storage_release.

Second, we permit a key of type u64 as the key to the map.
Providing such a key type indicates that the map should ignore
attach type when comparing map keys. However, for simplicity newly
linked storage will still have the attach type at link time in
its key struct. cgroup_storage_check_btf is adapted to accept
u64 as the type of the key.

Third, because the storages are now shared, the storages cannot
be unconditionally freed on program detach. There could be two
ways to solve this issue:
* A. Reference count the usage of the storages, and free when the
     last program is detached.
* B. Free only when the storage is impossible to be referred to
     again, i.e. when either the cgroup_bpf it is attached to, or
     the map itself, is freed.
Option A has the side effect that, when the user detach and
reattach a program, whether the program gets a fresh storage
depends on whether there is another program attached using that
storage. This could trigger races if the user is multi-threaded,
and since nondeterminism in data races is evil, go with option B.

The both the map and the cgroup_bpf now tracks their associated
storages, and the storage unlink and free are removed from
cgroup_bpf_detach and added to cgroup_bpf_release and
cgroup_storage_map_free. The latter also new holds the cgroup_mutex
to prevent any races with the former.

Fourth, on attach, we reuse the old storage if the key already
exists in the map, via cgroup_storage_lookup. If the storage
does not exist yet, we create a new one, and publish it at the
last step in the attach process. This does not create a race
condition because for the whole attach the cgroup_mutex is held.
We keep track of an array of new storages that was allocated
and if the process fails only the new storages would get freed.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d5401c6106728a00890401190db40020a1f84ff1.1595565795.git.zhuyifei@google.com
2020-07-25 20:16:35 -07:00
Song Liu
5d99cb2c86 bpf: Fail PERF_EVENT_IOC_SET_BPF when bpf_get_[stack|stackid] cannot work
bpf_get_[stack|stackid] on perf_events with precise_ip uses callchain
attached to perf_sample_data. If this callchain is not presented, do not
allow attaching BPF program that calls bpf_get_[stack|stackid] to this
event.

In the error case, -EPROTO is returned so that libbpf can identify this
error and print proper hint message.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723180648.1429892-3-songliubraving@fb.com
2020-07-25 20:16:34 -07:00
Song Liu
7b04d6d60f bpf: Separate bpf_get_[stack|stackid] for perf events BPF
Calling get_perf_callchain() on perf_events from PEBS entries may cause
unwinder errors. To fix this issue, the callchain is fetched early. Such
perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY.

Similarly, calling bpf_get_[stack|stackid] on perf_events from PEBS may
also cause unwinder errors. To fix this, add separate version of these
two helpers, bpf_get_[stack|stackid]_pe. These two hepers use callchain in
bpf_perf_event_data_kern->data->callchain.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com
2020-07-25 20:16:34 -07:00
Yonghong Song
d3cc2ab546 bpf: Implement bpf iterator for array maps
The bpf iterators for array and percpu array
are implemented. Similar to hash maps, for percpu
array map, bpf program will receive values
from all cpus.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184115.590532-1-yhs@fb.com
2020-07-25 20:16:33 -07:00
Yonghong Song
d6c4503cc2 bpf: Implement bpf iterator for hash maps
The bpf iterators for hash, percpu hash, lru hash
and lru percpu hash are implemented. During link time,
bpf_iter_reg->check_target() will check map type
and ensure the program access key/value region is
within the map defined key/value size limit.

For percpu hash and lru hash maps, the bpf program
will receive values for all cpus. The map element
bpf iterator infrastructure will prepare value
properly before passing the value pointer to the
bpf program.

This patch set supports readonly map keys and
read/write map values. It does not support deleting
map elements, e.g., from hash tables. If there is
a user case for this, the following mechanism can
be used to support map deletion for hashtab, etc.
  - permit a new bpf program return value, e.g., 2,
    to let bpf iterator know the map element should
    be removed.
  - since bucket lock is taken, the map element will be
    queued.
  - once bucket lock is released after all elements under
    this bucket are traversed, all to-be-deleted map
    elements can be deleted.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184114.590470-1-yhs@fb.com
2020-07-25 20:16:33 -07:00
Yonghong Song
a5cbe05a66 bpf: Implement bpf iterator for map elements
The bpf iterator for map elements are implemented.
The bpf program will receive four parameters:
  bpf_iter_meta *meta: the meta data
  bpf_map *map:        the bpf_map whose elements are traversed
  void *key:           the key of one element
  void *value:         the value of the same element

Here, meta and map pointers are always valid, and
key has register type PTR_TO_RDONLY_BUF_OR_NULL and
value has register type PTR_TO_RDWR_BUF_OR_NULL.
The kernel will track the access range of key and value
during verification time. Later, these values will be compared
against the values in the actual map to ensure all accesses
are within range.

A new field iter_seq_info is added to bpf_map_ops which
is used to add map type specific information, i.e., seq_ops,
init/fini seq_file func and seq_file private data size.
Subsequent patches will have actual implementation
for bpf_map_ops->iter_seq_info.

In user space, BPF_ITER_LINK_MAP_FD needs to be
specified in prog attr->link_create.flags, which indicates
that attr->link_create.target_fd is a map_fd.
The reason for such an explicit flag is for possible
future cases where one bpf iterator may allow more than
one possible customization, e.g., pid and cgroup id for
task_file.

Current kernel internal implementation only allows
the target to register at most one required bpf_iter_link_info.
To support the above case, optional bpf_iter_link_info's
are needed, the target can be extended to register such link
infos, and user provided link_info needs to match one of
target supported ones.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184112.590360-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
Yonghong Song
afbf21dce6 bpf: Support readonly/readwrite buffers in verifier
Readonly and readwrite buffer register states
are introduced. Totally four states,
PTR_TO_RDONLY_BUF[_OR_NULL] and PTR_TO_RDWR_BUF[_OR_NULL]
are supported. As suggested by their respective
names, PTR_TO_RDONLY_BUF[_OR_NULL] are for
readonly buffers and PTR_TO_RDWR_BUF[_OR_NULL]
for read/write buffers.

These new register states will be used
by later bpf map element iterator.

New register states share some similarity to
PTR_TO_TP_BUFFER as it will calculate accessed buffer
size during verification time. The accessed buffer
size will be later compared to other metrics during
later attach/link_create time.

Similar to reg_state PTR_TO_BTF_ID_OR_NULL in bpf
iterator programs, PTR_TO_RDONLY_BUF_OR_NULL or
PTR_TO_RDWR_BUF_OR_NULL reg_types can be set at
prog->aux->bpf_ctx_arg_aux, and bpf verifier will
retrieve the values during btf_ctx_access().
Later bpf map element iterator implementation
will show how such information will be assigned
during target registeration time.

The verifier is also enhanced such that PTR_TO_RDONLY_BUF
can be passed to ARG_PTR_TO_MEM[_OR_NULL] helper argument, and
PTR_TO_RDWR_BUF can be passed to ARG_PTR_TO_MEM[_OR_NULL] or
ARG_PTR_TO_UNINIT_MEM.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184111.590274-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
Yonghong Song
f9c7927295 bpf: Refactor to provide aux info to bpf_iter_init_seq_priv_t
This patch refactored target bpf_iter_init_seq_priv_t callback
function to accept additional information. This will be needed
in later patches for map element targets since a particular
map should be passed to traverse elements for that particular
map. In the future, other information may be passed to target
as well, e.g., pid, cgroup id, etc. to customize the iterator.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
Yonghong Song
14fc6bd6b7 bpf: Refactor bpf_iter_reg to have separate seq_info member
There is no functionality change for this patch.
Struct bpf_iter_reg is used to register a bpf_iter target,
which includes information for both prog_load, link_create
and seq_file creation.

This patch puts fields related seq_file creation into
a different structure. This will be useful for map
elements iterator where one iterator covers different
map types and different map types may have different
seq_ops, init/fini private_data function and
private_data size.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184109.590030-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
Alexei Starovoitov
a228a64fc1 bpf: Add bpf_prog iterator
It's mostly a copy paste of commit 6086d29def ("bpf: Add bpf_map iterator")
that is use to implement bpf_seq_file opreations to traverse all bpf programs.

v1->v2: Tweak to use build time btf_id

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
2020-07-25 20:16:32 -07:00
Yonghong Song
3f9969f2c0 bpf: Fix pos computation for bpf_iter seq_ops->start()
Currently, the pos pointer in bpf iterator map/task/task_file
seq_ops->start() is always incremented.
This is incorrect. It should be increased only if
*pos is 0 (for SEQ_START_TOKEN) since these start()
function actually returns the first real object.
If *pos is not 0, it merely found the object
based on the state in seq->private, and not really
advancing the *pos. This patch fixed this issue
by only incrementing *pos if it is 0.

Note that the old *pos calculation, although not
correct, does not affect correctness of bpf_iter
as bpf_iter seq_file->read() does not support llseek.

This patch also renamed "mid" in bpf_map iterator
seq_file private data to "map_id" for better clarity.

Fixes: 6086d29def ("bpf: Add bpf_map iterator")
Fixes: eaaacd2391 ("bpf: Add task and task/file iterator targets")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722195156.4029817-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
David S. Miller
a57066b1a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The UDP reuseport conflict was a little bit tricky.

The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.

At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.

This requires moving the reuseport_has_conns() logic into the callers.

While we are here, get rid of inline directives as they do not belong
in foo.c files.

The other changes were cases of more straightforward overlapping
modifications.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25 17:49:04 -07:00
Linus Torvalds
78b1afe22d Fix a interaction/regression between uprobes based shared library tracing & GDB.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8cHqwRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iyTxAAghbrqeeT6V4lj8xfBif5BOvexEPDaur5
 2OPXRMLg4YWYCCC8Wwdj43Nnz3IuG5A75acyXkxmqLgvIu4QE7EAm8CYuyy0LWZa
 NK7A8v5fbDOKJJL/WdSZ/xoFP27KLgE4u0gaeBBOp9Chib/bWo6h63jxsOpbQybE
 u5u6x4uAm90RKRocYnfHaMXxI8kDVnRRdfwmVjAcy3dbu3uth02vdfqlmXrmPh1A
 kpqH/EnIlw9DrJCGOdDJPCmCTjcj5xYyy3qA7C6DkY3X+oe4AYtMWnfCPOgV0k7v
 xGq0j9jIa0kynHQsZxjX28CObuumMrh6ZbI7/UfE8W8yccpUQKkzTWKhqzb/9Wc7
 E5SivCiykAhUPCbhcwZIihW6nwjWSF54Z1UCxi/PA1RMRTe89e96881Xfgh/huW7
 TCHnK9lRyWAP+ceJcj72BEobRElfsHyeTbd0xwS/a+LBCftxp/r8Fm1DHrrBaGtL
 WE/lHXxgRR2b/ieAVhlEUNwd7tn4YBjcad8WstZTYIZHMxtmPvUKjaDhMuWNclIH
 QtjRYy1vA0gAf2Lbs3EP52sTZXyxlksftkwlca0z8VF5SezDER0R4V4EgQqIa8zC
 mLr7jmnKqEeOp6vkdhTSnDHYNHuVw4OpGCY+28vh17b/z6MFNokAq5Bd/fV07oKg
 nassUOSteTw=
 =eCSv
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master

Pull uprobe fix from Ingo Molnar:
 "Fix an interaction/regression between uprobes based shared library
  tracing & GDB"

* tag 'perf-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix GDB regression
2020-07-25 13:55:38 -07:00
Ingo Molnar
c84d53051f Linux 5.8-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8UzA4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGQ7cH/3v+Gv+SmHJCvaT2
 CSu0+7okVnYbY3UTb3hykk7/aOqb6284KjxR03r0CWFzsEsZVhC5pvvruASSiMQg
 Pi04sLqv6CsGLHd1n+pl4AUYEaxq6k4KS3uU3HHSWxrahDDApQoRUx2F8lpOxyj8
 RiwnoO60IMPA7IFJqzcZuFqsgdxqiiYvnzT461KX8Mrw6fyMXeR2KAj2NwMX8dZN
 At21Sf8+LSoh6q2HnugfiUd/jR10XbfxIIx2lXgIinb15GXgWydEQVrDJ7cUV7ix
 Jd0S+dtOtp+lWtFHDoyjjqqsMV7+G8i/rFNZoxSkyZqsUTaKzaR6JD3moSyoYZgG
 0+eXO4A=
 =9EpR
 -----END PGP SIGNATURE-----

Merge tag 'v5.8-rc6' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-07-25 21:49:36 +02:00
Chris Wilson
a7ef9b28aa locking/lockdep: Fix overflow in presentation of average lock-time
Though the number of lock-acquisitions is tracked as unsigned long, this
is passed as the divisor to div_s64() which interprets it as a s32,
giving nonsense values with more than 2 billion acquisitons. E.g.

  acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
  -------------------------------------------------------------------------
    2350439395           0.07         353.38   649647067.36          0.-32

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20200725185110.11588-1-chris@chris-wilson.co.uk
2020-07-25 21:47:42 +02:00
Qinglang Miao
13efa61612 sched/uclamp: Remove unnecessary mutex_init()
The uclamp_mutex lock is initialized statically via DEFINE_MUTEX(),
it is unnecessary to initialize it runtime via mutex_init().

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20200725085629.98292-1-miaoqinglang@huawei.com
2020-07-25 12:10:36 +02:00
Jim Cromie
e5ebffe18e dyndbg: rename __verbose section to __dyndbg
dyndbg populates its callsite info into __verbose section, change that
to a more specific and descriptive name, __dyndbg.

Also, per checkpatch:
  simplify __attribute(..) to __section(__dyndbg) declaration.

and 1 spelling fix, decriptor

Acked-by: <jbaron@akamai.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20200719231058.1586423-6-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-24 17:00:08 +02:00
Oleg Nesterov
fe5ed7ab99 uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix GDB regression
If a tracee is uprobed and it hits int3 inserted by debugger, handle_swbp()
does send_sig(SIGTRAP, current, 0) which means si_code == SI_USER. This used
to work when this code was written, but then GDB started to validate si_code
and now it simply can't use breakpoints if the tracee has an active uprobe:

	# cat test.c
	void unused_func(void)
	{
	}
	int main(void)
	{
		return 0;
	}

	# gcc -g test.c -o test
	# perf probe -x ./test -a unused_func
	# perf record -e probe_test:unused_func gdb ./test -ex run
	GNU gdb (GDB) 10.0.50.20200714-git
	...
	Program received signal SIGTRAP, Trace/breakpoint trap.
	0x00007ffff7ddf909 in dl_main () from /lib64/ld-linux-x86-64.so.2
	(gdb)

The tracee hits the internal breakpoint inserted by GDB to monitor shared
library events but GDB misinterprets this SIGTRAP and reports a signal.

Change handle_swbp() to use force_sig(SIGTRAP), this matches do_int3_user()
and fixes the problem.

This is the minimal fix for -stable, arch/x86/kernel/uprobes.c is equally
wrong; it should use send_sigtrap(TRAP_TRACE) instead of send_sig(SIGTRAP),
but this doesn't confuse GDB and needs another x86-specific patch.

Reported-by: Aaron Merey <amerey@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200723154420.GA32043@redhat.com
2020-07-24 15:38:37 +02:00
Thomas Gleixner
935ace2fb5 entry: Provide infrastructure for work before transitioning to guest mode
Entering a guest is similar to exiting to user space. Pending work like
handling signals, rescheduling, task work etc. needs to be handled before
that.

Provide generic infrastructure to avoid duplication of the same handling
code all over the place.

The transfer to guest mode handling is different from the exit to usermode
handling, e.g. vs. rseq and live patching, so a separate function is used.

The initial list of work items handled is:

    TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME

Architecture specific TIF flags can be added via defines in the
architecture specific include files.

The calling convention is also different from the syscall/interrupt entry
functions as KVM invokes this from the outer vcpu_run() loop with
interrupts and preemption enabled. To prevent missing a pending work item
it invokes a check for pending TIF work from interrupt disabled code right
before transitioning to guest mode. The lockdep, RCU and tracing state
handling is also done directly around the switch to and from guest mode.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220519.833296398@linutronix.de
2020-07-24 15:03:42 +02:00
Thomas Gleixner
a5497bab5f entry: Provide generic interrupt entry/exit code
Like the syscall entry/exit code interrupt/exception entry after the real
low level ASM bits should not be different accross architectures.

Provide a generic version based on the x86 code.

irqentry_enter() is called after the low level entry code and
irqentry_exit() must be invoked right before returning to the low level
code which just contains the actual return logic. The code before
irqentry_enter() and irqentry_exit() must not be instrumented. Code after
irqentry_enter() and before irqentry_exit() can be instrumented.

irqentry_enter() invokes irqentry_enter_from_user_mode() if the
interrupt/exception came from user mode. If if entered from kernel mode it
handles the kernel mode variant of establishing state for lockdep, RCU and
tracing depending on the kernel context it interrupted (idle, non-idle).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220519.723703209@linutronix.de
2020-07-24 14:59:04 +02:00
Thomas Gleixner
a9f3a74a29 entry: Provide generic syscall exit function
Like syscall entry all architectures have similar and pointlessly different
code to handle pending work before returning from a syscall to user space.

  1) One-time syscall exit work:
      - rseq syscall exit
      - audit
      - syscall tracing
      - tracehook (single stepping)

  2) Preparatory work
      - Exit to user mode loop (common TIF handling).
      - Architecture specific one time work arch_exit_to_user_mode_prepare()
      - Address limit and lockdep checks
     
  3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
     arch_exit_to_user_mode() to handle e.g. speculation mitigations

Provide a generic version based on the x86 code which has all the RCU and
instrumentation protections right.

Provide a variant for interrupt return to user mode as well which shares
the above #2 and #3 work items.

After syscall_exit_to_user_mode() and irqentry_exit_to_user_mode() the
architecture code just has to return to user space. The code after
returning from these functions must not be instrumented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.613977173@linutronix.de
2020-07-24 14:59:03 +02:00
Thomas Gleixner
142781e108 entry: Provide generic syscall entry functionality
On syscall entry certain work needs to be done:

   - Establish state (lockdep, context tracking, tracing)
   - Conditional work (ptrace, seccomp, audit...)

This code is needlessly duplicated and  different in all
architectures.

Provide a generic version based on the x86 implementation which has all the
RCU and instrumentation bits right.

As interrupt/exception entry from user space needs parts of the same
functionality, provide a function for this as well.

syscall_enter_from_user_mode() and irqentry_enter_from_user_mode() must be
called right after the low level ASM entry. The calling code must be
non-instrumentable. After the functions returns state is correct and the
subsequent functions can be instrumented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.513463269@linutronix.de
2020-07-24 14:59:03 +02:00
Chris Wilson
062d3f95b6 sched: Warn if garbage is passed to default_wake_function()
Since the default_wake_function() passes its flags onto
try_to_wake_up(), warn if those flags collide with internal values.

Given that the supplied flags are garbage, no repair can be done but at
least alert the user to the damage they are causing.

In the belief that these errors should be picked up during testing, the
warning is only compiled in under CONFIG_SCHED_DEBUG.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20200723201042.18861-1-chris@chris-wilson.co.uk
2020-07-24 14:40:25 +02:00
Frederic Weisbecker
31cd0e119d timers: Recalculate next timer interrupt only when necessary
The nohz tick code recalculates the timer wheel's next expiry on each idle
loop iteration.

On the other hand, the base next expiry is now always cached and updated
upon timer enqueue and execution. Only timer dequeue may leave
base->next_expiry out of date (but then its stale value won't ever go past
the actual next expiry to be recalculated).

Since recalculating the next_expiry isn't a free operation, especially when
the last wheel level is reached to find out that no timer has been enqueued
at all, reuse the next expiry cache when it is known to be reliable, which
it is most of the time.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200723151641.12236-1-frederic@kernel.org
2020-07-24 12:49:40 +02:00
Peter Enderborg
072e133d55 tracefs: Remove unnecessary debug_fs checks.
This is a preparation for debugfs restricted mode.
We don't need debugfs to trace, the removed check stop tracefs to work
if debugfs is not initialised. We instead tries to automount within
debugfs and relay on it's handling. The code path is to create a
backward compatibility from when tracefs was part of debugfs, it is now
standalone and does not need debugfs. When debugfs is in restricted
it is compiled in but not active and return EPERM to clients and
tracefs wont work if it assumes it is active it is compiled in
kernel.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Enderborg <peter.enderborg@sony.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20200716071511.26864-2-peter.enderborg@sony.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-23 17:10:25 +02:00
Daniel Jordan
f601c725a6 padata: remove padata_parallel_queue
Only its reorder field is actually used now, so remove the struct and
embed @reorder directly in parallel_data.

No functional change, just a cleanup.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-23 17:34:18 +10:00
Daniel Jordan
3f257191d3 padata: fold padata_alloc_possible() into padata_alloc()
There's no reason to have two interfaces when there's only one caller.
Removing _possible saves text and simplifies future changes.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-23 17:34:18 +10:00
Daniel Jordan
d69e037bcc padata: remove effective cpumasks from the instance
A padata instance has effective cpumasks that store the user-supplied
masks ANDed with the online mask, but this middleman is unnecessary.
parallel_data keeps the same information around.  Removing this saves
text and code churn in future changes.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-23 17:34:18 +10:00
Daniel Jordan
cec00e6e12 padata: inline single call of pd_setup_cpumasks()
pd_setup_cpumasks() has only one caller.  Move its contents inline to
prepare for the next cleanup.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-23 17:34:18 +10:00
Daniel Jordan
350ef051d4 padata: remove stop function
padata_stop() has two callers and is unnecessary in both cases.  When
pcrypt calls it before padata_free(), it's being unloaded so there are
no outstanding padata jobs[0].  When __padata_free() calls it, it's
either along the same path or else pcrypt initialization failed, which
of course means there are also no outstanding jobs.

Removing it simplifies padata and saves text.

[0] https://lore.kernel.org/linux-crypto/20191119225017.mjrak2fwa5vccazl@gondor.apana.org.au/

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-23 17:34:18 +10:00
Daniel Jordan
bd25b4886d padata: remove start function
padata_start() is only used right after pcrypt allocates an instance
with all possible CPUs, when PADATA_INVALID can't happen, so there's no
need for a separate "start" step.  It can be done during allocation to
save text, make using padata easier, and avoid unneeded calls in the
future.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-07-23 17:34:17 +10:00
Valentin Schneider
25980c7a79 arch_topology, sched/core: Cleanup thermal pressure definition
The following commit:

  14533a16c4 ("thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code")

moved the definition of arch_set_thermal_pressure() to sched/core.c, but
kept its declaration in linux/arch_topology.h. When building e.g. an x86
kernel with CONFIG_SCHED_THERMAL_PRESSURE=y, cpufreq_cooling.c ends up
getting the declaration of arch_set_thermal_pressure() from
include/linux/arch_topology.h, which is somewhat awkward.

On top of this, sched/core.c unconditionally defines
o The thermal_pressure percpu variable
o arch_set_thermal_pressure()

while arch_scale_thermal_pressure() does nothing unless redefined by the
architecture.

arch_*() functions are meant to be defined by architectures, so revert the
aforementioned commit and re-implement it in a way that keeps
arch_set_thermal_pressure() architecture-definable, and doesn't define the
thermal pressure percpu variable for kernels that don't need
it (CONFIG_SCHED_THERMAL_PRESSURE=n).

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200712165917.9168-2-valentin.schneider@arm.com
2020-07-22 10:22:05 +02:00
Muchun Song
589343569d smp: Fix a potential usage of stale nr_cpus
The get_option() maybe return 0, it means that the nr_cpus is
not initialized. Then we will use the stale nr_cpus to initialize
the nr_cpu_ids. So fix it.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716070457.53255-1-songmuchun@bytedance.com
2020-07-22 10:22:04 +02:00
Peter Puhov
3edecfef02 sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
In slow path, when selecting idlest group, if both groups have type
group_has_spare, only idle_cpus count gets compared.
As a result, if multiple tasks are created in a tight loop,
and go back to sleep immediately
(while waiting for all tasks to be created),
they may be scheduled on the same core, because CPU is back to idle
when the new fork happen.

For example:
sudo perf record -e sched:sched_wakeup_new -- \
                                  sysbench threads --threads=4 run
...
    total number of events:              61582
...
sudo perf script
sysbench 129378 [006] 74586.633466: sched:sched_wakeup_new:
                            sysbench:129380 [120] success=1 CPU:007
sysbench 129378 [006] 74586.634718: sched:sched_wakeup_new:
                            sysbench:129381 [120] success=1 CPU:007
sysbench 129378 [006] 74586.635957: sched:sched_wakeup_new:
                            sysbench:129382 [120] success=1 CPU:007
sysbench 129378 [006] 74586.637183: sched:sched_wakeup_new:
                            sysbench:129383 [120] success=1 CPU:007

This may have negative impact on performance for workloads with frequent
creation of multiple threads.

In this patch we are using group_util to select idlest group if both groups
have equal number of idle_cpus. Comparing the number of idle cpu is
not enough in this case, because the newly forked thread sleeps
immediately and before we select the cpu for the next one.
This is shown in the trace where the same CPU7 is selected for
all wakeup_new events.
That's why, looking at utilization when there is the same number of
CPU is a good way to see where the previous task was placed. Using
nr_running doesn't solve the problem because the newly forked task is not
running and the cpu would not have been idle in this case and an idle
CPU would have been selected instead.

With this patch newly created tasks would be better distributed.

With this patch:
sudo perf record -e sched:sched_wakeup_new -- \
                                    sysbench threads --threads=4 run
...
    total number of events:              74401
...
sudo perf script
sysbench 129455 [006] 75232.853257: sched:sched_wakeup_new:
                            sysbench:129457 [120] success=1 CPU:008
sysbench 129455 [006] 75232.854489: sched:sched_wakeup_new:
                            sysbench:129458 [120] success=1 CPU:009
sysbench 129455 [006] 75232.855732: sched:sched_wakeup_new:
                            sysbench:129459 [120] success=1 CPU:010
sysbench 129455 [006] 75232.856980: sched:sched_wakeup_new:
                            sysbench:129460 [120] success=1 CPU:011

We tested this patch with following benchmarks:
master: 'commit b3a9e3b962 ("Linux 5.8-rc1")'

100 iterations of: perf bench -f simple futex wake -s -t 128 -w 1
Lower result is better
|         |   BASELINE |   +PATCH |   DELTA (%) |
|---------|------------|----------|-------------|
| mean    |      0.33  |    0.313 |      +5.152 |
| std (%) |     10.433 |    7.563 |             |

100 iterations of: sysbench threads --threads=8 run
Higher result is better
|         |   BASELINE |   +PATCH |   DELTA (%) |
|---------|------------|----------|-------------|
| mean    |   5235.02  | 5863.73  |      +12.01 |
| std (%) |      8.166 |   10.265 |             |

100 iterations of: sysbench mutex --mutex-num=1 --threads=8 run
Lower result is better
|         |   BASELINE |   +PATCH |   DELTA (%) |
|---------|------------|----------|-------------|
| mean    |      0.413 |    0.404 |      +2.179 |
| std (%) |      3.791 |    1.816 |             |

Signed-off-by: Peter Puhov <peter.puhov@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200714125941.4174-1-peter.puhov@linaro.org
2020-07-22 10:22:04 +02:00
Paul Gortmaker
46132e3ac5 sched: nohz: stop passing around unused "ticks" parameter.
The "ticks" parameter was added in commit 0f004f5a69 ("sched: Cure more
NO_HZ load average woes") since calc_global_nohz() was called and needed
the "ticks" argument.

But in commit c308b56b53 ("sched: Fix nohz load accounting -- again!")
it became unused as the function calc_global_nohz() dropped using "ticks".

Fixes: c308b56b53 ("sched: Fix nohz load accounting -- again!")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593628458-32290-1-git-send-email-paul.gortmaker@windriver.com
2020-07-22 10:22:04 +02:00
Peter Zijlstra
58877d347b sched: Better document ttwu()
Dave hit the problem fixed by commit:

  b6e13e8582 ("sched/core: Fix ttwu() race")

and failed to understand much of the code involved. Per his request a
few comments to (hopefully) clarify things.

Requested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200702125211.GQ4800@hirez.programming.kicks-ass.net
2020-07-22 10:22:03 +02:00
Peter Zijlstra
015dc08918 Merge branch 'sched/urgent' 2020-07-22 10:22:02 +02:00
Peter Zijlstra
d136122f58 sched: Fix race against ptrace_freeze_trace()
There is apparently one site that violates the rule that only current
and ttwu() will modify task->state, namely ptrace_{,un}freeze_traced()
will change task->state for a remote task.

Oleg explains:

  "TASK_TRACED/TASK_STOPPED was always protected by siglock. In
particular, ttwu(__TASK_TRACED) must be always called with siglock
held. That is why ptrace_freeze_traced() assumes it can safely do
s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock)."

This breaks the ordering scheme introduced by commit:

  dbfb089d36 ("sched: Fix loadavg accounting race")

Specifically, the reload not matching no longer implies we don't have
to block.

Simply things by noting that what we need is a LOAD->STORE ordering
and this can be provided by a control dependency.

So replace:

	prev_state = prev->state;
	raw_spin_lock(&rq->lock);
	smp_mb__after_spinlock(); /* SMP-MB */
	if (... && prev_state && prev_state == prev->state)
		deactivate_task();

with:

	prev_state = prev->state;
	if (... && prev_state) /* CTRL-DEP */
		deactivate_task();

Since that already implies the 'prev->state' load must be complete
before allowing the 'prev->on_rq = 0' store to become visible.

Fixes: dbfb089d36 ("sched: Fix loadavg accounting race")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-22 10:22:00 +02:00
Yonghong Song
951cf368bc bpf: net: Use precomputed btf_id for bpf iterators
One additional field btf_id is added to struct
bpf_ctx_arg_aux to store the precomputed btf_ids.
The btf_id is computed at build time with
BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
All existing bpf iterators are changed to used
pre-compute btf_ids.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com
2020-07-21 13:26:26 -07:00
Yonghong Song
bc4f0548f6 bpf: Compute bpf_skc_to_*() helper socket btf ids at build time
Currently, socket types (struct tcp_sock, udp_sock, etc.)
used by bpf_skc_to_*() helpers are computed when vmlinux_btf
is first built in the kernel.

Commit 5a2798ab32
("bpf: Add BTF_ID_LIST/BTF_ID/BTF_ID_UNUSED macros")
implemented a mechanism to compute btf_ids at kernel build
time which can simplify kernel implementation and reduce
runtime overhead by removing in-kernel btf_id calculation.
This patch did exactly this, removing in-kernel btf_id
computation and utilizing build-time btf_id computation.

If CONFIG_DEBUG_INFO_BTF is not defined, BTF_ID_LIST will
define an array with size of 5, which is not enough for
btf_sock_ids. So define its own static array if
CONFIG_DEBUG_INFO_BTF is not defined.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163358.1393023-1-yhs@fb.com
2020-07-21 13:26:26 -07:00
Lorenzo Bianconi
c576b9c77b bpf: cpumap: Fix possible rcpu kthread hung
Fix the following cpumap kthread hung. The issue is currently occurring
when __cpu_map_load_bpf_program fails (e.g if the bpf prog has not
BPF_XDP_CPUMAP as expected_attach_type)

$./test_progs -n 101
101/1 cpumap_with_progs:OK
101 xdp_cpumap_attach:OK
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
[  369.996478] INFO: task cpumap/0/map:7:205 blocked for more than 122 seconds.
[  369.998463]       Not tainted 5.8.0-rc4-01472-ge57892f50a07 #212
[  370.000102] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  370.001918] cpumap/0/map:7  D    0   205      2 0x00004000
[  370.003228] Call Trace:
[  370.003930]  __schedule+0x5c7/0xf50
[  370.004901]  ? io_schedule_timeout+0xb0/0xb0
[  370.005934]  ? static_obj+0x31/0x80
[  370.006788]  ? mark_held_locks+0x24/0x90
[  370.007752]  ? cpu_map_bpf_prog_run_xdp+0x6c0/0x6c0
[  370.008930]  schedule+0x6f/0x160
[  370.009728]  schedule_preempt_disabled+0x14/0x20
[  370.010829]  kthread+0x17b/0x240
[  370.011433]  ? kthread_create_worker_on_cpu+0xd0/0xd0
[  370.011944]  ret_from_fork+0x1f/0x30
[  370.012348]
               Showing all locks held in the system:
[  370.013025] 1 lock held by khungtaskd/33:
[  370.013432]  #0: ffffffff82b24720 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x28/0x1c3

[  370.014461] =============================================

Fixes: 9216477449 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/e54f2aabf959f298939e5507b09c48f8c2e380be.1595170625.git.lorenzo@kernel.org
2020-07-21 09:15:28 -07:00
Jakub Sitnicki
343ead287d bpf, netns: Fix build without CONFIG_INET
When CONFIG_NET is set but CONFIG_INET isn't, build fails with:

  ld: kernel/bpf/net_namespace.o: in function `netns_bpf_attach_type_unneed':
  kernel/bpf/net_namespace.c:32: undefined reference to `bpf_sk_lookup_enabled'
  ld: kernel/bpf/net_namespace.o: in function `netns_bpf_attach_type_need':
  kernel/bpf/net_namespace.c:43: undefined reference to `bpf_sk_lookup_enabled'

This is because without CONFIG_INET bpf_sk_lookup_enabled symbol is not
available. Wrap references to bpf_sk_lookup_enabled with preprocessor
conditionals.

Fixes: 1559b4aa1d ("inet: Run SK_LOOKUP BPF program on socket lookup")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/bpf/20200721100716.720477-1-jakub@cloudflare.com
2020-07-21 09:12:34 -07:00
Max Englander
b43870c74f audit: report audit wait metric in audit status reply
In environments where the preservation of audit events and predictable
usage of system memory are prioritized, admins may use a combination of
--backlog_wait_time and -b options at the risk of degraded performance
resulting from backlog waiting. In some cases, this risk may be
preferred to lost events or unbounded memory usage. Ideally, this risk
can be mitigated by making adjustments when backlog waiting is detected.

However, detection can be difficult using the currently available
metrics. For example, an admin attempting to debug degraded performance
may falsely believe a full backlog indicates backlog waiting. It may
turn out the backlog frequently fills up but drains quickly.

To make it easier to reliably track degraded performance to backlog
waiting, this patch makes the following changes:

Add a new field backlog_wait_time_total to the audit status reply.
Initialize this field to zero. Add to this field the total time spent
by the current task on scheduled timeouts while the backlog limit is
exceeded. Reset field to zero upon request via AUDIT_SET.

Tested on Ubuntu 18.04 using complementary changes to the
audit-userspace and audit-testsuite:
- https://github.com/linux-audit/audit-userspace/pull/134
- https://github.com/linux-audit/audit-testsuite/pull/97

Signed-off-by: Max Englander <max.englander@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-07-21 11:21:44 -04:00
Richard Guy Briggs
f1d9b23cab audit: purge audit_log_string from the intra-kernel audit API
audit_log_string() was inteded to be an internal audit function and
since there are only two internal uses, remove them.  Purge all external
uses of it by restructuring code to use an existing audit_log_format()
or using audit_log_format().

Please see the upstream issue
https://github.com/linux-audit/audit-kernel/issues/84

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-07-21 11:12:31 -04:00
Eric W. Biederman
be619f7f06 exec: Implement kernel_execve
To allow the kernel not to play games with set_fs to call exec
implement kernel_execve.  The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.

The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.

The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.

The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances.  In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.

Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87wo365ikj.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-21 08:24:52 -05:00
Tyler Hicks
4834177e63 ima: Support additional conditionals in the KEXEC_CMDLINE hook function
Take the properties of the kexec kernel's inode and the current task
ownership into consideration when matching a KEXEC_CMDLINE operation to
the rules in the IMA policy. This allows for some uniformity when
writing IMA policy rules for KEXEC_KERNEL_CHECK, KEXEC_INITRAMFS_CHECK,
and KEXEC_CMDLINE operations.

Prior to this patch, it was not possible to write a set of rules like
this:

 dont_measure func=KEXEC_KERNEL_CHECK obj_type=foo_t
 dont_measure func=KEXEC_INITRAMFS_CHECK obj_type=foo_t
 dont_measure func=KEXEC_CMDLINE obj_type=foo_t
 measure func=KEXEC_KERNEL_CHECK
 measure func=KEXEC_INITRAMFS_CHECK
 measure func=KEXEC_CMDLINE

The inode information associated with the kernel being loaded by a
kexec_kernel_load(2) syscall can now be included in the decision to
measure or not

Additonally, the uid, euid, and subj_* conditionals can also now be
used in KEXEC_CMDLINE rules. There was no technical reason as to why
those conditionals weren't being considered previously other than
ima_match_rules() didn't have a valid inode to use so it immediately
bailed out for KEXEC_CMDLINE operations rather than going through the
full list of conditional comparisons.

Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: kexec@lists.infradead.org
Reviewed-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2020-07-20 13:28:16 -04:00
Ahmed S. Darwish
aadd6e5caa time/sched_clock: Use raw_read_seqcount_latch()
sched_clock uses seqcount_t latching to switch between two storage
places protected by the sequence counter. This allows it to have
interruptible, NMI-safe, seqcount_t write side critical sections.

Since 7fc26327b7 ("seqlock: Introduce raw_read_seqcount_latch()"),
raw_read_seqcount_latch() became the standardized way for seqcount_t
latch read paths. Due to the dependent load, it also has one read
memory barrier less than the currently used raw_read_seqcount() API.

Use raw_read_seqcount_latch() for the seqcount_t latch read path.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Link: https://lkml.kernel.org/r/20200625085745.GD117543@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200715092345.GA231464@debian-buster-darwi.lab.linutronix.de
Link: https://lore.kernel.org/r/20200716051130.4359-3-leo.yan@linaro.org
References: 1809bfa44e ("timers, sched/clock: Avoid deadlock during read from NMI")
Signed-off-by: Will Deacon <will@kernel.org>
2020-07-20 11:50:47 +01:00
Peter Zijlstra
1b86abc1c6 sched_clock: Expose struct clock_read_data
In order to support perf_event_mmap_page::cap_time features, an
architecture needs, aside from a userspace readable counter register,
to expose the exact clock data so that userspace can convert the
counter register into a correct timestamp.

Provide struct clock_read_data and two (seqcount) helpers so that
architectures (arm64 in specific) can expose the numbers to userspace.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Link: https://lore.kernel.org/r/20200716051130.4359-2-leo.yan@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
2020-07-20 11:50:47 +01:00
Linus Torvalds
66e4b63624 Two fixes for the timer wheel:
- A timer which is already expired at enqueue time can set the
    base->next_expiry value backwards. As a consequence base->clk can be set
    back as well. This can lead to timers expiring early. Add a sanity check
    to prevent this.
 
  - When a timer is queued with an expiry time beyond the wheel capacity
    then it should be queued in the bucket of the last wheel level which is
    expiring last. The code adjusts expiry time to the maximum wheel
    capacity, which is only correct when the wheel clock is 0. Aside of that
    the check whether the delta is larger than wheel capacity does not check
    the delta, it checks the expiry value itself. As a result timers can
    expire at random.
 
    Fix this by checking the right variable and adjust expiry time so it
    becomes base->clock plus capacity which places it into the outmost
    bucket in the last wheel level.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8URRQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQVMD/0VMkT36A8SKbPudMLZ5REp63E629wQ
 yuGJz9IJPE1NYB25PXL0TmVAQpseXKDKh3eSP2ac6Ao1FTUk/He/CwF2tsGvu+tm
 kxpuPQgeUF8BeF7WzE21k4NeAmTv8eaIxirQPRQBRJldHuNG9u0l1u8dr0rT2mQG
 N0djinQvM4bRUVa10l4dz6gE2F0Egjv5sIZohv3E6ORwisJxJoZUUFMlqfuS+2Xt
 lOebR8juJahIDRM3ihhZfXJI2tCPD/FnrcMWbk1z3NbsE6C2MiG4ncrjxR2MY81Q
 zRr3CrN6TgjTUkvSMOP1SuFePEKLc/2rl5dg9EcGEFNOyggPEezSB/sL1HavRsV9
 2s/hmLB6VR5GQwhMnhbLTq3jAI9M9P1S4VEoKHlDs8LoCxtQ+g+2IKmSVqKWXFuO
 6AscBbNQkEbrkTx+OkbHWYc7+RLQE87ryCNODeETzSwE0H3NLk/GRQirq6LO9ESq
 AjVg5085YZXEIzistsSON0aTdY0eIIVsmaYmFOI0qNPnSUCOPlHIXwD+ju1WEW4h
 QtM6BW6xggydgSLgOWQQzKpgBfLW3j7F4r7cFsNCjaQ7UtDQMPMMm+ATBpoT8vdA
 EHR/FC4U8ABiXpnleh87B1WCpQr6p6qo95eIbe5UxY3yPdPb32s1/+ycFngW9XPj
 B4353TQp7aNRUw==
 =aCiv
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master

Pull timer fixes from Thomas Gleixner:
 "Two fixes for the timer wheel:

   - A timer which is already expired at enqueue time can set the
     base->next_expiry value backwards. As a consequence base->clk can
     be set back as well. This can lead to timers expiring early. Add a
     sanity check to prevent this.

   - When a timer is queued with an expiry time beyond the wheel
     capacity then it should be queued in the bucket of the last wheel
     level which is expiring last.

     The code adjusted the expiry time to the maximum wheel capacity,
     which is only correct when the wheel clock is 0. Aside of that the
     check whether the delta is larger than wheel capacity does not
     check the delta, it checks the expiry value itself. As a result
     timers can expire at random.

     Fix this by checking the right variable and adjust expiry time so
     it becomes base->clock plus capacity which places it into the
     outmost bucket in the last wheel level"

* tag 'timers-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Fix wheel index calculation on last level
  timer: Prevent base->clk from moving backward
2020-07-19 12:06:08 -07:00
Linus Torvalds
43768f7ce0 A set of scheduler fixes:
- Plug a load average accounting race which was introduced with a recent
    optimization casing load average to show bogus numbers.
 
  - Fix the rseq CPU id initialization for new tasks. sched_fork() does not
    update the rseq CPU id so the id is the stale id of the parent task,
    which can cause user space data corruption.
 
  - Handle a 0 return value of task_h_load() correctly in the load balancer,
    which does not decrease imbalance and therefore pulls until the maximum
    number of loops is reached, which might be all tasks just created by a
    fork bomb.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8UQrITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTNgD/4+uP0wmIuYAJd1WmpifX+G9h+3NIiU
 zfLTxGeo+D/I+rdeS7ClyjeSTcZHl1fQfZBIopMevsEMymMu2BbQd+OeAlkESbS6
 dp6G3dv0ZGbm9Sn4G3CEEPltoCJi7pOgRrixGi/o4APkNfy3U2r+w/kM1N6AwHE0
 PYztzvq5Q++m+MEHOALsB1J8mc7vygU26EO4s/rRrV6/RnNZXL269PeZRFxxEvYn
 rtmyUw53Lc72Y+23FuityE/jb2xkr80yuXQWxTOxbhzBtHO1omWQQVhBTMam5RDg
 NUYzeZvK/nZW3i6WOuHyaaLj7+2ML7RmNpaYRueymJinda409GDXRcDOYXNFtxcI
 lcVmsxzNF5rb7b9mXqdgdSJKuZotKLnTjXAIGhHzkSkl2uYfYW6PUGxq6BmSCKvR
 GpewHQ8Ynf4JcsjioOTQjRNjJYmlrTsHcUUKXsyTIfYaEEw+i/7s/7G5G7bXxJ6G
 Sma52oTyrsFQEG+AjT2CxhOzxQumtT5vQ9/l8EvnQXQdG7fZzIimgWnTBc6IE83J
 OPYI8WomKhj+EkJSltxUQm+ZwqhDv4rBHQ+SqPr+jhvPobUN6jS0HkOoW+SIGuo4
 oMRvMiNhCyUWLFYMVL2pflJANyiFczfKWyAqyjwgiSjfNaTqSmYCcPOc1NWz/Ic4
 fGMLMqFQ2fW/rg==
 =bCPw
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master

Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler fixes:

   - Plug a load average accounting race which was introduced with a
     recent optimization casing load average to show bogus numbers.

   - Fix the rseq CPU id initialization for new tasks. sched_fork() does
     not update the rseq CPU id so the id is the stale id of the parent
     task, which can cause user space data corruption.

   - Handle a 0 return value of task_h_load() correctly in the load
     balancer, which does not decrease imbalance and therefore pulls
     until the maximum number of loops is reached, which might be all
     tasks just created by a fork bomb"

* tag 'sched-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: handle case of task_h_load() returning 0
  sched: Fix unreliable rseq cpu_id for new tasks
  sched: Fix loadavg accounting race
2020-07-19 11:55:24 -07:00
Linus Torvalds
9413cd7792 Two fixes for the interrupt subsystem:
- Make the handling of the firmware node consistent and do not free the
    node after the domain has been created successfully. The core code
    stores a pointer to it which can lead to a use after free or double
    free.
 
    This used to "work" because the pointer was not stored when the initial
    code was written, but at some point later it was required to store
    it. Of course nobody noticed that the existing users break that way.
 
  - Handle affinity setting on inactive interrupts correctly when
    hierarchical irq domains are enabled. When interrupts are inactive with
    the modern hierarchical irqdomain design, the interrupt chips are not
    necessarily in a state where affinity changes can be handled. The legacy
    irq chip design allowed this because interrupts are immediately fully
    initialized at allocation time. X86 has a hacky workaround for this, but
    other implementations do not. This cased malfunction on GIC-V3. Instead
    of playing whack a mole to find all affected drivers, change the core
    code to store the requested affinity setting and then establish it when
    the interrupt is allocated, which makes the X86 hack go away.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8UP+4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoSuZD/9tNPR4fIDt4mC9ciSvwSqGTV+q1y1D
 zhXSDro4cJNjzy/9D475IJqOlvchaF9Nfun55b60Q6vnA4VN8G+kABEaG8uwr8mV
 ijTB4f0qKfW/9kUDTJRScq3nNmC3miqg8ZFgFEn6Ecxj3NHmwidATIi5sF6f/XVG
 DdhL0Jys7ycNeGBf7yIKbT5/NOULMHYy9rK1NDAeBo9u3klvmrwrHgdNsiDDhEaU
 KlHtwuQLCdjFY3Lf67YpSah+Hx/gXPI1VHUxDDFRoFmC4RlB0VjyXGydjsisOrSQ
 Cl2gnkQ6VOlLaLbN38nmia9nyb6npzE5iK1h9EDcaRhBACG9O23Bdo+YZYxl6BOP
 mXuyIVKJYczJEp7j1fGHW/aNCoEqC8dGVyN7toxMVfGZmF12JzMSt4SYItPeSjFC
 bPNPRCscpiMOQdgwgO0woK1764V46g1BlmxXtJRdWB4iwWgXcryaz65xzSfNeZF4
 0+TvdYs2FYjxwwIyWj8xJ3Npe1lKhH+06DA6gziwJt1u4it8rl82UcqMFyf/ty1w
 o5LHyMBWYm7SJXSeaZZj+nv7moJKJnmRYKnpry21cUzsK/vQEPX0vqhwh4dSFN3O
 BaBocDsOk+9wkmUwi6haP+6+vpadAFQrsqVhURtwc6OVSWn2/vsf2ZH5P36xwFWD
 tlFanb8hX9y2NQ==
 =elM3
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master

Pull irq fixes from Thomas Gleixner:
 "Two fixes for the interrupt subsystem:

   - Make the handling of the firmware node consistent and do not free
     the node after the domain has been created successfully. The core
     code stores a pointer to it which can lead to a use after free or
     double free.

     This used to "work" because the pointer was not stored when the
     initial code was written, but at some point later it was required
     to store it. Of course nobody noticed that the existing users break
     that way.

   - Handle affinity setting on inactive interrupts correctly when
     hierarchical irq domains are enabled.

     When interrupts are inactive with the modern hierarchical irqdomain
     design, the interrupt chips are not necessarily in a state where
     affinity changes can be handled. The legacy irq chip design allowed
     this because interrupts are immediately fully initialized at
     allocation time. X86 has a hacky workaround for this, but other
     implementations do not.

     This cased malfunction on GIC-V3. Instead of playing whack a mole
     to find all affected drivers, change the core code to store the
     requested affinity setting and then establish it when the interrupt
     is allocated, which makes the X86 hack go away"

* tag 'irq-urgent-2020-07-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/affinity: Handle affinity setting on inactive interrupts correctly
  irqdomain/treewide: Keep firmware node unconditionally allocated
2020-07-19 11:53:08 -07:00
Nicolas Viennot
227175b2c9 prctl: exe link permission error changed from -EINVAL to -EPERM
This brings consistency with the rest of the prctl() syscall where
-EPERM is returned when failing a capability check.

Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-7-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19 20:14:42 +02:00
Nicolas Viennot
ebd6de6812 prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
Originally, only a local CAP_SYS_ADMIN could change the exe link,
making it difficult for doing checkpoint/restore without CAP_SYS_ADMIN.
This commit adds CAP_CHECKPOINT_RESTORE in addition to CAP_SYS_ADMIN
for permitting changing the exe link.

The following describes the history of the /proc/self/exe permission
checks as it may be difficult to understand what decisions lead to this
point.

* [1] May 2012: This commit introduces the ability of changing
  /proc/self/exe if the user is CAP_SYS_RESOURCE capable.
  In the related discussion [2], no clear thread model is presented for
  what could happen if the /proc/self/exe changes multiple times, or why
  would the admin be at the mercy of userspace.

* [3] Oct 2014: This commit introduces a new API to change
  /proc/self/exe. The permission no longer checks for CAP_SYS_RESOURCE,
  but instead checks if the current user is root (uid=0) in its local
  namespace. In the related discussion [4] it is said that "Controlling
  exe_fd without privileges may turn out to be dangerous. At least
  things like tomoyo examine it for making policy decisions (see
  tomoyo_manager())."

* [5] Dec 2016: This commit removes the restriction to change
  /proc/self/exe at most once. The related discussion [6] informs that
  the audit subsystem relies on the exe symlink, presumably
  audit_log_d_path_exe() in kernel/audit.c.

* [7] May 2017: This commit changed the check from uid==0 to local
  CAP_SYS_ADMIN. No discussion.

* [8] July 2020: A PoC to spoof any program's /proc/self/exe via ptrace
  is demonstrated

Overall, the concrete points that were made to retain capability checks
around changing the exe symlink is that tomoyo_manager() and
audit_log_d_path_exe() uses the exe_file path.

Christian Brauner said that relying on /proc/<pid>/exe being immutable (or
guarded by caps) in a sake of security is a bit misleading. It can only
be used as a hint without any guarantees of what code is being executed
once execve() returns to userspace. Christian suggested that in the
future, we could call audit_log() or similar to inform the admin of all
exe link changes, instead of attempting to provide security guarantees
via permission checks. However, this proposed change requires the
understanding of the security implications in the tomoyo/audit subsystems.

[1] b32dfe3771 ("c/r: prctl: add ability to set new mm_struct::exe_file")
[2] https://lore.kernel.org/patchwork/patch/292515/
[3] f606b77f1a ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
[4] https://lore.kernel.org/patchwork/patch/479359/
[5] 3fb4afd9a5 ("prctl: remove one-shot limitation for changing exe link")
[6] https://lore.kernel.org/patchwork/patch/697304/
[7] 4d28df6152 ("prctl: Allow local CAP_SYS_ADMIN changing exe_file")
[8] https://github.com/nviennot/run_as_exe

Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-6-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19 20:14:42 +02:00
Adrian Reber
b9a3db92e1 pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
writing to ns_last_pid.

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-4-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19 20:14:42 +02:00
Adrian Reber
1caef81da0 pid: use checkpoint_restore_ns_capable() for set_tid
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-3-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19 20:14:42 +02:00
Christoph Hellwig
23efed6fa7 dma-debug: use named initializers for dir2name
Make dir2name a little more readable and maintainable by using
named initializers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
2020-07-19 09:31:13 +02:00
Christoph Hellwig
d35834c648 dma-mapping: add a dma_ops_bypass flag to struct device
Several IOMMU drivers have a bypass mode where they can use a direct
mapping if the devices DMA mask is large enough.  Add generic support
to the core dma-mapping code to do that to switch those drivers to
a common solution.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2020-07-19 09:29:29 +02:00
Christoph Hellwig
2f9237d4f6 dma-mapping: make support for dma ops optional
Avoid the overhead of the dma ops support for tiny builds that only
use the direct mapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2020-07-19 09:29:23 +02:00
Jakub Sitnicki
1559b4aa1d inet: Run SK_LOOKUP BPF program on socket lookup
Run a BPF program before looking up a listening socket on the receive path.
Program selects a listening socket to yield as result of socket lookup by
calling bpf_sk_assign() helper and returning SK_PASS code. Program can
revert its decision by assigning a NULL socket with bpf_sk_assign().

Alternatively, BPF program can also fail the lookup by returning with
SK_DROP, or let the lookup continue as usual with SK_PASS on return, when
no socket has been selected with bpf_sk_assign().

This lets the user match packets with listening sockets freely at the last
possible point on the receive path, where we know that packets are destined
for local delivery after undergoing policing, filtering, and routing.

With BPF code selecting the socket, directing packets destined to an IP
range or to a port range to a single socket becomes possible.

In case multiple programs are attached, they are run in series in the order
in which they were attached. The end result is determined from return codes
of all the programs according to following rules:

 1. If any program returned SK_PASS and selected a valid socket, the socket
    is used as result of socket lookup.
 2. If more than one program returned SK_PASS and selected a socket,
    last selection takes effect.
 3. If any program returned SK_DROP, and no program returned SK_PASS and
    selected a socket, socket lookup fails with -ECONNREFUSED.
 4. If all programs returned SK_PASS and none of them selected a socket,
    socket lookup continues to htable-based lookup.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717103536.397595-5-jakub@cloudflare.com
2020-07-17 20:18:16 -07:00
Jakub Sitnicki
e9ddbb7707 bpf: Introduce SK_LOOKUP program type with a dedicated attach point
Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
when looking up a listening socket for a new connection request for
connection oriented protocols, or when looking up an unconnected socket for
a packet for connection-less protocols.

When called, SK_LOOKUP BPF program can select a socket that will receive
the packet. This serves as a mechanism to overcome the limits of what
bind() API allows to express. Two use-cases driving this work are:

 (1) steer packets destined to an IP range, on fixed port to a socket

     192.0.2.0/24, port 80 -> NGINX socket

 (2) steer packets destined to an IP address, on any port to a socket

     198.51.100.1, any port -> L7 proxy socket

In its run-time context program receives information about the packet that
triggered the socket lookup. Namely IP version, L4 protocol identifier, and
address 4-tuple. Context can be further extended to include ingress
interface identifier.

To select a socket BPF program fetches it from a map holding socket
references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
helper to record the selection. Transport layer then uses the selected
socket as a result of socket lookup.

In its basic form, SK_LOOKUP acts as a filter and hence must return either
SK_PASS or SK_DROP. If the program returns with SK_PASS, transport should
look for a socket to receive the packet, or use the one selected by the
program if available, while SK_DROP informs the transport layer that the
lookup should fail.

This patch only enables the user to attach an SK_LOOKUP program to a
network namespace. Subsequent patches hook it up to run on local delivery
path in ipv4 and ipv6 stacks.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200717103536.397595-3-jakub@cloudflare.com
2020-07-17 20:18:16 -07:00
Jakub Sitnicki
ce3aa9cc51 bpf, netns: Handle multiple link attachments
Extend the BPF netns link callbacks to rebuild (grow/shrink) or update the
prog_array at given position when link gets attached/updated/released.

This let's us lift the limit of having just one link attached for the new
attach type introduced by subsequent patch.

No functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200717103536.397595-2-jakub@cloudflare.com
2020-07-17 20:18:16 -07:00
André Almeida
9a71df495c futex: Remove unused or redundant includes
Since 82af7aca ("Removal of FUTEX_FD"), some includes related to file
operations aren't needed anymore. More investigation around the includes
showed that a lot of includes aren't required for compilation, possible
due to redundant includes. Simplify the code by removing unused
includes.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200702202843.520764-4-andrealmeid@collabora.com
2020-07-18 01:56:09 +02:00
André Almeida
9261308598 futex: Consistently use fshared as boolean
Since fshared is only conveying true/false values, declare it as bool.

In get_futex_key() the usage of fshared can be restricted to the first part
of the function. If fshared is false the function is terminated early and
the subsequent code can use a constant 'true' instead of the variable.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200702202843.520764-5-andrealmeid@collabora.com
2020-07-18 01:56:08 +02:00
André Almeida
d7c5ed73b1 futex: Remove needless goto's
As stated in the coding style documentation, "if there is no cleanup
needed then just return directly", instead of jumping to a label and
then returning.

Remove such goto's and replace with a return statement.  When there's a
ternary operator on the return value, replace it with the result of the
operation when it is logically possible to determine it by the control
flow.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200702202843.520764-3-andrealmeid@collabora.com
2020-07-17 23:58:49 +02:00
André Almeida
9180bd467f futex: Remove put_futex_key()
Since 4b39f99c ("futex: Remove {get,drop}_futex_key_refs()"),
put_futex_key() is empty.

Remove all references for this function and the then redundant labels.

Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200702202843.520764-2-andrealmeid@collabora.com
2020-07-17 23:58:49 +02:00
Thomas Gleixner
baedb87d1b genirq/affinity: Handle affinity setting on inactive interrupts correctly
Setting interrupt affinity on inactive interrupts is inconsistent when
hierarchical irq domains are enabled. The core code should just store the
affinity and not call into the irq chip driver for inactive interrupts
because the chip drivers may not be in a state to handle such requests.

X86 has a hacky workaround for that but all other irq chips have not which
causes problems e.g. on GIC V3 ITS.

Instead of adding more ugly hacks all over the place, solve the problem in
the core code. If the affinity is set on an inactive interrupt then:

    - Store it in the irq descriptors affinity mask
    - Update the effective affinity to reflect that so user space has
      a consistent view
    - Don't call into the irq chip driver

This is the core equivalent of the X86 workaround and works correctly
because the affinity setting is established in the irq chip when the
interrupt is activated later on.

Note, that this is only effective when hierarchical irq domains are enabled
by the architecture. Doing it unconditionally would break legacy irq chip
implementations.

For hierarchial irq domains this works correctly as none of the drivers can
have a dependency on affinity setting in inactive state by design.

Remove the X86 workaround as it is not longer required.

Fixes: 02edee152d ("x86/apic/vector: Ignore set_affinity call for inactive interrupts")
Reported-by: Ali Saidi <alisaidi@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ali Saidi <alisaidi@amazon.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200529015501.15771-1-alisaidi@amazon.com
Link: https://lkml.kernel.org/r/877dv2rv25.fsf@nanos.tec.linutronix.de
2020-07-17 23:30:43 +02:00
Frederic Weisbecker
36cd28a4cd timers: Lower base clock forwarding threshold
There is nothing that prevents from forwarding the base clock if it's one
jiffy off. The reason for this arbitrary limit of two jiffies is historical
and does not longer exist.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-13-frederic@kernel.org
2020-07-17 21:55:25 +02:00
Frederic Weisbecker
0975fb565b timers: Remove must_forward_clk
There is no reason to keep this guard around. The code makes sure that
base->clk remains sane and won't be forwarded beyond jiffies nor set
backward.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-12-frederic@kernel.org
2020-07-17 21:55:25 +02:00
Frederic Weisbecker
d4f7dae870 timers: Spare timer softirq until next expiry
Now that the core timer infrastructure doesn't depend anymore on
periodic base->clk increments, even when the CPU is not in NO_HZ mode,
timer softirqs can be skipped until there are timers to expire.

Some spurious softirqs can still remain since base->next_expiry doesn't
keep track of canceled timers but this still reduces the number of softirqs
significantly: ~15 times less for HZ=1000 and ~5 times less for HZ=100.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-11-frederic@kernel.org
2020-07-17 21:55:24 +02:00
Frederic Weisbecker
1f8a4212dc timers: Expand clk forward logic beyond nohz
As for next_expiry, the base->clk catch up logic will be expanded beyond
NOHZ in order to avoid triggering useless softirqs.

If softirqs should only fire to expire pending timers, periodic base->clk
increments must be skippable for random amounts of time.  Therefore prepare
to catch-up with missing updates whenever an up-to-date base clock is
needed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-10-frederic@kernel.org
2020-07-17 21:55:24 +02:00
Frederic Weisbecker
90d52f65f3 timers: Reuse next expiry cache after nohz exit
Now that the next expiry it tracked unconditionally when a timer is added,
this information can be reused on a tick firing after exiting nohz.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-9-frederic@kernel.org
2020-07-17 21:55:23 +02:00
Frederic Weisbecker
dc2a0f1fb2 timers: Always keep track of next expiry
So far next expiry was only tracked while the CPU was in nohz_idle mode
in order to cope with missing ticks that can't increment the base->clk
periodically anymore.

This logic is going to be expanded beyond nohz in order to spare timer
softirqs so do it unconditionally.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-8-frederic@kernel.org
2020-07-17 21:55:23 +02:00
Frederic Weisbecker
001ec1b392 timers: Optimize _next_timer_interrupt() level iteration
If a level has a timer that expires before reaching the next level, there
is no need to iterate further.

The next level is reached when the 3 lower bits of the current level are
cleared. If the next event happens before/during that, the next levels
won't provide an earlier expiration.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-7-frederic@kernel.org
2020-07-17 21:55:22 +02:00
Frederic Weisbecker
4468897211 timers: Add comments about calc_index() ceiling work
calc_index() adds 1 unit of the level granularity to the expiry passed
in parameter to ensure that the timer doesn't expire too early. Add a
comment to explain that and the resulting layout in the wheel.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-6-frederic@kernel.org
2020-07-17 21:55:22 +02:00
Frederic Weisbecker
9a2b764b06 timers: Move trigger_dyntick_cpu() to enqueue_timer()
Consolidate the code by calling trigger_dyntick_cpu() from
enqueue_timer() instead of calling it from all its callers.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200717140551.29076-5-frederic@kernel.org
2020-07-17 21:55:22 +02:00
Anna-Maria Behnsen
1f32cab0db timers: Use only bucket expiry for base->next_expiry value
The bucket expiry time is the effective expriy time of timers and is
greater than or equal to the requested timer expiry time. This is due
to the guarantee that timers never expire early and the reduced expiry
granularity in the secondary wheel levels.

When a timer is enqueued, trigger_dyntick_cpu() checks whether the
timer is the new first timer. This check compares next_expiry with
the requested timer expiry value and not with the effective expiry
value of the bucket into which the timer was queued.

Storing the requested timer expiry value in base->next_expiry can lead
to base->clk going backwards if the requested timer expiry value is
smaller than base->clk. Commit 30c66fc30e ("timer: Prevent base->clk
from moving backward") worked around this by preventing the store when
timer->expiry is before base->clk, but did not fix the underlying
problem.

Use the expiry value of the bucket into which the timer is queued to
do the new first timer check. This fixes the base->clk going backward
problem.

The workaround of commit 30c66fc30e ("timer: Prevent base->clk from
moving backward") in trigger_dyntick_cpu() is not longer necessary as the
timers bucket expiry is guaranteed to be greater than or equal base->clk.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200717140551.29076-4-frederic@kernel.org
2020-07-17 21:55:21 +02:00
Frederic Weisbecker
3d2e83a2a6 timers: Preserve higher bits of expiration on index calculation
The higher bits of the timer expiration are cropped while calling
calc_index() due to the implicit cast from unsigned long to unsigned int.

This loss shouldn't have consequences on the current code since all the
computation to calculate the index is done on the lower 32 bits.

However to prepare for returning the actual bucket expiration from
calc_index() in order to properly fix base->next_expiry updates, the higher
bits need to be preserved.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200717140551.29076-3-frederic@kernel.org
2020-07-17 21:55:21 +02:00
Frederic Weisbecker
e2a71bdea8 timer: Fix wheel index calculation on last level
When an expiration delta falls into the last level of the wheel, that delta
has be compared against the maximum possible delay and reduced to fit in if
necessary.

However instead of comparing the delta against the maximum, the code
compares the actual expiry against the maximum. Then instead of fixing the
delta to fit in, it sets the maximum delta as the expiry value.

This can result in various undesired outcomes, the worst possible one
being a timer expiring 15 days ahead to fire immediately.

Fixes: 500462a9de ("timers: Switch to a non-cascading wheel")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200717140551.29076-2-frederic@kernel.org
2020-07-17 21:44:05 +02:00
Vincent Guittot
01cfcde9c2 sched/fair: handle case of task_h_load() returning 0
task_h_load() can return 0 in some situations like running stress-ng
mmapfork, which forks thousands of threads, in a sched group on a 224 cores
system. The load balance doesn't handle this correctly because
env->imbalance never decreases and it will stop pulling tasks only after
reaching loop_max, which can be equal to the number of running tasks of
the cfs. Make sure that imbalance will be decreased by at least 1.

misfit task is the other feature that doesn't handle correctly such
situation although it's probably more difficult to face the problem
because of the smaller number of CPUs and running tasks on heterogenous
system.

We can't simply ensure that task_h_load() returns at least one because it
would imply to handle underflow in other places.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: <stable@vger.kernel.org> # v4.4+
Link: https://lkml.kernel.org/r/20200710152426.16981-1-vincent.guittot@linaro.org
2020-07-16 23:19:48 +02:00
Kees Cook
3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00
Lorenzo Bianconi
28b1520ebf bpf: cpumap: Implement XDP_REDIRECT for eBPF programs attached to map entries
Introduce XDP_REDIRECT support for eBPF programs attached to cpumap
entries.
This patch has been tested on Marvell ESPRESSObin using a modified
version of xdp_redirect_cpu sample in order to attach a XDP program
to CPUMAP entries to perform a redirect on the mvneta interface.
In particular the following scenario has been tested:

rq (cpu0) --> mvneta - XDP_REDIRECT (cpu0) --> CPUMAP - XDP_REDIRECT (cpu1) --> mvneta

$./xdp_redirect_cpu -p xdp_cpu_map0 -d eth0 -c 1 -e xdp_redirect \
	-f xdp_redirect_kern.o -m tx_port -r eth0

tx: 285.2 Kpps rx: 285.2 Kpps

Attaching a simple XDP program on eth0 to perform XDP_TX gives
comparable results:

tx: 288.4 Kpps rx: 288.4 Kpps

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/2cf8373a731867af302b00c4ff16c122630c4980.1594734381.git.lorenzo@kernel.org
2020-07-16 17:00:32 +02:00
Lorenzo Bianconi
9216477449 bpf: cpumap: Add the possibility to attach an eBPF program to cpumap
Introduce the capability to attach an eBPF program to cpumap entries.
The idea behind this feature is to add the possibility to define on
which CPU run the eBPF program if the underlying hw does not support
RSS. Current supported verdicts are XDP_DROP and XDP_PASS.

This patch has been tested on Marvell ESPRESSObin using xdp_redirect_cpu
sample available in the kernel tree to identify possible performance
regressions. Results show there are no observable differences in
packet-per-second:

$./xdp_redirect_cpu --progname xdp_cpu_map0 --dev eth0 --cpu 1
rx: 354.8 Kpps
rx: 356.0 Kpps
rx: 356.8 Kpps
rx: 356.3 Kpps
rx: 356.6 Kpps
rx: 356.6 Kpps
rx: 356.7 Kpps
rx: 355.8 Kpps
rx: 356.8 Kpps
rx: 356.8 Kpps

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/5c9febdf903d810b3415732e5cd98491d7d9067a.1594734381.git.lorenzo@kernel.org
2020-07-16 17:00:32 +02:00
Lorenzo Bianconi
644bfe51fa cpumap: Formalize map value as a named struct
As it has been already done for devmap, introduce 'struct bpf_cpumap_val'
to formalize the expected values that can be passed in for a CPUMAP.
Update cpumap code to use the struct.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/754f950674665dae6139c061d28c1d982aaf4170.1594734381.git.lorenzo@kernel.org
2020-07-16 17:00:32 +02:00
Jesper Dangaard Brouer
9b74ebb2b0 cpumap: Use non-locked version __ptr_ring_consume_batched
Commit 77361825bb ("bpf: cpumap use ptr_ring_consume_batched") changed
away from using single frame ptr_ring dequeue (__ptr_ring_consume) to
consume a batched, but it uses a locked version, which as the comment
explain isn't needed.

Change to use the non-locked version __ptr_ring_consume_batched.

Fixes: 77361825bb ("bpf: cpumap use ptr_ring_consume_batched")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/a9c7d06f9a009e282209f0c8c7b2c5d9b9ad60b9.1594734381.git.lorenzo@kernel.org
2020-07-16 17:00:31 +02:00
Christoph Hellwig
b417417300 dma-mapping: inline the fast path dma-direct calls
Inline the single page map/unmap/sync dma-direct calls into the now
out of line generic wrappers.  This restores the behavior of a single
function call that we had before moving the generic calls out of line.
Besides the dma-mapping callers there are just a few callers in IOMMU
drivers that have a bypass mode, and more of those are going to be
switched to the generic bypass soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2020-07-16 16:58:37 +02:00
Christoph Hellwig
d3fa60d7bf dma-mapping: move the remaining DMA API calls out of line
For a long time the DMA API has been implemented inline in dma-mapping.h,
but the function bodies can be quite large.  Move them all out of line.

This also removes all the dma_direct_* exports as those are just
implementation details and should never be used by drivers directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2020-07-16 16:58:33 +02:00
Peilin Ye
5b801dfb7f bpf: Fix NULL pointer dereference in __btf_resolve_helper_id()
Prevent __btf_resolve_helper_id() from dereferencing `btf_vmlinux`
as NULL. This patch fixes the following syzbot bug:

    https://syzkaller.appspot.com/bug?id=f823224ada908fa5c207902a5a62065e53ca0fcc

Reported-by: syzbot+ee09bda7017345f1fbe6@syzkaller.appspotmail.com
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200714180904.277512-1-yepeilin.cs@gmail.com
2020-07-15 22:53:39 +02:00
Sargun Dhillon
7cf97b1254 seccomp: Introduce addfd ioctl to seccomp user notifier
The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over
an fd. It is often used in settings where a supervising task emulates
syscalls on behalf of a supervised task in userspace, either to further
restrict the supervisee's syscall abilities or to circumvent kernel
enforced restrictions the supervisor deems safe to lift (e.g. actually
performing a mount(2) for an unprivileged container).

While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall,
only a certain subset of syscalls could be correctly emulated. Over the
last few development cycles, the set of syscalls which can't be emulated
has been reduced due to the addition of pidfd_getfd(2). With this we are
now able to, for example, intercept syscalls that require the supervisor
to operate on file descriptors of the supervisee such as connect(2).

However, syscalls that cause new file descriptors to be installed can not
currently be correctly emulated since there is no way for the supervisor
to inject file descriptors into the supervisee. This patch adds a
new addfd ioctl to remove this restriction by allowing the supervisor to
install file descriptors into the intercepted task. By implementing this
feature via seccomp the supervisor effectively instructs the supervisee
to install a set of file descriptors into its own file descriptor table
during the intercepted syscall. This way it is possible to intercept
syscalls such as open() or accept(), and install (or replace, like
dup2(2)) the supervisor's resulting fd into the supervisee. One
replacement use-case would be to redirect the stdout and stderr of a
supervisee into log file descriptors opened by the supervisor.

The ioctl handling is based on the discussions[1] of how Extensible
Arguments should interact with ioctls. Instead of building size into
the addfd structure, make it a function of the ioctl command (which
is how sizes are normally passed to ioctls). To support forward and
backward compatibility, just mask out the direction and size, and match
everything. The size (and any future direction) checks are done along
with copy_struct_from_user() logic.

As a note, the seccomp_notif_addfd structure is laid out based on 8-byte
alignment without requiring packing as there have been packing issues
with uapi highlighted before[2][3]. Although we could overload the
newfd field and use -1 to indicate that it is not to be used, doing
so requires changing the size of the fd field, and introduces struct
packing complexity.

[1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/
[2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/
[3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal

Cc: Christoph Hellwig <hch@lst.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Jann Horn <jannh@google.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Suggested-by: Matt Denton <mpdenton@google.com>
Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Will Drewry <wad@chromium.org>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-14 16:29:42 -07:00
Alexei Starovoitov
ec2ffdf65f Merge branch 'usermode-driver-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace into bpf-next 2020-07-14 12:18:01 -07:00
Alexey Dobriyan
02d7f4005e PM: sleep: spread "const char *" correctness
Fixed string literals can be referred to as "const char *".

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
[ rjw: Minor subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-07-14 19:25:41 +02:00
Xiang Chen
6ada7ba2fa PM: hibernate: fix white space in a few places
In hibernate.c, some places lack of spaces while some places have
redundant spaces. So fix them.

Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-07-14 19:25:41 +02:00
Nicolas Saenz Julienne
d9765e41d8 dma-pool: do not allocate pool memory from CMA
There is no guarantee to CMA's placement, so allocating a zone specific
atomic pool from CMA might return memory from a completely different
memory zone. So stop using it.

Fixes: c84dc6e68a ("dma-pool: add additional coherent pools to map to gfp mask")
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-14 15:46:32 +02:00
Nicolas Saenz Julienne
81e9d894e0 dma-pool: make sure atomic pool suits device
When allocating DMA memory from a pool, the core can only guess which
atomic pool will fit a device's constraints. If it doesn't, get a safer
atomic pool and try again.

Fixes: c84dc6e68a ("dma-pool: add additional coherent pools to map to gfp mask")
Reported-by: Jeremy Linton <jeremy.linton@arm.com>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-14 15:42:33 +02:00
Nicolas Saenz Julienne
48b6703858 dma-pool: introduce dma_guess_pool()
dma-pool's dev_to_pool() creates the false impression that there is a
way to grantee a mapping between a device's DMA constraints and an
atomic pool. It tuns out it's just a guess, and the device might need to
use an atomic pool containing memory from a 'safer' (or lower) memory
zone.

To help mitigate this, introduce dma_guess_pool() which can be fed a
device's DMA constraints and atomic pools already known to be faulty, in
order for it to provide an better guess on which pool to use.

Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-14 15:42:30 +02:00
Nicolas Saenz Julienne
23e469be62 dma-pool: get rid of dma_in_atomic_pool()
The function is only used once and can be simplified to a one-liner.

Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-14 15:42:27 +02:00
Nicolas Saenz Julienne
567f6a6eba dma-direct: provide function to check physical memory area validity
dma_coherent_ok() checks if a physical memory area fits a device's DMA
constraints.

Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-14 15:42:24 +02:00
David S. Miller
07dd1b7e68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-07-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 36 non-merge commits during the last 7 day(s) which contain
a total of 62 files changed, 2242 insertions(+), 468 deletions(-).

The main changes are:

1) Avoid trace_printk warning banner by switching bpf_trace_printk to use
   its own tracing event, from Alan.

2) Better libbpf support on older kernels, from Andrii.

3) Additional AF_XDP stats, from Ciara.

4) build time resolution of BTF IDs, from Jiri.

5) BPF_CGROUP_INET_SOCK_RELEASE hook, from Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-13 18:04:05 -07:00
Alan Maguire
ac5a72ea5c bpf: Use dedicated bpf_trace_printk event instead of trace_printk()
The bpf helper bpf_trace_printk() uses trace_printk() under the hood.
This leads to an alarming warning message originating from trace
buffer allocation which occurs the first time a program using
bpf_trace_printk() is loaded.

We can instead create a trace event for bpf_trace_printk() and enable
it in-kernel when/if we encounter a program using the
bpf_trace_printk() helper.  With this approach, trace_printk()
is not used directly and no warning message appears.

This work was started by Steven (see Link) and finished by Alan; added
Steven's Signed-off-by with his permission.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/r/20200628194334.6238b933@oasis.local.home
Link: https://lore.kernel.org/bpf/1594641154-18897-2-git-send-email-alan.maguire@oracle.com
2020-07-13 16:55:49 -07:00
Kees Cook
910d2f16ac pidfd: Replace open-coded receive_fd()
Replace the open-coded version of receive_fd() with a call to the
new helper.

Thanks to Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> for
catching a missed fput() in an earlier version of this patch.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-13 11:03:45 -07:00
Kees Cook
4969f8a073 pidfd: Add missing sock updates for pidfd_getfd()
The sock counting (sock_update_netprioidx() and sock_update_classid())
was missing from pidfd's implementation of received fd installation. Add
a call to the new __receive_sock() helper.

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 8649c322f7 ("pid: Implement pidfd_getfd syscall")
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-13 11:03:44 -07:00
Jiri Olsa
49f4e67207 bpf: Use BTF_ID to resolve bpf_ctx_convert struct
This way the ID is resolved during compile time,
and we can remove the runtime name search.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200711215329.41165-7-jolsa@kernel.org
2020-07-13 10:42:03 -07:00
Jiri Olsa
138b9a0511 bpf: Remove btf_id helpers resolving
Now when we moved the helpers btf_id arrays into .BTF_ids section,
we can remove the code that resolve those IDs in runtime.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200711215329.41165-6-jolsa@kernel.org
2020-07-13 10:42:02 -07:00
Jiri Olsa
c9a0f3b85e bpf: Resolve BTF IDs in vmlinux image
Using BTF_ID_LIST macro to define lists for several helpers
using BTF arguments.

And running resolve_btfids on vmlinux elf object during linking,
so the .BTF_ids section gets the IDs resolved.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200711215329.41165-5-jolsa@kernel.org
2020-07-13 10:42:02 -07:00
Bruno Meneguele
bc885f1ab6 doc:kmsg: explicitly state the return value in case of SEEK_CUR
The commit 625d344978 ("Revert "kernel/printk: add kmsg SEEK_CUR
handling"") reverted a change done to the return value in case a SEEK_CUR
operation was performed for kmsg buffer based on the fact that different
userspace apps were handling the new return value (-ESPIPE) in different
ways, breaking them.

At the same time -ESPIPE was the wrong decision because kmsg /does support/
seek() but doesn't follow the "normal" behavior userspace is used to.
Because of that and also considering the time -EINVAL has been used, it was
decided to keep this way to avoid more userspace breakage.

This patch adds an official statement to the kmsg documentation pointing to
the current return value for SEEK_CUR, -EINVAL, thus userspace libraries
and apps can refer to it for a definitive guide on what to expect.

Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200710174423.10480-1-bmeneg@redhat.com
2020-07-13 15:07:45 +02:00
Linus Torvalds
9901a6bd15 RISC-V Fixes for 5.8-rc5 (ideally)
I have a few KGDB-related fixes that I'd like to target for 5.8-rc5.  They're
 mostly fixes for build warnings, but there's also:
 
 * Support for the qSupported and qXfer packets, which are necessary to pass
   around GDB XML information which we need for the RISC-V GDB port to fully
   function.
 * Users can now select STRICT_KERNEL_RWX instead of forcing it on.
 
 I know it's a bit late for rc5, as these are not critical it's not a big deal
 if they don't make it in.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAl8KH4UTHHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYifPsEACcpQJRzLaYxjTP6INLtUK2J1jvx3Md
 D0QfzGQsWLOtqtk37vXUt+0KPS8vErvDHzfD1ZkHKDVFIVt4ZEVfDyPPx74nuvns
 qpyFkHuv2f+icTf+YnZyH+MZW8iFesOwqbfXC5YnhI/vcqeieafd8U3t3oDik5SI
 NuT0uiWAiTqUPan2vu1xrBBynxpCyCM/U/ZONf3J38wL6Mck0GTc2NjAsAsmpnZJ
 pxhkGFiDIuOUuJDCDbQBoC5bWamDYYZOuhrjMizILdqiDlxdBSTSmLWpCfXtp7ls
 xZL+/QV0BSR8ymSnMMAowXCrK+TTFY62bxOLhpvk5uDGEtW6F9jOh7VsW8vAtz+x
 WmqcgTtPrtyvNn4hM/1Md0IV58pKU+VaeLeKQQu3V5jH6h3s+YSSyWtuheLsnhI8
 KWdd88xU0Tp7ym7BcaQqXM6UbmT61YAyr1R2VcwsiSz/uRwpKYdfo12FDmTr6FxN
 Br5HL0okfmDnE9KgEhEY9kbRt3FM2aoLvYlVTdRX5yAnoF1/Dnh0Jry5kOkD6OuO
 lIbzvwzziTqA/STJ5UuoXRrUfwHQ+XLEMo9zGhEAv6mfXYoIkX9txVeIKFrIDkKU
 dBGKL3mSruntDp/FfCgksDlZUy111VcwwdxpeplHCcyI8YGPsaavO9B8qkI5iJbG
 WEukopxoA5Yj0g==
 =kyQD
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:
 "I have a few KGDB-related fixes. They're mostly fixes for build
  warnings, but there's also:

   - Support for the qSupported and qXfer packets, which are necessary
     to pass around GDB XML information which we need for the RISC-V GDB
     port to fully function.

   - Users can now select STRICT_KERNEL_RWX instead of forcing it on"

* tag 'riscv-for-linus-5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Avoid kgdb.h including gdb_xml.h to solve unused-const-variable warning
  kgdb: Move the extern declaration kgdb_has_hit_break() to generic kgdb.h
  riscv: Fix "no previous prototype" compile warning in kgdb.c file
  riscv: enable the Kconfig prompt of STRICT_KERNEL_RWX
  kgdb: enable arch to support XML packet.
2020-07-11 19:22:46 -07:00
David S. Miller
71930d6102 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
All conflicts seemed rather trivial, with some guidance from
Saeed Mameed on the tc_ct.c one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-11 00:46:00 -07:00
Linus Torvalds
5a764898af Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Restore previous behavior of CAP_SYS_ADMIN wrt loading networking
    BPF programs, from Maciej Żenczykowski.

 2) Fix dropped broadcasts in mac80211 code, from Seevalamuthu
    Mariappan.

 3) Slay memory leak in nl80211 bss color attribute parsing code, from
    Luca Coelho.

 4) Get route from skb properly in ip_route_use_hint(), from Miaohe Lin.

 5) Don't allow anything other than ARPHRD_ETHER in llc code, from Eric
    Dumazet.

 6) xsk code dips too deeply into DMA mapping implementation internals.
    Add dma_need_sync and use it. From Christoph Hellwig

 7) Enforce power-of-2 for BPF ringbuf sizes. From Andrii Nakryiko.

 8) Check for disallowed attributes when loading flow dissector BPF
    programs. From Lorenz Bauer.

 9) Correct packet injection to L3 tunnel devices via AF_PACKET, from
    Jason A. Donenfeld.

10) Don't advertise checksum offload on ipa devices that don't support
    it. From Alex Elder.

11) Resolve several issues in TCP MD5 signature support. Missing memory
    barriers, bogus options emitted when using syncookies, and failure
    to allow md5 key changes in established states. All from Eric
    Dumazet.

12) Fix interface leak in hsr code, from Taehee Yoo.

13) VF reset fixes in hns3 driver, from Huazhong Tan.

14) Make loopback work again with ipv6 anycast, from David Ahern.

15) Fix TX starvation under high load in fec driver, from Tobias
    Waldekranz.

16) MLD2 payload lengths not checked properly in bridge multicast code,
    from Linus Lüssing.

17) Packet scheduler code that wants to find the inner protocol
    currently only works for one level of VLAN encapsulation. Allow
    Q-in-Q situations to work properly here, from Toke
    Høiland-Jørgensen.

18) Fix route leak in l2tp, from Xin Long.

19) Resolve conflict between the sk->sk_user_data usage of bpf reuseport
    support and various protocols. From Martin KaFai Lau.

20) Fix socket cgroup v2 reference counting in some situations, from
    Cong Wang.

21) Cure memory leak in mlx5 connection tracking offload support, from
    Eli Britstein.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits)
  mlxsw: pci: Fix use-after-free in case of failed devlink reload
  mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON()
  net: macb: fix call to pm_runtime in the suspend/resume functions
  net: macb: fix macb_suspend() by removing call to netif_carrier_off()
  net: macb: fix macb_get/set_wol() when moving to phylink
  net: macb: mark device wake capable when "magic-packet" property present
  net: macb: fix wakeup test in runtime suspend/resume routines
  bnxt_en: fix NULL dereference in case SR-IOV configuration fails
  libbpf: Fix libbpf hashmap on (I)LP32 architectures
  net/mlx5e: CT: Fix memory leak in cleanup
  net/mlx5e: Fix port buffers cell size value
  net/mlx5e: Fix 50G per lane indication
  net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash
  net/mlx5e: Fix VXLAN configuration restore after function reload
  net/mlx5e: Fix usage of rcu-protected pointer
  net/mxl5e: Verify that rpriv is not NULL
  net/mlx5: E-Switch, Fix vlan or qos setting in legacy mode
  net/mlx5: Fix eeprom support for SFP module
  cgroup: Fix sock_cgroup_data on big-endian.
  selftests: bpf: Fix detach from sockmap tests
  ...
2020-07-10 18:16:22 -07:00
Kees Cook
fe4bfff86e seccomp: Use -1 marker for end of mode 1 syscall list
The terminator for the mode 1 syscalls list was a 0, but that could be
a valid syscall number (e.g. x86_64 __NR_read). By luck, __NR_read was
listed first and the loop construct would not test it, so there was no
bug. However, this is fragile. Replace the terminator with -1 instead,
and make the variable name for mode 1 syscall lists more descriptive.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:52 -07:00
Kees Cook
47e33c05f9 seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong
direction flag set. While this isn't a big deal as nothing currently
enforces these bits in the kernel, it should be defined correctly. Fix
the define and provide support for the old command until it is no longer
needed for backward compatibility.

Fixes: 6a21cc50f0 ("seccomp: add a return code to trap to userspace")
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:52 -07:00
Kees Cook
e68f9d49dd seccomp: Use pr_fmt
Avoid open-coding "seccomp: " prefixes for pr_*() calls.

Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:52 -07:00
Christian Brauner
99cdb8b9a5 seccomp: notify about unused filter
We've been making heavy use of the seccomp notifier to intercept and
handle certain syscalls for containers. This patch allows a syscall
supervisor listening on a given notifier to be notified when a seccomp
filter has become unused.

A container is often managed by a singleton supervisor process the
so-called "monitor". This monitor process has an event loop which has
various event handlers registered. If the user specified a seccomp
profile that included a notifier for various syscalls then we also
register a seccomp notify even handler. For any container using a
separate pid namespace the lifecycle of the seccomp notifier is bound to
the init process of the pid namespace, i.e. when the init process exits
the filter must be unused.

If a new process attaches to a container we force it to assume a seccomp
profile. This can either be the same seccomp profile as the container
was started with or a modified one. If the attaching process makes use
of the seccomp notifier we will register a new seccomp notifier handler
in the monitor's event loop. However, when the attaching process exits
we can't simply delete the handler since other child processes could've
been created (daemons spawned etc.) that have inherited the seccomp
filter and so we need to keep the seccomp notifier fd alive in the event
loop. But this is problematic since we don't get a notification when the
seccomp filter has become unused and so we currently never remove the
seccomp notifier fd from the event loop and just keep accumulating fds
in the event loop. We've had this issue for a while but it has recently
become more pressing as more and larger users make use of this.

To fix this, we introduce a new "users" reference counter that tracks any
tasks and dependent filters making use of a filter. When a notifier is
registered waiting tasks will be notified that the filter is now empty
by receiving a (E)POLLHUP event.

The concept in this patch introduces is the same as for signal_struct,
i.e. reference counting for life-cycle management is decoupled from
reference counting taks using the object. There's probably some trickery
possible but the second counter is just the correct way of doing this
IMHO and has precedence.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200531115031.391515-3-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:51 -07:00
Christian Brauner
76194c4e83 seccomp: Lift wait_queue into struct seccomp_filter
Lift the wait_queue from struct notification into struct seccomp_filter.
This is cleaner overall and lets us avoid having to take the notifier
mutex in the future for EPOLLHUP notifications since we need to neither
read nor modify the notifier specific aspects of the seccomp filter. In
the exit path I'd very much like to avoid having to take the notifier mutex
for each filter in the task's filter hierarchy.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:51 -07:00
Christian Brauner
3a15fb6ed9 seccomp: release filter after task is fully dead
The seccomp filter used to be released in free_task() which is called
asynchronously via call_rcu() and assorted mechanisms. Since we need
to inform tasks waiting on the seccomp notifier when a filter goes empty
we will notify them as soon as a task has been marked fully dead in
release_task(). To not split seccomp cleanup into two parts, move
filter release out of free_task() and into release_task() after we've
unhashed struct task from struct pid, exited signals, and unlinked it
from the threadgroups' thread list. We'll put the empty filter
notification infrastructure into it in a follow up patch.

This also renames put_seccomp_filter() to seccomp_filter_release() which
is a more descriptive name of what we're doing here especially once
we've added the empty filter notification mechanism in there.

We're also NULL-ing the task's filter tree entrypoint which seems
cleaner than leaving a dangling pointer in there. Note that this shouldn't
need any memory barriers since we're calling this when the task is in
release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
filters anymore. You can also see this from the point where we're calling
seccomp_filter_release(). It's after __exit_signal() and at this point,
tsk->sighand will already have been NULLed which is required for
thread-sync and filter installation alike.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:51 -07:00
Christian Brauner
b707ddee11 seccomp: rename "usage" to "refs" and document
Naming the lifetime counter of a seccomp filter "usage" suggests a
little too strongly that its about tasks that are using this filter
while it also tracks other references such as the user notifier or
ptrace. This also updates the documentation to note this fact.

We'll be introducing an actual usage counter in a follow-up patch.

Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Denton <mpdenton@google.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Chris Palmer <palmer@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Robert Sesek <rsesek@google.com>
Cc: Jeffrey Vander Stoep <jeffv@google.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200531115031.391515-1-christian.brauner@ubuntu.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:51 -07:00
Sargun Dhillon
9f87dcf14b seccomp: Add find_notification helper
This adds a helper which can iterate through a seccomp_filter to
find a notification matching an ID. It removes several replicated
chunks of code.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Tycho Andersen <tycho@tycho.ws>
Cc: Matt Denton <mpdenton@google.com>
Cc: Kees Cook <keescook@google.com>,
Cc: Jann Horn <jannh@google.com>,
Cc: Robert Sesek <rsesek@google.com>,
Cc: Chris Palmer <palmer@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Tycho Andersen <tycho@tycho.ws>
Link: https://lore.kernel.org/r/20200601112532.150158-1-sargun@sargun.me
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:51 -07:00
Kees Cook
c818c03b66 seccomp: Report number of loaded filters in /proc/$pid/status
A common question asked when debugging seccomp filters is "how many
filters are attached to your process?" Provide a way to easily answer
this question through /proc/$pid/status with a "Seccomp_filters" line.

Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 16:01:51 -07:00
Linus Torvalds
1bfde03742 dma-mapping fixes for 5.8
- add a warning when the atomic pool is depleted (David Rientjes)
  - protect the parameters of the new scatterlist helper macros
    (Marek Szyprowski )
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl8IjBILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYN10RAAjCGeb2ImNmGHgqZEbJ5KM99g/gVeGJO2aUOLQWCx
 qr3Jx0PX6TaGi/tg4OMJFwA8oErHh6bZO1OWVp7PShmeEHRdRp+FPmcb0PzRM1pO
 gNxgouJIj+B47enkFwRjLpiST5YVoP90Sn61I8Vr9hiC88TaLho0Kj2hkvTcKRln
 NCahkT9NTQpoC1iFR+lMje1yodEzWum3+aAEmjIaebeMJor1v8RRGkYXJASdD1V2
 whchfZCWM6Jhr9PUAL3NnTbQXccI7qOkCCsxssW652SysIN6dV8XmBmoH/VUC5QE
 soScl93T0EZvBdUreEvKSjVO3BOCRuemuzQ9myFk4c/olKGqQO675G1sCs9RIawz
 UEAtWEWYC/CluKvzjJuJl2pGmfNRuazsylLA6WDQGqQoe8uJ/9qKKpCr9jRn3shl
 dUccyFQWrmXrh76qXPvB05D0/qb4JNVhyXYLiD8DhzR3DlH1d5z52TWDT9g/J84Q
 usq69gwZq65MZYMHWRlRRXYdEuvQxgEZvl2ecYA/ZaW1wh6XYGBCQI5CtG5E2sOP
 8THs5E+u1PQaJWqdIR57xCuNxpWS+r6nv0N7z4vIQtwVkXPO3lS7aVNClIOY1u5/
 m7SEeJ4ZBtVZsA4nQbG3sxAiA1GT8nm5JugwfOIgmMyxrpRbNWj8IrIe49Ckbhqa
 YZQ=
 =KI5i
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.8-5' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - add a warning when the atomic pool is depleted (David Rientjes)

 - protect the parameters of the new scatterlist helper macros (Marek
   Szyprowski )

* tag 'dma-mapping-5.8-5' of git://git.infradead.org/users/hch/dma-mapping:
  scatterlist: protect parameters of the sg_table related macros
  dma-mapping: warn when coherent pool is depleted
2020-07-10 09:36:03 -07:00
Peter Zijlstra
f9ad4a5f3f lockdep: Remove lockdep_hardirq{s_enabled,_context}() argument
Now that the macros use per-cpu data, we no longer need the argument.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200623083721.571835311@infradead.org
2020-07-10 12:00:02 +02:00
Peter Zijlstra
a21ee6055c lockdep: Change hardirq{s_enabled,_context} to per-cpu variables
Currently all IRQ-tracking state is in task_struct, this means that
task_struct needs to be defined before we use it.

Especially for lockdep_assert_irq*() this can lead to header-hell.

Move the hardirq state into per-cpu variables to avoid the task_struct
dependency.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200623083721.512673481@infradead.org
2020-07-10 12:00:02 +02:00
Peter Zijlstra
859d069ee1 lockdep: Prepare for NMI IRQ state tracking
There is no reason not to always, accurately, track IRQ state.

This change also makes IRQ state tracking ignore lockdep_off().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200623083721.155449112@infradead.org
2020-07-10 12:00:01 +02:00
Marco Elver
248591f5d2 kcsan: Make KCSAN compatible with new IRQ state tracking
The new IRQ state tracking code does not honor lockdep_off(), and as
such we should again permit tracing by using non-raw functions in
core.c. Update the lockdep_off() comment in report.c, to reflect the
fact there is still a potential risk of deadlock due to using printk()
from scheduler code.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200624113246.GA170324@elver.google.com
2020-07-10 12:00:00 +02:00
John Ogness
896fbe20b4 printk: use the lockless ringbuffer
Replace the existing ringbuffer usage and implementation with
lockless ringbuffer usage. Even though the new ringbuffer does not
require locking, all existing locking is left in place. Therefore,
this change is purely replacing the underlining ringbuffer.

Changes that exist due to the ringbuffer replacement:

- The VMCOREINFO has been updated for the new structures.

- Dictionary data is now stored in a separate data buffer from the
  human-readable messages. The dictionary data buffer is set to the
  same size as the message buffer. Therefore, the total required
  memory for both dictionary and message data is
  2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and
  2 * log_buf_len (the kernel parameter) for the dynamic buffers.

- Record meta-data is now stored in a separate array of descriptors.
  This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes
  for the static array and 72 * (log_buf_len >> 5) bytes for the
  dynamic array.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200709132344.760-5-john.ogness@linutronix.de
2020-07-10 08:48:55 +02:00
John Ogness
8749efc0c0 Revert "printk: lock/unlock console only for new logbuf entries"
This reverts commit 3ac37a93fa.

This optimization will not apply once the transition to a lockless
printk is complete. Rather than porting this optimization through
the transition only to remove it anyway, just revert it now to
simplify the transition.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200709132344.760-4-john.ogness@linutronix.de
2020-07-10 08:48:35 +02:00
John Ogness
b6cf8b3f33 printk: add lockless ringbuffer
Introduce a multi-reader multi-writer lockless ringbuffer for storing
the kernel log messages. Readers and writers may use their API from
any context (including scheduler and NMI). This ringbuffer will make
it possible to decouple printk() callers from any context, locking,
or console constraints. It also makes it possible for readers to have
full access to the ringbuffer contents at any time and context (for
example from any panic situation).

The printk_ringbuffer is made up of 3 internal ringbuffers:

desc_ring:
A ring of descriptors. A descriptor contains all record meta data
(sequence number, timestamp, loglevel, etc.) as well as internal state
information about the record and logical positions specifying where in
the other ringbuffers the text and dictionary strings are located.

text_data_ring:
A ring of data blocks. A data block consists of an unsigned long
integer (ID) that maps to a desc_ring index followed by the text
string of the record.

dict_data_ring:
A ring of data blocks. A data block consists of an unsigned long
integer (ID) that maps to a desc_ring index followed by the dictionary
string of the record.

The internal state information of a descriptor is the key element to
allow readers and writers to locklessly synchronize access to the data.

Co-developed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200709132344.760-3-john.ogness@linutronix.de
2020-07-10 08:48:19 +02:00
Vincent Chen
8c080d3a97
kgdb: enable arch to support XML packet.
The XML packet could be supported by required architecture if the
architecture defines CONFIG_HAVE_ARCH_KGDB_QXFER_PKT and implement its own
kgdb_arch_handle_qxfer_pkt(). Except for the kgdb_arch_handle_qxfer_pkt(),
the architecture also needs to record the feature supported by gdb stub
into the kgdb_arch_gdb_stub_feature, and these features will be reported
to host gdb when gdb stub receives the qSupported packet.

Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2020-07-09 20:09:28 -07:00
Wei Yang
36b8aacf2a tracing: Save one trace_event->type by using __TRACE_LAST_TYPE
Static defined trace_event->type stops at (__TRACE_LAST_TYPE - 1) and
dynamic trace_event->type starts from (__TRACE_LAST_TYPE + 1).

To save one trace_event->type index, let's use __TRACE_LAST_TYPE.

Link: https://lkml.kernel.org/r/20200703020612.12930-3-richard.weiyang@linux.alibaba.com

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-09 18:14:58 -04:00
Wei Yang
746cf3459f tracing: Simplify defining of the next event id
The value to be used and compared in trace_search_list() is "last + 1".
Let's just define next to be "last + 1" instead of doing the addition
each time.

Link: https://lkml.kernel.org/r/20200703020612.12930-2-richard.weiyang@linux.alibaba.com

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-09 18:00:47 -04:00
Linus Torvalds
ce69fb3b39 Refactor kallsyms_show_value() users for correct cred
Several users of kallsyms_show_value() were performing checks not
 during "open". Refactor everything needed to gain proper checks against
 file->f_cred for modules, kprobes, and bpf.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8GUbMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJohnD/9VsPsAMV+8lhsPvkcuW/DkTcAY
 qUEzsXU3v06gJ0Z/1lBKtisJ6XmD93wWcZCTFvJ0S8vR3yLZvOVfToVjCMO32Trc
 4ZkWTPwpvfeLug6T6CcI2ukQdZ/opI1cSabqGl79arSBgE/tsghwrHuJ8Exkz4uq
 0b7i8nZa+RiTezwx4EVeGcg6Dv1tG5UTG2VQvD/+QGGKneBlrlaKlI885N/6jsHa
 KxvB7+8ES1pnfGYZenx+RxMdljNrtyptbQEU8gyvoV5YR7635gjZsVsPwWANJo+4
 EGcFFpwWOAcVQaC3dareLTM8nVngU6Wl3Rd7JjZtjvtZba8DdCn669R34zDGXbiP
 +1n1dYYMSMBeqVUbAQfQyLD0pqMIHdwQj2TN8thSGccr2o3gNk6AXgYq0aYm8IBf
 xDCvAansJw9WqmxErIIsD4BFkMqF7MjH3eYZxwCPWSrKGDvKxQSPV5FarnpDC9U7
 dYCWVxNPmtn+unC/53yXjEcBepKaYgNR7j5G7uOfkHvU43Bd5demzLiVJ10D8abJ
 ezyErxxEqX2Gr7JR2fWv7iBbULJViqcAnYjdl0y0NgK/hftt98iuge6cZmt1z6ai
 24vI3X4VhvvVN5/f64cFDAdYtMRUtOo2dmxdXMid1NI07Mj2qFU1MUwb8RHHlxbK
 8UegV2zcrBghnVuMkw==
 =ib5Q
 -----END PGP SIGNATURE-----

Merge tag 'kallsyms_show_value-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull kallsyms fix from Kees Cook:
 "Refactor kallsyms_show_value() users for correct cred.

  I'm not delighted by the timing of getting these changes to you, but
  it does fix a handful of kernel address exposures, and no one has
  screamed yet at the patches.

  Several users of kallsyms_show_value() were performing checks not
  during "open". Refactor everything needed to gain proper checks
  against file->f_cred for modules, kprobes, and bpf"

* tag 'kallsyms_show_value-v5.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests: kmod: Add module address visibility test
  bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok()
  kprobes: Do not expose probe addresses to non-CAP_SYSLOG
  module: Do not expose section addresses to non-CAP_SYSLOG
  module: Refactor section attr into bin attribute
  kallsyms: Refactor kallsyms_show_value() to take cred
2020-07-09 13:09:30 -07:00
Martin KaFai Lau
c9a368f1c0 bpf: net: Avoid incorrect bpf_sk_reuseport_detach call
bpf_sk_reuseport_detach is currently called when sk->sk_user_data
is not NULL.  It is incorrect because sk->sk_user_data may not be
managed by the bpf's reuseport_array.  It has been reported in [1] that,
the bpf_sk_reuseport_detach() which is called from udp_lib_unhash() has
corrupted the sk_user_data managed by l2tp.

This patch solves it by using another bit (defined as SK_USER_DATA_BPF)
of the sk_user_data pointer value.  It marks that a sk_user_data is
managed/owned by BPF.

The patch depends on a PTRMASK introduced in
commit f1ff5ce2cd ("net, sk_msg: Clear sk_user_data pointer on clone if tagged").

[ Note: sk->sk_user_data is used by bpf's reuseport_array only when a sk is
  added to the bpf's reuseport_array.
  i.e. doing setsockopt(SO_REUSEPORT) and having "sk->sk_reuseport == 1"
  alone will not stop sk->sk_user_data being used by other means. ]

[1]: https://lore.kernel.org/netdev/20200706121259.GA20199@katalix.com/

Fixes: 5dc4c4b7d4 ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY")
Reported-by: James Chapman <jchapman@katalix.com>
Reported-by: syzbot+9f092552ba9a5efca5df@syzkaller.appspotmail.com
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: James Chapman <jchapman@katalix.com>
Acked-by: James Chapman <jchapman@katalix.com>
Link: https://lore.kernel.org/bpf/20200709061110.4019316-1-kafai@fb.com
2020-07-09 22:03:31 +02:00
Martin KaFai Lau
f3dda7a679 bpf: net: Avoid copying sk_user_data of reuseport_array during sk_clone
It makes little sense for copying sk_user_data of reuseport_array during
sk_clone_lock().  This patch reuses the SK_USER_DATA_NOCOPY bit introduced in
commit f1ff5ce2cd ("net, sk_msg: Clear sk_user_data pointer on clone if tagged").
It is used to mark the sk_user_data is not supposed to be copied to its clone.

Although the cloned sk's sk_user_data will not be used/freed in
bpf_sk_reuseport_detach(), this change can still allow the cloned
sk's sk_user_data to be used by some other means.

Freeing the reuseport_array's sk_user_data does not require a rcu grace
period.  Thus, the existing rcu_assign_sk_user_data_nocopy() is not
used.

Fixes: 5dc4c4b7d4 ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200709061104.4018798-1-kafai@fb.com
2020-07-09 22:03:31 +02:00
Frederic Weisbecker
30c66fc30e timer: Prevent base->clk from moving backward
When a timer is enqueued with a negative delta (ie: expiry is below
base->clk), it gets added to the wheel as expiring now (base->clk).

Yet the value that gets stored in base->next_expiry, while calling
trigger_dyntick_cpu(), is the initial timer->expires value. The
resulting state becomes:

	base->next_expiry < base->clk

On the next timer enqueue, forward_timer_base() may accidentally
rewind base->clk. As a possible outcome, timers may expire way too
early, the worst case being that the highest wheel levels get spuriously
processed again.

To prevent from that, make sure that base->next_expiry doesn't get below
base->clk.

Fixes: a683f390b9 ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200703010657.2302-1-frederic@kernel.org
2020-07-09 11:56:57 +02:00
Richard Guy Briggs
d7481b24b8 audit: issue CWD record to accompany LSM_AUDIT_DATA_* records
The LSM_AUDIT_DATA_* records for PATH, FILE, IOCTL_OP, DENTRY and INODE
are incomplete without the task context of the AUDIT Current Working
Directory record.  Add it.

This record addition can't use audit_dummy_context to determine whether
or not to store the record information since the LSM_AUDIT_DATA_*
records are initiated by various LSMs independent of any audit rules.
context->in_syscall is used to determine if it was called in user
context like audit_getname.

Please see the upstream issue
https://github.com/linux-audit/audit-kernel/issues/96

Adapted from Vladis Dronov's v2 patch.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-07-08 19:02:11 -04:00
Kees Cook
6396026045 bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok()
When evaluating access control over kallsyms visibility, credentials at
open() time need to be used, not the "current" creds (though in BPF's
case, this has likely always been the same). Plumb access to associated
file->f_cred down through bpf_dump_raw_ok() and its callers now that
kallsysm_show_value() has been refactored to take struct cred.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 7105e828c0 ("bpf: allow for correlation of maps and helpers in dump")
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-08 16:01:21 -07:00
Kees Cook
60f7bb66b8 kprobes: Do not expose probe addresses to non-CAP_SYSLOG
The kprobe show() functions were using "current"'s creds instead
of the file opener's creds for kallsyms visibility. Fix to use
seq_file->file->f_cred.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 81365a947d ("kprobes: Show address of kprobes if kallsyms does")
Fixes: ffb9bd68eb ("kprobes: Show blacklist addresses as same as kallsyms does")
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-08 16:00:22 -07:00
Kees Cook
b25a7c5af9 module: Do not expose section addresses to non-CAP_SYSLOG
The printing of section addresses in /sys/module/*/sections/* was not
using the correct credentials to evaluate visibility.

Before:

 # cat /sys/module/*/sections/.*text
 0xffffffffc0458000
 ...
 # capsh --drop=CAP_SYSLOG -- -c "cat /sys/module/*/sections/.*text"
 0xffffffffc0458000
 ...

After:

 # cat /sys/module/*/sections/*.text
 0xffffffffc0458000
 ...
 # capsh --drop=CAP_SYSLOG -- -c "cat /sys/module/*/sections/.*text"
 0x0000000000000000
 ...

Additionally replaces the existing (safe) /proc/modules check with
file->f_cred for consistency.

Reported-by: Dominik Czarnota <dominik.czarnota@trailofbits.com>
Fixes: be71eda538 ("module: Fix display of wrong module .text address")
Cc: stable@vger.kernel.org
Tested-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-08 16:00:19 -07:00
Kees Cook
ed66f991bb module: Refactor section attr into bin attribute
In order to gain access to the open file's f_cred for kallsym visibility
permission checks, refactor the module section attributes to use the
bin_attribute instead of attribute interface. Additionally removes the
redundant "name" struct member.

Cc: stable@vger.kernel.org
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-08 16:00:17 -07:00
Kees Cook
160251842c kallsyms: Refactor kallsyms_show_value() to take cred
In order to perform future tests against the cred saved during open(),
switch kallsyms_show_value() to operate on a cred, and have all current
callers pass current_cred(). This makes it very obvious where callers
are checking the wrong credential in their "read" contexts. These will
be fixed in the coming patches.

Additionally switch return value to bool, since it is always used as a
direct permission check, not a 0-on-success, negative-on-error style
function return.

Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-08 15:59:57 -07:00
Zhenzhong Duan
05eee619ed x86/kvm: Add "nopvspin" parameter to disable PV spinlocks
There are cases where a guest tries to switch spinlocks to bare metal
behavior (e.g. by setting "xen_nopvspin" on XEN platform and
"hv_nopvspin" on HYPER_V).

That feature is missed on KVM, add a new parameter "nopvspin" to disable
PV spinlocks for KVM guest.

The new 'nopvspin' parameter will also replace Xen and Hyper-V specific
parameters in future patches.

Define variable nopvsin as global because it will be used in future
patches as above.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08 16:21:57 -04:00
Phil Auld
9d246053a6 sched: Add a tracepoint to track rq->nr_running
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks
->nr_running CPU's rq. This is used to accurately trace this data and
provide a visualization of scheduler imbalances in, for example, the
form of a heat map.  The tracepoint is accessed by loading an external
kernel module. An example module (forked from Qais' module and including
the pelt related tracepoints) can be found at:

  https://github.com/auldp/tracepoints-helpers.git

A script to turn the trace-cmd report output into a heatmap plot can be
found at:

  https://github.com/jirvoz/plot-nr-running

The tracepoints are added to add_nr_running() and sub_nr_running() which
are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in
the header a wrapper call is used and the trace/events/sched.h include
is moved before sched.h in kernel/sched/core.

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200629192303.GC120228@lorien.usersys.redhat.com
2020-07-08 11:39:02 +02:00
Qais Yousef
46609ce227 sched/uclamp: Protect uclamp fast path code with static key
There is a report that when uclamp is enabled, a netperf UDP test
regresses compared to a kernel compiled without uclamp.

https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/

While investigating the root cause, there were no sign that the uclamp
code is doing anything particularly expensive but could suffer from bad
cache behavior under certain circumstances that are yet to be
understood.

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

To reduce the pressure on the fast path anyway, add a static key that is
by default will skip executing uclamp logic in the
enqueue/dequeue_task() fast path until it's needed.

As soon as the user start using util clamp by:

	1. Changing uclamp value of a task with sched_setattr()
	2. Modifying the default sysctl_sched_util_clamp_{min, max}
	3. Modifying the default cpu.uclamp.{min, max} value in cgroup

We flip the static key now that the user has opted to use util clamp.
Effectively re-introducing uclamp logic in the enqueue/dequeue_task()
fast path. It stays on from that point forward until the next reboot.

This should help minimize the effect of util clamp on workloads that
don't need it but still allow distros to ship their kernels with uclamp
compiled in by default.

SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since now we can end
up with unbalanced call to uclamp_rq_dec_id() if we flip the key while
a task is running in the rq. Since we know it is harmless we just
quietly return if we attempt a uclamp_rq_dec_id() when
rq->uclamp[].bucket[].tasks is 0.

In schedutil, we introduce a new uclamp_is_enabled() helper which takes
the static key into account to ensure RT boosting behavior is retained.

The following results demonstrates how this helps on 2 Sockets Xeon E5
2x10-Cores system.

                                   nouclamp                 uclamp      uclamp-static-key
Hmean     send-64         162.43 (   0.00%)      157.84 *  -2.82%*      163.39 *   0.59%*
Hmean     send-128        324.71 (   0.00%)      314.78 *  -3.06%*      326.18 *   0.45%*
Hmean     send-256        641.55 (   0.00%)      628.67 *  -2.01%*      648.12 *   1.02%*
Hmean     send-1024      2525.28 (   0.00%)     2448.26 *  -3.05%*     2543.73 *   0.73%*
Hmean     send-2048      4836.14 (   0.00%)     4712.08 *  -2.57%*     4867.69 *   0.65%*
Hmean     send-3312      7540.83 (   0.00%)     7425.45 *  -1.53%*     7621.06 *   1.06%*
Hmean     send-4096      9124.53 (   0.00%)     8948.82 *  -1.93%*     9276.25 *   1.66%*
Hmean     send-8192     15589.67 (   0.00%)    15486.35 *  -0.66%*    15819.98 *   1.48%*
Hmean     send-16384    26386.47 (   0.00%)    25752.25 *  -2.40%*    26773.74 *   1.47%*

The perf diff between nouclamp and uclamp-static-key when uclamp is
disabled in the fast path:

     8.73%     -1.55%  [kernel.kallsyms]        [k] try_to_wake_up
     0.07%     +0.04%  [kernel.kallsyms]        [k] deactivate_task
     0.13%     -0.02%  [kernel.kallsyms]        [k] activate_task

The diff between nouclamp and uclamp-static-key when uclamp is enabled
in the fast path:

     8.73%     -0.72%  [kernel.kallsyms]        [k] try_to_wake_up
     0.13%     +0.39%  [kernel.kallsyms]        [k] activate_task
     0.07%     +0.38%  [kernel.kallsyms]        [k] deactivate_task

Fixes: 69842cba9a ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-3-qais.yousef@arm.com
2020-07-08 11:39:01 +02:00
Qais Yousef
d81ae8aac8 sched/uclamp: Fix initialization of struct uclamp_rq
struct uclamp_rq was zeroed out entirely in assumption that in the first
call to uclamp_rq_inc() they'd be initialized correctly in accordance to
default settings.

But when next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because the
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
means all rqs are capped to 0 by default.

Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.

Fixes: 69842cba9a ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20200630112123.12076-2-qais.yousef@arm.com
2020-07-08 11:39:01 +02:00
Peter Zijlstra
85c2ce9104 sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9
For some mysterious reason GCC-4.9 has a 64 byte section alignment for
structures, all other GCC versions (and Clang) tested (including 4.8
and 5.0) are fine with the 32 bytes alignment.

Getting this right is important for the new SCHED_DATA macro that
creates an explicitly ordered array of 'struct sched_class' in the
linker script and expect pointer arithmetic to work.

Fixes: c3a340f7e7 ("sched: Have sched_class_highest define by vmlinux.lds.h")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200630144905.GX4817@hirez.programming.kicks-ass.net
2020-07-08 11:39:00 +02:00
Peter Zijlstra
faa2fd7cba Merge branch 'sched/urgent' 2020-07-08 11:38:59 +02:00
Kan Liang
5a09928d33 perf/x86: Remove task_ctx_size
A new kmem_cache method has replaced the kzalloc() to allocate the PMU
specific data. The task_ctx_size is not required anymore.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-19-git-send-email-kan.liang@linux.intel.com
2020-07-08 11:38:55 +02:00
Kan Liang
217c2a633e perf/core: Use kmem_cache to allocate the PMU specific data
Currently, the PMU specific data task_ctx_data is allocated by the
function kzalloc() in the perf generic code. When there is no specific
alignment requirement for the task_ctx_data, the method works well for
now. However, there will be a problem once a specific alignment
requirement is introduced in future features, e.g., the Architecture LBR
XSAVE feature requires 64-byte alignment. If the specific alignment
requirement is not fulfilled, the XSAVE family of instructions will fail
to save/restore the xstate to/from the task_ctx_data.

The function kzalloc() itself only guarantees a natural alignment. A
new method to allocate the task_ctx_data has to be introduced, which
has to meet the requirements as below:
- must be a generic method can be used by different architectures,
  because the allocation of the task_ctx_data is implemented in the
  perf generic code;
- must be an alignment-guarantee method (The alignment requirement is
  not changed after the boot);
- must be able to allocate/free a buffer (smaller than a page size)
  dynamically;
- should not cause extra CPU overhead or space overhead.

Several options were considered as below:
- One option is to allocate a larger buffer for task_ctx_data. E.g.,
    ptr = kmalloc(size + alignment, GFP_KERNEL);
    ptr &= ~(alignment - 1);
  This option causes space overhead.
- Another option is to allocate the task_ctx_data in the PMU specific
  code. To do so, several function pointers have to be added. As a
  result, both the generic structure and the PMU specific structure
  will become bigger. Besides, extra function calls are added when
  allocating/freeing the buffer. This option will increase both the
  space overhead and CPU overhead.
- The third option is to use a kmem_cache to allocate a buffer for the
  task_ctx_data. The kmem_cache can be created with a specific alignment
  requirement by the PMU at boot time. A new pointer for kmem_cache has
  to be added in the generic struct pmu, which would be used to
  dynamically allocate a buffer for the task_ctx_data at run time.
  Although the new pointer is added to the struct pmu, the existing
  variable task_ctx_size is not required anymore. The size of the
  generic structure is kept the same.

The third option which meets all the aforementioned requirements is used
to replace kzalloc() for the PMU specific data allocation. A later patch
will remove the kzalloc() method and the related variables.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-17-git-send-email-kan.liang@linux.intel.com
2020-07-08 11:38:55 +02:00
Kan Liang
ff9ff92688 perf/core: Factor out functions to allocate/free the task_ctx_data
The method to allocate/free the task_ctx_data is going to be changed in
the following patch. Currently, the task_ctx_data is allocated/freed in
several different places. To avoid repeatedly modifying the same codes
in several different places, alloc_task_ctx_data() and
free_task_ctx_data() are factored out to allocate/free the
task_ctx_data. The modification only needs to be applied once.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1593780569-62993-16-git-send-email-kan.liang@linux.intel.com
2020-07-08 11:38:54 +02:00
Mathieu Desnoyers
ce3614daab sched: Fix unreliable rseq cpu_id for new tasks
While integrating rseq into glibc and replacing glibc's sched_getcpu
implementation with rseq, glibc's tests discovered an issue with
incorrect __rseq_abi.cpu_id field value right after the first time
a newly created process issues sched_setaffinity.

For the records, it triggers after building glibc and running tests, and
then issuing:

  for x in {1..2000} ; do posix/tst-affinity-static  & done

and shows up as:

error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0

This is caused by the scheduler invoking __set_task_cpu() directly from
sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
is done by set_task_cpu().

Add the missing rseq_migrate() to both functions. The only other direct
use of __set_task_cpu() is done by init_idle(), which does not involve a
user-space task.

Based on my testing with the glibc test-case, just adding rseq_migrate()
to wake_up_new_task() is sufficient to fix the observed issue. Also add
it to sched_fork() to keep things consistent.

The reason why this never triggered so far with the rseq/basic_test
selftest is unclear.

The current use of sched_getcpu(3) does not typically require it to be
always accurate. However, use of the __rseq_abi.cpu_id field within rseq
critical sections requires it to be accurate. If it is not accurate, it
can cause corruption in the per-cpu data targeted by rseq critical
sections in user-space.

Reported-By: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-By: Florian Weimer <fweimer@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com
2020-07-08 11:38:50 +02:00
Peter Zijlstra
dbfb089d36 sched: Fix loadavg accounting race
The recent commit:

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

moved these lines in ttwu():

	p->sched_contributes_to_load = !!task_contributes_to_load(p);
	p->state = TASK_WAKING;

up before:

	smp_cond_load_acquire(&p->on_cpu, !VAL);

into the 'p->on_rq == 0' block, with the thinking that once we hit
schedule() the current task cannot change it's ->state anymore. And
while this is true, it is both incorrect and flawed.

It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0'
to avoid weak hardware from re-ordering things for us. This can fairly
easily be achieved by relying on the control-dependency already in
place.

The second problem, which makes the flaw in the original argument, is
that while schedule() will not change prev->state, it will read it a
number of times (arguably too many times since it's marked volatile).
The previous condition 'p->on_cpu == 0' was sufficient because that
indicates schedule() has completed, and will no longer read
prev->state. So now the trick is to make this same true for the (much)
earlier 'prev->on_rq == 0' case.

Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0'
assignment needs to he a RELEASE, but adding additional ordering to
schedule() is an unwelcome proposition at the best of times, doubly so
for mere accounting.

Luckily we can push the prev->state load up before rq->lock, with the
only caveat that we then have to re-read the state after. However, we
know that if it changed, we no longer have to worry about the blocking
path. This gives us the required ordering, if we block, we did the
prev->state load before an (effective) smp_mb() and the p->on_rq store
needs not change.

With this we end up with the effective ordering:

	LOAD p->state           LOAD-ACQUIRE p->on_rq == 0
	MB
	STORE p->on_rq, 0       STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the prev->state
load, and all is well again.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: https://lkml.kernel.org/r/20200707102957.GN117543@hirez.programming.kicks-ass.net
2020-07-08 11:38:49 +02:00
Christian Brauner
76c12881a3
nsproxy: support CLONE_NEWTIME with setns()
So far setns() was missing time namespace support. This was partially due
to it simply not being implemented but also because vdso_join_timens()
could still fail which made switching to multiple namespaces atomically
problematic. This is now fixed so support CLONE_NEWTIME with setns()

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Andrei Vagin <avagin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Dmitry Safonov <dima@arista.com>
Link: https://lore.kernel.org/r/20200706154912.3248030-4-christian.brauner@ubuntu.com
2020-07-08 11:14:22 +02:00
Christian Brauner
5cfea9a106
timens: add timens_commit() helper
Wrap the calls to timens_set_vvar_page() and vdso_join_timens() in
timens_on_fork() and timens_install() in a new timens_commit() helper.
We'll use this helper in a follow-up patch in nsproxy too.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Andrei Vagin <avagin@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20200706154912.3248030-3-christian.brauner@ubuntu.com
2020-07-08 11:14:21 +02:00
Christian Brauner
42815808f1
timens: make vdso_join_timens() always succeed
As discussed on-list (cf. [1]), in order to make setns() support time
namespaces when attaching to multiple namespaces at once properly we
need to tweak vdso_join_timens() to always succeed. So switch
vdso_join_timens() to using a read lock and replacing
mmap_write_lock_killable() to mmap_read_lock() as we discussed.

Last cycle setns() was changed to support attaching to multiple namespaces
atomically. This requires all namespaces to have a point of no return where
they can't fail anymore. Specifically, <namespace-type>_install() is
allowed to perform permission checks and install the namespace into the new
struct nsset that it has been given but it is not allowed to make visible
changes to the affected task. Once <namespace-type>_install() returns
anything that the given namespace type requires to be setup in addition
needs to ideally be done in a function that can't fail or if it fails the
failure is not fatal. For time namespaces the relevant functions that fall
into this category are timens_set_vvar_page() and vdso_join_timens().
Currently the latter can fail but doesn't need to. With this we can go on
to implement a timens_commit() helper in a follow up patch to be used by
setns().

[1]: https://lore.kernel.org/lkml/20200611110221.pgd3r5qkjrjmfqa2@wittgenstein

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Andrei Vagin <avagin@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: https://lore.kernel.org/r/20200706154912.3248030-2-christian.brauner@ubuntu.com
2020-07-08 11:14:21 +02:00
Stanislav Fomichev
f5836749c9 bpf: Add BPF_CGROUP_INET_SOCK_RELEASE hook
Sometimes it's handy to know when the socket gets freed. In
particular, we'd like to try to use a smarter allocation of
ports for bpf_bind and explore the possibility of limiting
the number of SOCK_DGRAM sockets the process can have.

Implement BPF_CGROUP_INET_SOCK_RELEASE hook that triggers on
inet socket release. It triggers only for userspace sockets
(not in-kernel ones) and therefore has the same semantics as
the existing BPF_CGROUP_INET_SOCK_CREATE.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200706230128.4073544-2-sdf@google.com
2020-07-08 01:03:31 +02:00
Cong Wang
ad0f75e5f5 cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.

This bug seems to be introduced since the beginning, commit
d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-07 13:34:11 -07:00
Eric W. Biederman
33c326014f umd: Stop using split_argv
There is exactly one argument so there is nothing to split.  All
split_argv does now is cause confusion and avoid the need for a cast
when passing a "const char *" string to call_usermodehelper_setup.

So avoid confusion and the possibility of an odd driver name causing
problems by just using a fixed argv array with a cast in the call to
call_usermodehelper_setup.

v1: https://lkml.kernel.org/r/87sged3a9n.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-16-ebiederm@xmission.com
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-07 11:58:59 -05:00
Eric W. Biederman
8c2f526639 umd: Remove exit_umh
The bpfilter code no longer uses the umd_info.cleanup callback.  This
callback is what exit_umh exists to call.  So remove exit_umh and all
of it's associated booking.

v1: https://lkml.kernel.org/r/87bll6dlte.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87y2o53abg.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-15-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-07 11:58:59 -05:00
Eric W. Biederman
38fd525a4c exit: Factor thread_group_exited out of pidfd_poll
Create an independent helper thread_group_exited which returns true
when all threads have passed exit_notify in do_exit.  AKA all of the
threads are at least zombies and might be dead or completely gone.

Create this helper by taking the logic out of pidfd_poll where it is
already tested, and adding a READ_ONCE on the read of
task->exit_state.

I will be changing the user mode driver code to use this same logic
to know when a user mode driver needs to be restarted.

Place the new helper thread_group_exited in kernel/exit.c and
EXPORT it so it can be used by modules.

Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-07 11:58:17 -05:00
Linus Torvalds
5465a324af A single fix for a printk format warning in RCU.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8B8gcTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXDoEADCdEVs/qLktFXrN17i6Oeju7h6oQ2m
 8iI1GDStY5zj4Jk6FdQB864XXHbkNO7C096IqJOZud66k+lj3sEk9lE24tpgZqi/
 gHVCueUoKF5ZYNyEtPkSDzHcr/IJg3iueQyShTbGotvGbF/gBAWJtuIq3sVpaD+Q
 qvZYASQMkBNrRcEgxzaTr286MJ4lIQ61ujwRWQJV4woQgAqjeTrOKQ+qOoKCZVfB
 c1glieDNLwZvs/534zsBLRj7ApvuJ2SyHXhfC9byIitUb1RdZ/1gAkiteX/K6ici
 PXoPamBsd+gSEdfWN69HB+cWqPqJ8Gq8M2zcmp3KSrg4IrXTVrnYHmymH3tN5Vbe
 p3my9/rH/yDv1kgcRgOlgL7ykz6W2oCr7LrTrQ7fupOXrR7UrW0dSsEcFRbWwoBn
 7dlfdEI4Q7ay9GPN2f7QOiaGGE+Bi76iCXTjRTFzcEQHiwO6W1bLoSu8qtncYvke
 2PaDrE4V/2CWjOuE37mw3IPsjUEOJNKC2+H3y+J9ma94CX4kAuZzH2LS4oHO8ww1
 eyF1HSHOKh1tuY9RhNnsyh+1V2Iao6T0BjUMnG/c01xFeEz+lE+e1JRdTSxa5BOr
 zKSSuEei7z6m9t7Yn2DSU8YaKn8ZbF8JpfC5nGFZTMBh64EOEaWGhQUr7ttuFQxi
 7JF6WVarhn5FHw==
 =p6ry
 -----END PGP SIGNATURE-----

Merge tag 'core-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull rcu fixlet from Thomas Gleixner:
 "A single fix for a printk format warning in RCU"

* tag 'core-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcuperf: Fix printk format warning
2020-07-05 12:21:28 -07:00
David S. Miller
f91c031e65 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-07-04

The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 17 day(s) which contain
a total of 106 files changed, 5233 insertions(+), 1283 deletions(-).

The main changes are:

1) bpftool ability to show PIDs of processes having open file descriptors
   for BPF map/program/link/BTF objects, relying on BPF iterator progs
   to extract this info efficiently, from Andrii Nakryiko.

2) Addition of BPF iterator progs for dumping TCP and UDP sockets to
   seq_files, from Yonghong Song.

3) Support access to BPF map fields in struct bpf_map from programs
   through BTF struct access, from Andrey Ignatov.

4) Add a bpf_get_task_stack() helper to be able to dump /proc/*/stack
   via seq_file from BPF iterator progs, from Song Liu.

5) Make SO_KEEPALIVE and related options available to bpf_setsockopt()
   helper, from Dmitry Yakunin.

6) Optimize BPF sk_storage selection of its caching index, from Martin
   KaFai Lau.

7) Removal of redundant synchronize_rcu()s from BPF map destruction which
   has been a historic leftover, from Alexei Starovoitov.

8) Several improvements to test_progs to make it easier to create a shell
   loop that invokes each test individually which is useful for some CIs,
   from Jesper Dangaard Brouer.

9) Fix bpftool prog dump segfault when compiled without skeleton code on
   older clang versions, from John Fastabend.

10) Bunch of cleanups and minor improvements, from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-04 17:48:34 -07:00
Christian Brauner
714acdbd1c
arch: rename copy_thread_tls() back to copy_thread()
Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
back simply copy_thread(). It's a simpler name, and doesn't imply that only
tls is copied here. This finishes an outstanding chunk of internal process
creation work since we've added clone3().

Cc: linux-arch@vger.kernel.org
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>A
Acked-by: Stafford Horne <shorne@gmail.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>A
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-04 23:41:37 +02:00
Christian Brauner
140c8180eb
arch: remove HAVE_COPY_THREAD_TLS
All architectures support copy_thread_tls() now, so remove the legacy
copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
uses the same process creation calling convention based on
copy_thread_tls() and struct kernel_clone_args. This will make it easier to
maintain the core process creation code under kernel/, simplifies the
callpaths and makes the identical for all architectures.

Cc: linux-arch@vger.kernel.org
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Greentime Hu <green.hu@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-04 23:41:37 +02:00
Christian Brauner
ff2a91127b
fork: remove do_fork()
Now that all architectures have been switched to use _do_fork() and the new
struct kernel_clone_args calling convention we can remove the legacy
do_fork() helper completely. The calling convention used to be brittle and
do_fork() didn't buy us anything. The only calling convention accepted
should be based on struct kernel_clone_args going forward. It's cleaner and
uniform.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-04 23:41:36 +02:00
Eric W. Biederman
1c340ead18 umd: Track user space drivers with struct pid
Use struct pid instead of user space pid values that are prone to wrap
araound.

In addition track the entire thread group instead of just the first
thread that is started by exec.  There are no multi-threaded user mode
drivers today but there is nothing preclucing user drivers from being
multi-threaded, so it is just a good idea to track the entire process.

Take a reference count on the tgid's in question to make it possible
to remove exit_umh in a future change.

As a struct pid is available directly use kill_pid_info.

The prior process signalling code was iffy in using a userspace pid
known to be in the initial pid namespace and then looking up it's task
in whatever the current pid namespace is.  It worked only because
kernel threads always run in the initial pid namespace.

As the tgid is now refcounted verify the tgid is NULL at the start of
fork_usermode_driver to avoid the possibility of silent pid leaks.

v1: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/a70l4oy8.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-12-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:56 -05:00
Eric W. Biederman
55e6074e3f umh: Stop calling do_execve_file
With the user mode driver code changed to not set subprocess_info.file
there are no more users of subproces_info.file.  Remove this field
from struct subprocess_info and remove the only user in
call_usermodehelper_exec_async that would call do_execve_file instead
of do_execve if file was set.

v1: https://lkml.kernel.org/r/877dvuf0i7.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87r1tx4p2a.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-9-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:36 -05:00
Eric W. Biederman
e2dc9bf3f5 umd: Transform fork_usermode_blob into fork_usermode_driver
Instead of loading a binary blob into a temporary file with
shmem_kernel_file_setup load a binary blob into a temporary tmpfs
filesystem.  This means that the blob can be stored in an init section
and discared, and it means the binary blob will have a filename so can
be executed normally.

The only tricky thing about this code is that in the helper function
blob_to_mnt __fput_sync is used.  That is because a file can not be
executed if it is still open for write, and the ordinary delayed close
for kernel threads does not happen soon enough, which causes the
following exec to fail.  The function umd_load_blob is not called with
any locks so this should be safe.

Executing the blob normally winds up correcting several problems with
the user mode driver code discovered by Tetsuo Handa[1].  By passing
an ordinary filename into the exec, it is no longer necessary to
figure out how to turn a O_RDWR file descriptor into a properly
referende counted O_EXEC file descriptor that forbids all writes.  For
path based LSMs there are no new special cases.

[1] https://lore.kernel.org/linux-fsdevel/2a8775b4-1dd5-9d5c-aa42-9872445e0942@i-love.sakura.ne.jp/
v1: https://lkml.kernel.org/r/87d05mf0j9.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87wo3p4p35.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-8-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:29 -05:00
Eric W. Biederman
1199c6c3da umd: Rename umd_info.cmdline umd_info.driver_name
The only thing supplied in the cmdline today is the driver name so
rename the field to clarify the code.

As this value is always supplied stop trying to handle the case of
a NULL cmdline.

Additionally since we now have a name we can count on use the
driver_name any place where the code is looking for a name
of the binary.

v1: https://lkml.kernel.org/r/87imfef0k3.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87366d63os.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-7-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:35:13 -05:00
Eric W. Biederman
74be2d3b80 umd: For clarity rename umh_info umd_info
This structure is only used for user mode drivers so change
the prefix from umh to umd to make that clear.

v1: https://lkml.kernel.org/r/87o8p6f0kw.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/878sg563po.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-6-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:34:39 -05:00
Eric W. Biederman
884c5e683b umh: Separate the user mode driver and the user mode helper support
This makes it clear which code is part of the core user mode
helper support and which code is needed to implement user mode
drivers.

This makes the kernel smaller for everyone who does not use a usermode
driver.

v1: https://lkml.kernel.org/r/87tuyyf0ln.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87imf963s6.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-5-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:34:32 -05:00
Eric W. Biederman
21d5982806 umh: Remove call_usermodehelper_setup_file.
The only caller of call_usermodehelper_setup_file is fork_usermode_blob.
In fork_usermode_blob replace call_usermodehelper_setup_file with
call_usermodehelper_setup and delete fork_usermodehelper_setup_file.

For this to work the argv_free is moved from umh_clean_and_save_pid
to fork_usermode_blob.

v1: https://lkml.kernel.org/r/87zh8qf0mp.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87o8p163u1.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-4-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:34:26 -05:00
Eric W. Biederman
3a171042ae umh: Rename the user mode driver helpers for clarity
Now that the functionality of umh_setup_pipe and
umh_clean_and_save_pid has changed their names are too specific and
don't make much sense.  Instead name them  umd_setup and umd_cleanup
for the functional role in setting up user mode drivers.

v1: https://lkml.kernel.org/r/875zbegf82.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87tuyt63x3.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-3-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:34:18 -05:00
Eric W. Biederman
b044fa2ae5 umh: Move setting PF_UMH into umh_pipe_setup
I am separating the code specific to user mode drivers from the code
for ordinary user space helpers.  Move setting of PF_UMH from
call_usermodehelper_exec_async which is core user mode helper code
into umh_pipe_setup which is user mode driver code.

The code is equally as easy to write in one location as the other and
the movement minimizes the impact of the user mode driver code on the
core of the user mode helper code.

Setting PF_UMH unconditionally is harmless as an action will only
happen if it is paired with an entry on umh_list.

v1: https://lkml.kernel.org/r/87bll6gf8t.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/87zh8l63xs.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-2-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:34:06 -05:00
Eric W. Biederman
5fec25f2cb umh: Capture the pid in umh_pipe_setup
The pid in struct subprocess_info is only used by umh_clean_and_save_pid to
write the pid into umh_info.

Instead always capture the pid on struct umh_info in umh_pipe_setup, removing
code that is specific to user mode drivers from the common user path of
user mode helpers.

v1: https://lkml.kernel.org/r/87h7uygf9i.fsf_-_@x220.int.ebiederm.org
v2: https://lkml.kernel.org/r/875zb97iix.fsf_-_@x220.int.ebiederm.org
Link: https://lkml.kernel.org/r/20200702164140.4468-1-ebiederm@xmission.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-07-04 09:33:40 -05:00
Valentin Schneider
8fa88a88d5 genirq: Remove preflow handler support
That was put in place for sparc64, and blackfin also used it for some time;
sparc64 no longer uses those, and blackfin is dead.

As there are no more users, remove preflow handlers.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200703155645.29703-3-valentin.schneider@arm.com
2020-07-04 10:02:06 +02:00
Christoph Hellwig
a3a66c3822 vmalloc: fix the owner argument for the new __vmalloc_node_range callers
Fix the recently added new __vmalloc_node_range callers to pass the
correct values as the owner for display in /proc/vmallocinfo.

Fixes: 800e26b813 ("x86/hyperv: allocate the hypercall page with only read and execute bits")
Fixes: 10d5e97c1b ("arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page")
Fixes: 7a0e27b2a0 ("mm: remove vmalloc_exec")
Reported-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200627075649.2455097-1-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-03 16:15:25 -07:00
Linus Torvalds
45564bcd57 for-linus-2020-07-02
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXv20ggAKCRCRxhvAZXjc
 osYAAQD4Lp5ORPJv5jtYScxRM1RiUh8hp2MBle+4iYCAQX/UowD/QtMPtiLWaMIw
 6bMMBAj1gVBmlIOmu7DhBs8hVtKugQY=
 =kNpi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull data race annotation from Christian Brauner:
 "This contains an annotation patch for a data race in copy_process()
  reported by KCSAN when reading and writing nr_threads.

  The data race is intentional and benign. This is obvious from the
  comment above the relevant code and based on general consensus when
  discussing this issue. So simply using data_race() to annotate this as
  an intentional race seems the best option"

* tag 'for-linus-2020-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fork: annotate data race in copy_process()
2020-07-02 22:40:06 -07:00
Song Liu
046cc3dd9a bpf: Fix build without CONFIG_STACKTRACE
Without CONFIG_STACKTRACE stack_trace_save_tsk() is not defined. Let
get_callchain_entry_for_task() to always return NULL in such cases.

Fixes: fa28dcb82a ("bpf: Introduce helper bpf_get_task_stack()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200703024537.79971-1-songliubraving@fb.com
2020-07-02 20:21:56 -07:00
Linus Torvalds
c93493b7cd io_uring-5.8-2020-07-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl79YU0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplHKD/9rgv0c1I7dCh6MgQKxT+2z/eZcaPO3PekW
 sbn8yC8RiSIL85Av1zEfC1wAp+Mp21QlFKXFiZ6BJj5bdDbbshLk0WdbnxvuM+9I
 gyngTI/+em5D/WCcetAkPjnMTDq0m4l0UXd91fyNAeErmYZbvhL5dXihZsBJ3T9c
 Bprn4RzWwrUsUwGn8qIEZhx2UovMrzXJHGFxWXh/81YHkh7Y4mjvATKxtECIliW/
 +QQJDU7Tf3gZw+ETPIDOEB9Hl9c9W+9fcWWzmrXzViUyy54IMbF4qyJpWcGaRh6c
 sO3apymwu7wwAUbQcE8IWr3ZLZDtw68AgUdZ5b/T0c2fEwqsI/UDMhBbELiuqcT0
 MAoQdUSNNqZTti0PX5vg5CQlCFzjnl2uIwHF6LVSbrqgyqxiC3Qrus/FYSaf3x9h
 bAmNgWC9DeKp/wtEKMuBXaOm7RjrEutD5hjJYfVK/AkvKTZyZDx3vZ9FRH8WtrII
 7KhUI3DPSZCeWlcpDtK+0fEqtqTw6OtCQ8U5vKSnJjoRSXLUtuk6IYbp/tqNxwe/
 0d+U6R+w513jVlXARUP48mV7tzpESp2MLP6Nd2Is/OD5tePWzQEZinpKzsFP4djH
 d2PT5FFGPCw9yBk03sI1Je/CFqVYwCGqav6h8dKKVBanMjoEdL4U1PMhI48Zua+9
 M8pqRHoeDA==
 =4lvI
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "One fix in here, for a regression in 5.7 where a task is waiting in
  the kernel for a condition, but that condition won't become true until
  task_work is run. And the task_work can't be run exactly because the
  task is waiting in the kernel, so we'll never make any progress.

  One example of that is registering an eventfd and queueing io_uring
  work, and then the task goes and waits in eventfd read with the
  expectation that it'll get woken (and read an event) when the io_uring
  request completes. The io_uring request is finished through task_work,
  which won't get run while the task is looping in eventfd read"

* tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block:
  io_uring: use signal based task_work running
  task_work: teach task_work_add() to do signal_wake_up()
2020-07-02 14:56:22 -07:00
Bhupesh Sharma
1d50e5d0c5 crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo
Right now user-space tools like 'makedumpfile' and 'crash' need to rely
on a best-guess method of determining value of 'MAX_PHYSMEM_BITS'
supported by underlying kernel.

This value is used in user-space code to calculate the bit-space
required to store a section for SPARESMEM (similar to the existing
calculation method used in the kernel implementation):

  #define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

Now, regressions have been reported in user-space utilities
like 'makedumpfile' and 'crash' on arm64, with the recently added
kernel support for 52-bit physical address space, as there is
no clear method of determining this value in user-space
(other than reading kernel CONFIG flags).

As per suggestion from makedumpfile maintainer (Kazu), it makes more
sense to append 'MAX_PHYSMEM_BITS' to vmcoreinfo in the core code itself
rather than in arch-specific code, so that the user-space code for other
archs can also benefit from this addition to the vmcoreinfo and use it
as a standard way of determining 'SECTIONS_SHIFT' value in user-land.

A reference 'makedumpfile' implementation which reads the
'MAX_PHYSMEM_BITS' value from vmcoreinfo in a arch-independent fashion
is available here:

While at it also update vmcoreinfo documentation for 'MAX_PHYSMEM_BITS'
variable being added to vmcoreinfo.

'MAX_PHYSMEM_BITS' defines the maximum supported physical address
space memory.

Signed-off-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: John Donnelly <john.p.donnelly@oracle.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: Boris Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: James Morse <james.morse@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: kexec@lists.infradead.org
Link: https://lore.kernel.org/r/1589395957-24628-2-git-send-email-bhsharma@redhat.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2020-07-02 17:56:11 +01:00
Peter Zijlstra
78c2141b65 Merge branch 'perf/vlbr' 2020-07-02 15:51:48 +02:00
Quentin Perret
10dd8573b0 cpufreq: Register governors at core_initcall
Currently, most CPUFreq governors are registered at the core_initcall
time when the given governor is the default one, and the module_init
time otherwise.

In preparation for letting users specify the default governor on the
kernel command line, change all of them to be registered at the
core_initcall unconditionally, as it is already the case for the
schedutil and performance governors. This will allow us to assume
that builtin governors have been registered before the built-in
CPUFreq drivers probe.

And since all governors have similar init/exit patterns now, introduce
two new macros, cpufreq_governor_{init,exit}(), to factorize the code.

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-07-02 13:03:30 +02:00
Steven Rostedt (VMware)
29ce24519c ring-buffer: Do not trigger a WARN if clock going backwards is detected
After tweaking the ring buffer to be a bit faster, a warning is triggering
on one of my machines, and causing my tests to fail. This warning is caused
when the delta (current time stamp minus previous time stamp), is larger
than the max time held by the ring buffer (59 bits).

If the clock were to go backwards slightly, this would then easily trigger
this warning. The machine that it triggered on, the clock did go backwards
by around 450 nanoseconds, and this happened after a recalibration of the
TSC clock. Now that the ring buffer is faster, it detects this, and the
delta that is used larger than the max, the warning is triggered and my test
fails.

To handle the clock going backwards, look at the saved before and after time
stamps. If they are the same, it means that the current event did not
interrupt another event, and that those timestamp are of a previous event
that was recorded. If the max delta is triggered, look at those time stamps,
make sure they are the same, then use them to compare with the current
timestamp. If the current timestamp is less than the before/after time
stamps, then that means the clock being used went backward.

Print out a message that this has happened, but do not warn about it (and
only print the message once).

Still do the warning if the delta is indeed larger than what can be used.

Also remove the unneeded KERN_WARNING from the WARN_ONCE() print.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)
bbeba3e58f ring-buffer: Call trace_clock_local() directly for RETPOLINE kernels
After doing some benchmarks and examining the code, I found that the ring
buffer clock calls were quite expensive, and noticed that it uses
retpolines. This is because the ring buffer clock is programmable, and can
be set. But in most cases it simply uses the fastest ns unit clock which is
the trace_clock_local(). For RETPOLINE builds, checking if the ring buffer
clock is set to trace_clock_local() and then calling it directly has brought
the time of an event on my i7 box from an average of 93 nanoseconds an event
down to 83 nanoseconds an event, and the minimum time from 81 nanoseconds to
68 nanoseconds!

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:07 -04:00
Steven Rostedt (VMware)
74e879373b ring-buffer: Move the add_timestamp into its own function
Make a helper function rb_add_timestamp() that moves the adding of the
extended time stamps into its own function. Also, remove the noinline and
inline for the functions it calls, as recent benchmarks appear they do not
make a difference (just let gcc decide).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:12:06 -04:00
Steven Rostedt (VMware)
58fbc3c632 ring-buffer: Consolidate add_timestamp to remove some branches
Reorganize a little the logic to handle adding the absolute time stamp,
extended and forced time stamps, in such a way to remove a branch or two.
This is just a micro optimization.

Also add before and after time stamps to the rb_event_info structure to
display those values in the rb_check_timestamps() code, if something were to
go wrong.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-07-01 22:11:22 -04:00
Song Liu
2df6bb5493 bpf: Allow %pB in bpf_seq_printf() and bpf_trace_printk()
This makes it easy to dump stack trace in text.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630062846.664389-4-songliubraving@fb.com
2020-07-01 08:23:59 -07:00
Song Liu
fa28dcb82a bpf: Introduce helper bpf_get_task_stack()
Introduce helper bpf_get_task_stack(), which dumps stack trace of given
task. This is different to bpf_get_stack(), which gets stack track of
current task. One potential use case of bpf_get_task_stack() is to call
it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.

bpf_get_task_stack() uses stack_trace_save_tsk() instead of
get_perf_callchain() for kernel stack. The benefit of this choice is that
stack_trace_save_tsk() doesn't require changes in arch/. The downside of
using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
stack trace to unsigned long array. For 32-bit systems, we need to
translate it to u64 array.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630062846.664389-3-songliubraving@fb.com
2020-07-01 08:23:19 -07:00
Song Liu
d141b8bc57 perf: Expose get/put_callchain_entry()
Sanitize and expose get/put_callchain_entry(). This would be used by bpf
stack map.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630062846.664389-2-songliubraving@fb.com
2020-07-01 08:22:08 -07:00
Alexei Starovoitov
bba1dc0b55 bpf: Remove redundant synchronize_rcu.
bpf_free_used_maps() or close(map_fd) will trigger map_free callback.
bpf_free_used_maps() is called after bpf prog is no longer executing:
bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps.
Hence there is no need to call synchronize_rcu() to protect map elements.

Note that hash_of_maps and array_of_maps update/delete inner maps via
sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu().

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com
2020-07-01 08:07:13 -07:00
Mike Rapoport
fb37409a01 arch: remove unicore32 port
The unicore32 port do not seem maintained for a long time now, there is no
upstream toolchain that can create unicore32 binaries and all the links to
prebuilt toolchains for unicore32 are dead. Even compilers that were
available are not supported by the kernel anymore.

Guenter Roeck says:

  I have stopped building unicore32 images since v4.19 since there is no
  available compiler that is still supported by the kernel. I am surprised
  that support for it has not been removed from the kernel.

Remove unicore32 port.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Guenter Roeck <linux@roeck-us.net>
2020-07-01 12:09:13 +03:00
David S. Miller
e708e2bd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-06-30

The following pull-request contains BPF updates for your *net* tree.

We've added 28 non-merge commits during the last 9 day(s) which contain
a total of 35 files changed, 486 insertions(+), 232 deletions(-).

The main changes are:

1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer
   types, from Yonghong Song.

2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various
   arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki.

3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb
   internals and integrate it properly into DMA core, from Christoph Hellwig.

4) Fix RCU splat from recent changes to avoid skipping ingress policy when
   kTLS is enabled, from John Fastabend.

5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its
   position masking to work, from Andrii Nakryiko.

6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading
   of network programs, from Maciej Żenczykowski.

7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer.

8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-30 14:20:45 -07:00
Steven Rostedt (VMware)
75b21c6dfa ring-buffer: Mark the !tail (crossing a page) as unlikely
It is the uncommon case where an event crosses a sub buffer boundary (page)
mark that check at the end of reserving an event as unlikely.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:56 -04:00
Nicholas Piggin
b23d7a5f4a ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU
On a 144 thread system, `perf ftrace` takes about 20 seconds to start
up, due to calling synchronize_rcu() for each CPU.

  cat /proc/108560/stack
    0xc0003e7eb336f470
    __switch_to+0x2e0/0x480
    __wait_rcu_gp+0x20c/0x220
    synchronize_rcu+0x9c/0xc0
    ring_buffer_reset_cpu+0x88/0x2e0
    tracing_reset_online_cpus+0x84/0xe0
    tracing_open+0x1d4/0x1f0

On a system with 10x more threads, it starts to become an annoyance.

Batch these up so we disable all the per-cpu buffers first, then
synchronize_rcu() once, then reset each of the buffers. This brings
the time down to about 0.5s.

Link: https://lkml.kernel.org/r/20200625053403.2386972-1-npiggin@gmail.com

Tested-by: Anton Blanchard <anton@ozlabs.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:56 -04:00
Steven Rostedt (VMware)
10464b4aa6 ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit
After a discussion with the new time algorithm to have nested events still
have proper time keeping but required using local64_t atomic operations.
Mathieu was concerned about the performance this would have on 32 bit
machines, as in most cases, atomic 64 bit operations on them can be
expensive.

As the ring buffer's timing needs do not require full features of local64_t,
a wrapper is made to implement a new rb_time_t operation that uses two longs
on 32 bit machines but still uses the local64_t operations on 64 bit
machines. There's a switch that can be made in the file to force 64 bit to
use the 32 bit version just for testing purposes.

All reads do not need to succeed if a read happened while the stamp being
read is in the process of being updated. The requirement is that all reads
must succed that were done by an interrupting event (where this event was
interrupted by another event that did the write). Or if the event itself did
the write first. That is: rb_time_set(t, x) followed by rb_time_read(t) will
always succeed (even if it gets interrupted by another event that writes to
t. The result of the read will be either the previous set, or a set
performed by an interrupting event.

If the read is done by an event that interrupted another event that was in
the process of setting the time stamp, and no other event came along to
write to that time stamp, it will fail and the rb_time_read() will return
that it failed (the value to read will be undefined).

A set will always write to the time stamp and return with a valid time
stamp, such that any read after it will be valid.

A cmpxchg may fail if it interrupted an event that was in the process of
updating the time stamp just like the reads do. Other than that, it will act
like a normal cmpxchg.

The way this works is that the rb_time_t is made of of three fields. A cnt,
that gets updated atomically everyting a modification is made. A top that
represents the most significant 30 bits of the time, and a bottom to
represent the least significant 30 bits of the time. Notice, that the time
values is only 60 bits long (where the ring buffer only uses 59 bits, which
gives us 18 years of nanoseconds!).

The top two bits of both the top and bottom is a 2 bit counter that gets set
by the value of the least two significant bits of the cnt. A read of the top
and the bottom where both the top and bottom have the same most significant
top 2 bits, are considered a match and a valid 60 bit number can be created
from it. If they do not match, then the number is considered invalid, and
this must only happen if an event interrupted another event in the midst of
updating the time stamp.

This is only used for 32 bits machines as 64 bit machines can get better
performance out of the local64_t. This has been tested heavily by forcing 64
bit to use this logic.

Link: https://lore.kernel.org/r/20200625225345.18cf5881@oasis.local.home
Link: http://lkml.kernel.org/r/20200629025259.309232719@goodmis.org

Inspired-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 17:18:51 -04:00
Yonghong Song
01c66c48d4 bpf: Fix an incorrect branch elimination by verifier
Wenbo reported an issue in [1] where a checking of null
pointer is evaluated as always false. In this particular
case, the program type is tp_btf and the pointer to
compare is a PTR_TO_BTF_ID.

The current verifier considers PTR_TO_BTF_ID always
reprents a non-null pointer, hence all PTR_TO_BTF_ID compares
to 0 will be evaluated as always not-equal, which resulted
in the branch elimination.

For example,
 struct bpf_fentry_test_t {
     struct bpf_fentry_test_t *a;
 };
 int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
 {
     if (arg == 0)
         test7_result = 1;
     return 0;
 }
 int BPF_PROG(test8, struct bpf_fentry_test_t *arg)
 {
     if (arg->a == 0)
         test8_result = 1;
     return 0;
 }

In above bpf programs, both branch arg == 0 and arg->a == 0
are removed. This may not be what developer expected.

The bug is introduced by Commit cac616db39 ("bpf: Verifier
track null pointer branch_taken with JNE and JEQ"),
where PTR_TO_BTF_ID is considered to be non-null when evaluting
pointer vs. scalar comparison. This may be added
considering we have PTR_TO_BTF_ID_OR_NULL in the verifier
as well.

PTR_TO_BTF_ID_OR_NULL is added to explicitly requires
a non-NULL testing in selective cases. The current generic
pointer tracing framework in verifier always
assigns PTR_TO_BTF_ID so users does not need to
check NULL pointer at every pointer level like a->b->c->d.

We may not want to assign every PTR_TO_BTF_ID as
PTR_TO_BTF_ID_OR_NULL as this will require a null test
before pointer dereference which may cause inconvenience
for developers. But we could avoid branch elimination
to preserve original code intention.

This patch simply removed PTR_TO_BTD_ID from reg_type_not_null()
in verifier, which prevented the above branches from being eliminated.

 [1]: https://lore.kernel.org/bpf/79dbb7c0-449d-83eb-5f4f-7af0cc269168@fb.com/T/

Fixes: cac616db39 ("bpf: Verifier track null pointer branch_taken with JNE and JEQ")
Reported-by: Wenbo Zhang <ethercflow@gmail.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630171240.2523722-1-yhs@fb.com
2020-06-30 22:21:05 +02:00
Steven Rostedt (VMware)
7c4b4a5164 ring-buffer: Incorporate absolute timestamp into add_timestamp logic
Instead of calling out the absolute test for each time to check if the
ring buffer wants absolute time stamps for all its recording, incorporate it
with the add_timestamp field and turn it into flags for faster processing
between wanting a absolute tag and needing to force one.

Link: http://lkml.kernel.org/r/20200629025259.154892368@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 16:16:14 -04:00
Steven Rostedt (VMware)
a389d86f7f ring-buffer: Have nested events still record running time stamp
Up until now, if an event is interrupted while it is recorded by an
interrupt, and that interrupt records events, the time of those events will
all be the same. This is because events only record the delta of the time
since the previous event (or beginning of a page), and to handle updating
the time keeping for that of nested events is extremely racy. After years of
thinking about this and several failed attempts, I finally have a solution
to solve this puzzle.

The problem is that you need to atomically calculate the delta and then
update the time stamp you made the delta from, as well as then record it
into the buffer, all this while at any time an interrupt can come in and
do the same thing. This is easy to solve with heavy weight atomics, but that
would be detrimental to the performance of the ring buffer. The current
state of affairs sacrificed the time deltas for nested events for
performance.

The reason for previous failed attempts at solving this puzzle was because I
was trying to completely avoid slow atomic operations like cmpxchg. I final
came to the conclusion to always avoid cmpxchg is not possible, which is why
those previous attempts always failed. But it is possible to pick one path
(the most common case) and avoid cmpxchg in that path, which is the "fast
path". The most common case is that an event will not be interrupted and
have other events added into it. An event can detect if it has
interrupted another event, and for these cases we can make it the slow
path and use the heavy operations like cmpxchg.

One more player was added to the game that made this possible, and that is
the "absolute timestamp" (by Tom Zanussi) that allows us to inject a full 59
bit time stamp. (Of course this breaks if a machine is running for more than
18 years without a reboot!).

There's barrier() placements around for being paranoid, even when they
are not needed because of other atomic functions near by. But those
should not hurt, as if they are not needed, they basically become a nop.

Note, this also makes the race window much smaller, which means there
are less slow paths to slow down the performance.

The basic idea is that there's two main paths taken.

 1) Not being interrupted between time stamps and reserving buffer space.
    In this case, the time stamps taken are true to the location in the
    buffer.

 2) Was interrupted by another path between taking time stamps and reserving
    buffer space.

The objective is to know what the delta is from the last reserved location
in the buffer.

As it is possible to detect if an event is interrupting another event before
reserving data, space is added to the length to be reserved to inject a full
time stamp along with the event being reserved.

When an event is not interrupted, the write stamp is always the time of the
last event written to the buffer.

In path 1, there's two sub paths we care about:

 a) The event did not interrupt another event.
 b) The event interrupted another event.

In case a, as the write stamp was read and known to be correct, the delta
between the current time stamp and the write stamp is the delta between the
current event and the previously recorded event.

In case b, extra space was reserved to just put the full time stamp into the
buffer. Which is done, as stated, in this path the time stamp taken is known
to match the location in the buffer.

In path 2, there's also two sub paths we care about:

 a) The event was not interrupted by another event since it reserved space
    on the buffer and re-reading the write stamp.
 b) The event was interrupted by another event.

In case a, the write stamp is that of the last event that interrupted this
event between taking the time stamps and reserving. As no event came in
after re-reading the write stamp, that event is known to be the time of the
event directly before this event and the delta can be the new time stamp and
the write stamp.

In case b, one or more events came in between reserving the event and
re-reading he write stamp. Since this event's buffer reservation is between
other events at this path, there's no way to know what the delta is. But
because an event interrupted this event after it started, its fine to just
give a zero delta, and take the same time stamp as the events that happened
within the event being recorded.

Here's the implementation of the design of this solution:

 All this is per cpu, and only needs to worry about nested events (not
 parallel events).

The players:

 write_tail: The index in the buffer where new events can be written to.
     It is incremented via local_add() to reserve space for a new event.

 before_stamp: A time stamp set by all events before reserving space.

 write_stamp: A time stamp updated by events after it has successfully
     reserved space.

	/* Save the current position of write */
 [A]	w = local_read(write_tail);
	barrier();
	/* Read both before and write stamps before touching anything */
	before = local_read(before_stamp);
	after = local_read(write_stamp);
	barrier();

	/*
	 * If before and after are the same, then this event is not
	 * interrupting a time update. If it is, then reserve space for adding
	 * a full time stamp (this can turn into a time extend which is
	 * just an extended time delta but fill up the extra space).
	 */
	if (after != before)
		abs = true;

	ts = clock();

	/* Now update the before_stamp (everyone does this!) */
 [B]	local_set(before_stamp, ts);

	/* Now reserve space on the buffer */
 [C]	write = local_add_return(len, write_tail);

	/* Set tail to be were this event's data is */
	tail = write - len;

 	if (w == tail) {

		/* Nothing interrupted this between A and C */
 [D]		local_set(write_stamp, ts);
		barrier();
 [E]		save_before = local_read(before_stamp);

 		if (!abs) {
			/* This did not interrupt a time update */
			delta = ts - after;
		} else {
			delta = ts; /* The full time stamp will be in use */
		}
		if (ts != save_before) {
			/* slow path - Was interrupted between C and E */
			/* The update to write_stamp could have overwritten the update to
			 * it by the interrupting event, but before and after should be
			 * the same for all completed top events */
			after = local_read(write_stamp);
			if (save_before > after)
				local_cmpxchg(write_stamp, after, save_before);
		}
	} else {
		/* slow path - Interrupted between A and C */

		after = local_read(write_stamp);
		temp_ts = clock();
		barrier();
 [F]		if (write == local_read(write_tail) && after < temp_ts) {
			/* This was not interrupted since C and F
			 * The last write_stamp is still valid for the previous event
			 * in the buffer. */
			delta = temp_ts - after;
			/* OK to keep this new time stamp */
			ts = temp_ts;
		} else {
			/* Interrupted between C and F
			 * Well, there's no use to try to know what the time stamp
			 * is for the previous event. Just set delta to zero and
			 * be the same time as that event that interrupted us before
			 * the reservation of the buffer. */

			delta = 0;
		}
		/* No need to use full timestamps here */
		abs = 0;
	}

Link: https://lkml.kernel.org/r/20200625094454.732790f7@oasis.local.home
Link: https://lore.kernel.org/r/20200627010041.517736087@goodmis.org
Link: http://lkml.kernel.org/r/20200629025258.957440797@goodmis.org

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 14:29:33 -04:00
Steven Rostedt (VMware)
7ef282e051 tracing: Move pipe reference to trace array instead of current_tracer
If a process has the trace_pipe open on a trace_array, the current tracer
for that trace array should not be changed. This was original enforced by a
global lock, but when instances were introduced, it was moved to the
current_trace. But this structure is shared by all instances, and a
trace_pipe is for a single instance. There's no reason that a process that
has trace_pipe open on one instance should prevent another instance from
changing its current tracer. Move the reference counter to the trace_array
instead.

This is marked as "Fixes" but is more of a clean up than a true fix.
Backport if you want, but its not critical.

Fixes: cf6ab6d914 ("tracing: Add ref count to tracer for when they are being read by pipe")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-30 14:29:33 -04:00
Oleg Nesterov
e91b481623 task_work: teach task_work_add() to do signal_wake_up()
So that the target task will exit the wait_event_interruptible-like
loop and call task_work_run() asap.

The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the
new TWA_SIGNAL flag implies signal_wake_up().  However, it needs to
avoid the race with recalc_sigpending(), so the patch also adds the
new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.

TODO: once this patch is merged we need to change all current users
of task_work_add(notify = true) to use TWA_RESUME.

Cc: stable@vger.kernel.org # v5.7
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-30 12:18:08 -06:00
Jakub Sitnicki
2576f87066 bpf, netns: Fix use-after-free in pernet pre_exit callback
Iterating over BPF links attached to network namespace in pre_exit hook is
not safe, even if there is just one. Once link gets auto-detached, that is
its back-pointer to net object is set to NULL, the link can be released and
freed without waiting on netns_bpf_mutex, effectively causing the list
element we are operating on to be freed.

This leads to use-after-free when trying to access the next element on the
list, as reported by KASAN. Bug can be triggered by destroying a network
namespace, while also releasing a link attached to this network namespace.

| ==================================================================
| BUG: KASAN: use-after-free in netns_bpf_pernet_pre_exit+0xd9/0x130
| Read of size 8 at addr ffff888119e0d778 by task kworker/u8:2/177
|
| CPU: 3 PID: 177 Comm: kworker/u8:2 Not tainted 5.8.0-rc1-00197-ga0c04c9d1008-dirty #776
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: netns cleanup_net
| Call Trace:
|  dump_stack+0x9e/0xe0
|  print_address_description.constprop.0+0x3a/0x60
|  ? netns_bpf_pernet_pre_exit+0xd9/0x130
|  kasan_report.cold+0x1f/0x40
|  ? netns_bpf_pernet_pre_exit+0xd9/0x130
|  netns_bpf_pernet_pre_exit+0xd9/0x130
|  cleanup_net+0x30b/0x5b0
|  ? unregister_pernet_device+0x50/0x50
|  ? rcu_read_lock_bh_held+0xb0/0xb0
|  ? _raw_spin_unlock_irq+0x24/0x50
|  process_one_work+0x4d1/0xa10
|  ? lock_release+0x3e0/0x3e0
|  ? pwq_dec_nr_in_flight+0x110/0x110
|  ? rwlock_bug.part.0+0x60/0x60
|  worker_thread+0x7a/0x5c0
|  ? process_one_work+0xa10/0xa10
|  kthread+0x1e3/0x240
|  ? kthread_create_on_node+0xd0/0xd0
|  ret_from_fork+0x1f/0x30
|
| Allocated by task 280:
|  save_stack+0x1b/0x40
|  __kasan_kmalloc.constprop.0+0xc2/0xd0
|  netns_bpf_link_create+0xfe/0x650
|  __do_sys_bpf+0x153a/0x2a50
|  do_syscall_64+0x59/0x300
|  entry_SYSCALL_64_after_hwframe+0x44/0xa9
|
| Freed by task 198:
|  save_stack+0x1b/0x40
|  __kasan_slab_free+0x12f/0x180
|  kfree+0xed/0x350
|  process_one_work+0x4d1/0xa10
|  worker_thread+0x7a/0x5c0
|  kthread+0x1e3/0x240
|  ret_from_fork+0x1f/0x30
|
| The buggy address belongs to the object at ffff888119e0d700
|  which belongs to the cache kmalloc-192 of size 192
| The buggy address is located 120 bytes inside of
|  192-byte region [ffff888119e0d700, ffff888119e0d7c0)
| The buggy address belongs to the page:
| page:ffffea0004678340 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0
| flags: 0x2fffe0000000200(slab)
| raw: 02fffe0000000200 ffffea00045ba8c0 0000000600000006 ffff88811a80ea80
| raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
| page dumped because: kasan: bad access detected
|
| Memory state around the buggy address:
|  ffff888119e0d600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
|  ffff888119e0d680: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
| >ffff888119e0d700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
|                                                                 ^
|  ffff888119e0d780: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
|  ffff888119e0d800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
| ==================================================================

Remove the "fast-path" for releasing a link that got auto-detached by a
dying network namespace to fix it. This way as long as link is on the list
and netns_bpf mutex is held, we have a guarantee that link memory can be
accessed.

An alternative way to fix this issue would be to safely iterate over the
list of links and ensure there is no access to link object after detaching
it. But, at the moment, optimizing synchronization overhead on link release
without a workload in mind seems like an overkill.

Fixes: ab53cad90e ("bpf, netns: Keep a list of attached bpf_link's")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200630164541.1329993-1-jakub@cloudflare.com
2020-06-30 10:53:42 -07:00
Lorenz Bauer
bb0de3131f bpf: sockmap: Require attach_bpf_fd when detaching a program
The sockmap code currently ignores the value of attach_bpf_fd when
detaching a program. This is contrary to the usual behaviour of
checking that attach_bpf_fd represents the currently attached
program.

Ensure that attach_bpf_fd is indeed the currently attached
program. It turns out that all sockmap selftests already do this,
which indicates that this is unlikely to cause breakage.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
2020-06-30 10:46:39 -07:00
Lorenz Bauer
4ac2add659 bpf: flow_dissector: Check value of unused flags to BPF_PROG_DETACH
Using BPF_PROG_DETACH on a flow dissector program supports neither
attach_flags nor attach_bpf_fd. Yet no value is enforced for them.

Enforce that attach_flags are zero, and require the current program
to be passed via attach_bpf_fd. This allows us to remove the check
for CAP_SYS_ADMIN, since userspace can now no longer remove
arbitrary flow dissector programs.

Fixes: b27f7bb590 ("flow_dissector: Move out netns_bpf prog callbacks")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200629095630.7933-3-lmb@cloudflare.com
2020-06-30 10:46:38 -07:00
Lorenz Bauer
1b514239e8 bpf: flow_dissector: Check value of unused flags to BPF_PROG_ATTACH
Using BPF_PROG_ATTACH on a flow dissector program supports neither
target_fd, attach_flags or replace_bpf_fd but accepts any value.

Enforce that all of them are zero. This is fine for replace_bpf_fd
since its presence is indicated by BPF_F_REPLACE. It's more
problematic for target_fd, since zero is a valid fd. Should we
want to use the flag later on we'd have to add an exception for
fd 0. The alternative is to force a value like -1. This requires
more changes to tests. There is also precedent for using 0,
since bpf_iter uses this for target_fd as well.

Fixes: b27f7bb590 ("flow_dissector: Move out netns_bpf prog callbacks")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200629095630.7933-2-lmb@cloudflare.com
2020-06-30 10:46:38 -07:00
Jakub Sitnicki
ab53cad90e bpf, netns: Keep a list of attached bpf_link's
To support multi-prog link-based attachments for new netns attach types, we
need to keep track of more than one bpf_link per attach type. Hence,
convert net->bpf.links into a list, that currently can be either empty or
have just one item.

Instead of reusing bpf_prog_list from bpf-cgroup, we link together
bpf_netns_link's themselves. This makes list management simpler as we don't
have to allocate, initialize, and later release list elements. We can do
this because multi-prog attachment will be available only for bpf_link, and
we don't need to build a list of programs attached directly and indirectly
via links.

No functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-4-jakub@cloudflare.com
2020-06-30 10:45:08 -07:00
Jakub Sitnicki
695c12147a bpf, netns: Keep attached programs in bpf_prog_array
Prepare for having multi-prog attachments for new netns attach types by
storing programs to run in a bpf_prog_array, which is well suited for
iterating over programs and running them in sequence.

After this change bpf(PROG_QUERY) may block to allocate memory in
bpf_prog_array_copy_to_user() for collected program IDs. This forces a
change in how we protect access to the attached program in the query
callback. Because bpf_prog_array_copy_to_user() can sleep, we switch from
an RCU read lock to holding a mutex that serializes updaters.

Because we allow only one BPF flow_dissector program to be attached to
netns at all times, the bpf_prog_array pointed by net->bpf.run_array is
always either detached (null) or one element long.

No functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-3-jakub@cloudflare.com
2020-06-30 10:45:08 -07:00
Jakub Sitnicki
3b7016996c flow_dissector: Pull BPF program assignment up to bpf-netns
Prepare for using bpf_prog_array to store attached programs by moving out
code that updates the attached program out of flow dissector.

Managing bpf_prog_array is more involved than updating a single bpf_prog
pointer. This will let us do it all from one place, bpf/net_namespace.c, in
the subsequent patch.

No functional change intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200625141357.910330-2-jakub@cloudflare.com
2020-06-30 10:45:07 -07:00
Andrii Nakryiko
517bbe1994 bpf: Enforce BPF ringbuf size to be the power of 2
BPF ringbuf assumes the size to be a multiple of page size and the power of
2 value. The latter is important to avoid division while calculating position
inside the ring buffer and using (N-1) mask instead. This patch fixes omission
to enforce power-of-2 size rule.

Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200630061500.1804799-1-andriin@fb.com
2020-06-30 16:31:55 +02:00
Christoph Hellwig
3aa9162500 dma-mapping: Add a new dma_need_sync API
Add a new API to check if calls to dma_sync_single_for_{device,cpu} are
required for a given DMA streaming mapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200629130359.2690853-2-hch@lst.de
2020-06-30 15:44:03 +02:00
Richard Guy Briggs
142240398e audit: add gfp parameter to audit_log_nfcfg
Fixed an inconsistent use of GFP flags in nft_obj_notify() that used
GFP_KERNEL when a GFP flag was passed in to that function.  Given this
allocated memory was then used in audit_log_nfcfg() it led to an audit
of all other GFP allocations in net/netfilter/nf_tables_api.c and a
modification of audit_log_nfcfg() to accept a GFP parameter.

Reported-by: Dan Carptenter <dan.carpenter@oracle.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-06-29 19:14:47 -04:00
Marco Elver
61d56d7aa5 kcsan: Disable branch tracing in core runtime
Disable branch tracing in core KCSAN runtime if branches are being
traced (TRACE_BRANCH_PROFILING). This it to avoid its performance
impact, but also avoid recursion in case KCSAN is enabled for the branch
tracing runtime.

The latter had already been a problem for KASAN:
https://lore.kernel.org/lkml/CANpmjNOeXmD5E3O50Z3MjkiuCYaYOPyi+1rq=GZvEKwBvLR0Ug@mail.gmail.com/

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Marco Elver
2839a23207 kcsan: Simplify compiler flags
Simplify the set of compiler flags for the runtime by removing cc-option
from -fno-stack-protector, because all supported compilers support it.
This saves us one compiler invocation during build.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Marco Elver
56b031f0ab kcsan: Add jiffies test to test suite
Add a test that KCSAN nor the compiler gets confused about accesses to
jiffies on different architectures.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Marco Elver
7e766560e6 kcsan: Remove existing special atomic rules
Remove existing special atomic rules from kcsan_is_atomic_special()
because they are no longer needed. Since we rely on the compiler
emitting instrumentation distinguishing volatile accesses, the rules
have become redundant.

Let's keep kcsan_is_atomic_special() around, so that we have an obvious
place to add special rules should the need arise in future.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Marco Elver
acfa087ccf kcsan: Rename test.c to selftest.c
Rename 'test.c' to 'selftest.c' to better reflect its purpose (Kconfig
variable and code inside already match this). This is to avoid confusion
with the test suite module in 'kcsan-test.c'.

No functional change.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Marco Elver
9dd979bae4 kcsan: Silence -Wmissing-prototypes warning with W=1
The functions here should not be forward declared for explicit use
elsewhere in the kernel, as they should only be emitted by the compiler
due to sanitizer instrumentation.  Add forward declarations a line above
their definition to shut up warnings in W=1 builds.

Link: https://lkml.kernel.org/r/202006060103.jSCpnV1g%lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Marco Elver
2888557f68 kcsan: Prefer '__no_kcsan inline' in test
Instead of __no_kcsan_or_inline, prefer '__no_kcsan inline' in test --
this is in case we decide to remove __no_kcsan_or_inline.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Qian Cai
33190b675c locking/osq_lock: Annotate a data race in osq_lock
The prev->next pointer can be accessed concurrently as noticed by KCSAN:

 write (marked) to 0xffff9d3370dbbe40 of 8 bytes by task 3294 on cpu 107:
  osq_lock+0x25f/0x350
  osq_wait_next at kernel/locking/osq_lock.c:79
  (inlined by) osq_lock at kernel/locking/osq_lock.c:185
  rwsem_optimistic_spin
  <snip>

 read to 0xffff9d3370dbbe40 of 8 bytes by task 3398 on cpu 100:
  osq_lock+0x196/0x350
  osq_lock at kernel/locking/osq_lock.c:157
  rwsem_optimistic_spin
  <snip>

Since the write only stores NULL to prev->next and the read tests if
prev->next equals to this_cpu_ptr(&osq_node). Even if the value is
shattered, the code is still working correctly. Thus, mark it as an
intentional data race using the data_race() macro.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Marco Elver
1fe84fd4a4 kcsan: Add test suite
This adds KCSAN test focusing on behaviour of the integrated runtime.
Tests various race scenarios, and verifies the reports generated to
console. Makes use of KUnit for test organization, and the Torture
framework for test thread control.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:48 -07:00
Qian Cai
cda099b37d fork: Annotate a data race in vm_area_dup()
struct vm_area_struct could be accessed concurrently as noticed by
KCSAN,

 write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
  vma_interval_tree_insert+0x101/0x150:
  rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
  (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
  __vma_link_file+0x6e/0xe0
  __vma_link_file at mm/mmap.c:629
  vma_link+0xa2/0x120
  mmap_region+0x753/0xb90
  do_mmap+0x45c/0x710
  vm_mmap_pgoff+0xc0/0x130
  ksys_mmap_pgoff+0x1d1/0x300
  __x64_sys_mmap+0x33/0x40
  do_syscall_64+0x91/0xc44
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
  vm_area_dup+0x6a/0xe0
  vm_area_dup at kernel/fork.c:362
  __split_vma+0x72/0x2a0
  __split_vma at mm/mmap.c:2661
  split_vma+0x5a/0x80
  mprotect_fixup+0x368/0x3f0
  do_mprotect_pkey+0x263/0x420
  __x64_sys_mprotect+0x51/0x70
  do_syscall_64+0x91/0xc44
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

vm_area_dup() blindly copies all fields of original VMA to the new one.
This includes coping vm_area_struct::shared.rb which is normally
protected by i_mmap_lock. But this is fine because the read value will
be overwritten on the following __vma_link_file() under proper
protection. Thus, mark it as an intentional data race and insert a few
assertions for the fields that should not be modified concurrently.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:04:47 -07:00
Paul E. McKenney
13625c0a40 Merge branches 'doc.2020.06.29a', 'fixes.2020.06.29a', 'kfree_rcu.2020.06.29a', 'rcu-tasks.2020.06.29a', 'scale.2020.06.29a', 'srcu.2020.06.29a' and 'torture.2020.06.29a' into HEAD
doc.2020.06.29a:  Documentation updates.
fixes.2020.06.29a:  Miscellaneous fixes.
kfree_rcu.2020.06.29a:  kfree_rcu() updates.
rcu-tasks.2020.06.29a:  RCU Tasks updates.
scale.2020.06.29a:  Read-side scalability tests.
srcu.2020.06.29a:  SRCU updates.
torture.2020.06.29a:  Torture-test updates.
2020-06-29 12:03:15 -07:00
Paul E. McKenney
2102ad290a torture: Dump ftrace at shutdown only if requested
If there is a large number of torture tests running concurrently,
all of which are dumping large ftrace buffers at shutdown time, the
resulting dumping can take a very long time, particularly on systems
with rotating-rust storage.  This commit therefore adds a default-off
torture.ftrace_dump_at_shutdown module parameter that enables
shutdown-time ftrace-buffer dumping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:45 -07:00
Paul E. McKenney
7752275118 rcutorture: Check for unwatched readers
RCU is supposed to be watching all non-idle kernel code and also all
softirq handlers.  This commit adds some teeth to this statement by
adding a WARN_ON_ONCE().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:44 -07:00
Jules Irenge
8f43d5911b rcu/rcutorture: Replace 0 with false
Coccinelle reports a warning

WARNING: Assignment of 0/1 to bool variable

The root cause is that the variable lastphase is a bool, but is
initialised with integer 0.  This commit therefore replaces the 0 with
a false.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:44 -07:00
Paul E. McKenney
cae7cc6ba5 rcutorture: NULL rcu_torture_current earlier in cleanup code
Currently, the rcu_torture_current variable remains non-NULL until after
all readers have stopped.  During this time, rcu_torture_stats_print()
will think that the test is still ongoing, which can result in confusing
dmesg output.  This commit therefore NULLs rcu_torture_current immediately
after the rcu_torture_writer() kthread has decided to stop, thus informing
rcu_torture_stats_print() much sooner.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:44 -07:00
Paul E. McKenney
4a5f133c15 rcutorture: Add races with task-exit processing
Several variants of Linux-kernel RCU interact with task-exit processing,
including preemptible RCU, Tasks RCU, and Tasks Trace RCU.  This commit
therefore adds testing of this interaction to rcutorture by adding
rcutorture.read_exit_burst and rcutorture.read_exit_delay kernel-boot
parameters.  These kernel parameters control the frequency and spacing
of special read-then-exit kthreads that are spawned.

[ paulmck: Apply feedback from Dan Carpenter's static checker. ]
[ paulmck: Reduce latency to avoid false-positive shutdown hangs. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:44 -07:00
Zou Wei
d02c6b52d1 locktorture: Use true and false to assign to bool variables
This commit fixes the following coccicheck warnings:

kernel/locking/locktorture.c:689:6-10: WARNING: Assignment of 0/1 to bool variable
kernel/locking/locktorture.c:907:2-20: WARNING: Assignment of 0/1 to bool variable
kernel/locking/locktorture.c:938:3-20: WARNING: Assignment of 0/1 to bool variable
kernel/locking/locktorture.c:668:2-19: WARNING: Assignment of 0/1 to bool variable
kernel/locking/locktorture.c:674:2-19: WARNING: Assignment of 0/1 to bool variable
kernel/locking/locktorture.c:634:2-20: WARNING: Assignment of 0/1 to bool variable
kernel/locking/locktorture.c:640:2-20: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:44 -07:00
Sebastian Andrzej Siewior
bde50d8ff8 srcu: Avoid local_irq_save() before acquiring spinlock_t
SRCU disables interrupts to get a stable per-CPU pointer and then
acquires the spinlock which is in the per-CPU data structure. The
release uses spin_unlock_irqrestore(). While this is correct on a non-RT
kernel, this conflicts with the RT semantics because the spinlock is
converted to a 'sleeping' spinlock. Sleeping locks can obviously not be
acquired with interrupts disabled.

Acquire the per-CPU pointer `ssp->sda' without disabling preemption and
then acquire the spinlock_t of the per-CPU data structure. The lock will
ensure that the data is consistent.

The added call to check_init_srcu_struct() is now needed because a
statically defined srcu_struct may remain uninitialized until this
point and the newly introduced locking operation requires an initialized
spinlock_t.

This change was tested for four hours with 8*SRCU-N and 8*SRCU-P without
causing any warnings.

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:22 -07:00
Ethon Paul
7fef6cff8f srcu: Fix a typo in comment "amoritized"->"amortized"
This commit fixes a typo in a comment.

Signed-off-by: Ethon Paul <ethp@qq.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:01:22 -07:00
Paul E. McKenney
1fbeb3a8c4 refperf: Rename refperf.c to refscale.c and change internal names
This commit further avoids conflation of refperf with the kernel's perf
feature by renaming kernel/rcu/refperf.c to kernel/rcu/refscale.c,
and also by similarly renaming the functions and variables inside
this file.  This has the side effect of changing the names of the
kernel boot parameters, so kernel-parameters.txt and ver_functions.sh
are also updated.

The rcutorture --torture type remains refperf, and this will be
addressed in a separate commit.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:46 -07:00
Paul E. McKenney
8e4ec3d02b refperf: Rename RCU_REF_PERF_TEST to RCU_REF_SCALE_TEST
The old Kconfig option name is all too easy to conflate with the
unrelated "perf" feature, so this commit renames RCU_REF_PERF_TEST to
RCU_REF_SCALE_TEST.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:46 -07:00
Paul E. McKenney
c7dcf8106f rcu-tasks: Fix synchronize_rcu_tasks_trace() header comment
The synchronize_rcu_tasks_trace() header comment incorrectly claims that
any number of things delimit RCU Tasks Trace read-side critical sections,
when in fact only rcu_read_lock_trace() and rcu_read_unlock_trace() do so.
This commit therefore fixes this comment, and, while in the area, fixes
a typo in the rcu_read_lock_trace() header comment.

Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:46 -07:00
Paul E. McKenney
e13ef442fe refperf: Add test for RCU Tasks readers
This commit adds testing for RCU Tasks readers to the refperf module.
This also applies to RCU Rude readers, as both flavors have empty
(as in non-existent) read-side markers.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:46 -07:00
Paul E. McKenney
72bb749e70 refperf: Add test for RCU Tasks Trace readers.
This commit adds testing for RCU Tasks Trace readers to the refperf module.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
918b351d96 refperf: Change readdelay module parameter to nanoseconds
The current units of microseconds are too coarse, so this commit
changes the units to nanoseconds.  However, ndelay is used only for the
nanoseconds with udelay being used for whole microseconds.  For example,
setting refperf.readdelay=1500 results in a udelay(1) followed by an
ndelay(500).

Suggested-by: Akira Yokosawa <akiyks@gmail.com>
[ paulmck: Abstracted delay per Akira feedback and move from 80 to 100 lines. ]
[ paulmck: Fix names as suggested by kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Arnd Bergmann
7c944d7c67 refperf: Work around 64-bit division
A 64-bit division was introduced in refperf, breaking compilation
on all 32-bit architectures:

kernel/rcu/refperf.o: in function `main_func':
refperf.c:(.text+0x57c): undefined reference to `__aeabi_uldivmod'

Fix this by using div_u64 to mark the expensive operation.

[ paulmck: Update primitive and format per Nathan Chancellor. ]
Fixes: bd5b16d6c88d ("refperf: Allow decimal nanoseconds")
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Valdis Klētnieks <valdis.kletnieks@vt.edu>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
4dd72a338a refperf: Adjust refperf.loop default value
With the various measurement optimizations, 10,000 loops normally
suffices.  This commit therefore reduces the refperf.loops default value
from 10,000,000 to 10,000.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
b4d1e34f65 refperf: Add read-side delay module parameter
This commit adds a refperf.readdelay module parameter that controls the
duration of each critical section.  This parameter allows gathering data
showing how the performance differences between the various primitives
vary with critical-section length.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
96af866959 refperf: Simplify initialization-time wakeup protocol
This commit moves the reader-launch wait loop from ref_perf_init()
to main_func(), removing one layer of wakeup and allowing slightly
faster system boot.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
6efb063408 refperf: Label experiment-number column "Runs"
The experiment-number column is currently labeled "Threads", which is
misleading at best.  This commit therefore relabels it as "Runs", and
adjusts the scripts accordingly.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
2db0bda384 refperf: Add warmup and cooldown processing phases
This commit causes all the readers to start running unmeasured load
until all readers have done at least one such run (thus having warmed
up), then run the measured load, and then run unmeasured load until all
readers have completed their measured load.  This approach avoids any
thread running measured load while other readers are idle.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
86e0da2bb8 refperf: More closely synchronize reader start times
Currently, readers are awakened individually.  On most systems, this
results in significant wakeup delay from one reader to the next, which
can result in the first and last reader having sole access to the
synchronization primitive in question.  If that synchronization primitive
involves shared memory, those readers will rack up a huge number of
operations in a very short time, causing large perturbations in the
results.

This commit therefore has the readers busy-wait after being awakened,
and uses a new n_started variable to synchronize their start times.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
af2789db13 refperf: Convert reader_task structure's "start" field to int
This commit converts the reader_task structure's "start" field to int
in order to demote a full barrier to an smp_load_acquire() and also to
simplify the code a bit.  While in the area, and to enlist the compiler's
help in ensuring that nothing was missed, the field's name was changed
to start_reader.

Also while in the area, change the main_func() store to use
smp_store_release() to further fortify against wait/wake races.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
b864f89ff6 refperf: Tune reader measurement interval
This commit moves a printk() out of the measurement interval, converts
a atomic_dec()/atomic_read() pair to atomic_dec_and_test(), and adds
a smp_mb__before_atomic() to avoid potential wake/wait hangs.  These
changes have the added benefit of reducing the number of loops required
for amortizing loop overhead for CONFIG_PREEMPT=n RCU measurements from
1,000,000 to 10,000.  This reduction in turn shortens the test, reducing
the probability of interference.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
2990750bce refperf: Make functions static
Because the reset_readers() and process_durations() functions are used
only within kernel/rcu/refperf.c, this commit makes them static.

Reported-by: kbuild test robot <lkp@intel.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:45 -07:00
Paul E. McKenney
2e90de76f2 refperf: Dynamically allocate thread-summary output buffer
Currently, the buffer used to accumulate the thread-summary output is
fixed size, which will cause problems if someone decides to run on a large
number of PCUs.  This commit therefore dynamically allocates this buffer.

[ paulmck: Fix memory allocation as suggested by KASAN. ]
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
f518f154ec refperf: Dynamically allocate experiment-summary output buffer
Currently, the buffer used to accumulate the experiment-summary output
is fixed size, which will cause problems if someone decides to run
one hundred experiments.  This commit therefore dynamically allocates
this buffer.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
dbf28efdae refperf: Provide module parameter to specify number of experiments
The current code uses the number of threads both to limit the number
of threads and to specify the number of experiments, but also varies
the number of threads as the experiments progress.  This commit takes
a different approach by adding an refperf.nruns module parameter that
specifies the number of experiments, and furthermore uses the same
number of threads for each experiment.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
8fc28783a0 refperf: Convert nreaders to a module parameter
This commit converts nreaders to a module parameter, with the default
of -1 specifying the old behavior of using 75% of the readers.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
83b88c86da refperf: Allow decimal nanoseconds
The CONFIG_PREEMPT=n rcu_read_lock()/rcu_read_unlock() pair's overhead,
even including loop overhead, is far less than one nanosecond.
Since logscale plots are not all that happy with zero values, provide
picoseconds as decimals.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
75dd8efef5 refperf: Hoist function-pointer calls out of the loop
Current runs show PREEMPT=n rcu_read_lock()/rcu_read_unlock() pairs
consuming between 20 and 30 nanoseconds, when in fact the actual value is
zero, give or take the barrier() asm's effect on compiler optimizations.
The additional overhead is caused by function calls through pointers
(especially in these days of Spectre mitigations) and perhaps also
needless argument passing, a non-const loop limit, and an upcounting loop.

This commit therefore combines the ->readlock() and ->readunlock()
function pointers into a single ->readsection() function pointer that
takes the loop count as a const parameter and keeps any data passed
from the read-lock to the read-unlock internal to this new function.

These changes reduce the measured overhead of the aforementioned
PREEMPT=n rcu_read_lock()/rcu_read_unlock() pairs from between 20 and
30 nanoseconds to somewhere south of 500 picoseconds.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
777a54c908 refperf: Add holdoff parameter to allow CPUs to come online
This commit adds an rcuperf module parameter named "holdoff" that
defaults to 10 seconds if refperf is built in and to zero otherwise.
The assumption is that all the CPUs are online by the time that the
modprobe and insmod commands are going to do anything, and that normal
systems will have all the CPUs online within ten seconds.

Larger systems may take many tens of seconds or even minutes to get
to this point, hence this being a module parameter instead of being a
hard-coded constant.

Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
708cda3165 rcuperf: Add comments explaining the high reader overhead
This commit adds comments explaining why the readers have otherwise insane
levels of measurement overhead, namely that they are intended as a test
load for update-side performance measurements, not as a straight-up
read-side performance test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Joel Fernandes (Google)
653ed64b01 refperf: Add a test to measure performance of read-side synchronization
Add a test for comparing the performance of RCU with various read-side
synchronization mechanisms. The test has proved useful for collecting
data and performing these comparisons.

Currently RCU, SRCU, reader-writer lock, reader-writer semaphore and
reference counting can be measured using refperf.perf_type parameter.
Each invocation of the test runs measures performance of a specific
mechanism.

The maximum number of CPUs to concurrently run readers on is chosen by
the test itself and is 75% of the total number of CPUs. So if you had 24
CPUs, the test runs with a maximum of 18 parallel readers.

A number of experiments are conducted, and in each experiment, the
number of readers is increased by 1, upto the 75% of CPUs mark. During
each experiment, all readers execute an empty loop with refperf.loops
iterations and time the total loop duration. This is then averaged.

Example output:
Parameters "refperf.perf_type=srcu refperf.loops=2000000" looks like:

[    3.347133] srcu-ref-perf:
[    3.347133] Threads  Time(ns)
[    3.347133] 1        36
[    3.347133] 2        34
[    3.347133] 3        34
[    3.347133] 4        34
[    3.347133] 5        33
[    3.347133] 6        33
[    3.347133] 7        33
[    3.347133] 8        33
[    3.347133] 9        33
[    3.347133] 10       33
[    3.347133] 11       33
[    3.347133] 12       33
[    3.347133] 13       33
[    3.347133] 14       33
[    3.347133] 15       32
[    3.347133] 16       33
[    3.347133] 17       33
[    3.347133] 18       34

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Joel Fernandes (Google)
7e866460cc rcuperf: Remove useless while loops around wait_event
wait_event() already retries if the condition for the wake up is not
satisifed after wake up. Remove them from the rcuperf test.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:44 -07:00
Paul E. McKenney
30d8aa5128 rcu-tasks: Fix code-style issues
This commit declares trc_n_readers_need_end and trc_wait static and
replaced a "&" with "&&".  The "&" happened to work because the values
are bool, but accidents waiting to happen and all that...

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:22 -07:00
Paul E. McKenney
8344496e8b rcu-tasks: Conditionally compile show_rcu_tasks_gp_kthreads()
The show_rcu_tasks_gp_kthreads() function is not invoked by Tiny RCU,
but is nevertheless defined in Tiny RCU builds that enable Tasks Trace
RCU.  This commit therefore conditionally compiles this function so
that it is defined only in builds that actually use it.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:22 -07:00
Paul E. McKenney
5b3cc99bed rcu-tasks: Add #include of rcupdate_trace.h to update.c
Although this is in some strict sense unnecessary, it is good to allow
the compiler to compare the function declaration with its definition.
This commit therefore adds a #include of linux/rcupdate_trace.h to
kernel/rcu/update.c.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:22 -07:00
Paul E. McKenney
04a3c5aa7a rcu-tasks: Make rcu_tasks_postscan() be static
The rcu_tasks_postscan() function is not used outside of RCU's tasks.h
file, so this commit makes it be static.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:22 -07:00
Paul E. McKenney
ea6eed9f7d rcu-tasks: Convert sleeps to idle priority
This commit converts the long-standing schedule_timeout_interruptible()
and schedule_timeout_uninterruptible() calls used by the various Tasks
RCU's grace-period kthreads to schedule_timeout_idle().  This conversion
avoids polluting the load-average with Tasks-RCU-related sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 12:00:22 -07:00
Uladzislau Rezki (Sony)
3042f83f19 rcu: Support reclaim for head-less object
Update the kvfree_call_rcu() function with head-less support.
This allows RCU to reclaim objects without an embedded rcu_head.

tree-RCU:
We introduce two chains of arrays to store SLAB-backed and vmalloc
pointers, each.  Storage in either of these arrays does not require
embedding an rcu_head within the object.

Maintaining the arrays may become impossible due to high memory
pressure. For such cases there is an emergency path. Objects with
rcu_head inside are just queued on a backup rcu_head list. Later on
that list is drained. As for the head-less variant, as the current
context can sleep, the following emergency measures are applied:
   a) Synchronously wait until a grace period has elapsed.
   b) Call kvfree().

tiny-RCU:
For double argument calls, there are no new changes in behavior. For
single argument call, kvfree() is directly inlined on the current
stack after a synchronize_rcu() call. Note that for tiny-RCU, any
call to synchronize_rcu() is actually a quiescent state, therefore
it does nothing.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:26 -07:00
Uladzislau Rezki (Sony)
c408b215f5 rcu: Rename *_kfree_callback/*_kfree_rcu_offset/kfree_call_*
The following changes are introduced:

1. Rename rcu_invoke_kfree_callback() to rcu_invoke_kvfree_callback(),
as well as the associated trace events, so the rcu_kfree_callback(),
becomes rcu_kvfree_callback(). The reason is to be aligned with kvfree()
notation.

2. Rename __is_kfree_rcu_offset to __is_kvfree_rcu_offset. All RCU
paths use kvfree() now instead of kfree(), thus rename it.

3. Rename kfree_call_rcu() to the kvfree_call_rcu(). The reason is,
it is capable of freeing vmalloc() memory now. Do the same with
__kfree_rcu() macro, it becomes __kvfree_rcu(), the goal is the
same.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Uladzislau Rezki (Sony)
64d1d06ccb rcu/tiny: support vmalloc in tiny-RCU
Replace kfree() with kvfree() in rcu_reclaim_tiny().
This makes it possible to release either SLAB or vmalloc
objects after a GP.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Uladzislau Rezki (Sony)
5f3c8d6204 rcu/tree: Maintain separate array for vmalloc ptrs
To do so, we use an array of kvfree_rcu_bulk_data structures.
It consists of two elements:
 - index number 0 corresponds to slab pointers.
 - index number 1 corresponds to vmalloc pointers.

Keeping vmalloc pointers separated from slab pointers makes
it possible to invoke the right freeing API for the right
kind of pointer.

It also prepares us for future headless support for vmalloc
and SLAB objects. Such objects cannot be queued on a linked
list and are instead directly into an array.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Uladzislau Rezki (Sony)
53c72b590b rcu/tree: cache specified number of objects
In order to reduce the dynamic need for pages in kfree_rcu(),
pre-allocate a configurable number of pages per CPU and link
them in a list. When kfree_rcu() reclaims objects, the object's
container page is cached into a list instead of being released
to the low-level page allocator.

Such an approach provides O(1) access to free pages while also
reducing the number of requests to the page allocator. It also
makes the kfree_rcu() code to have free pages available during
a low memory condition.

A read-only sysfs parameter (rcu_min_cached_objs) reflects the
minimum number of allowed cached pages per CPU.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Sebastian Andrzej Siewior
69f08d3999 rcu/tree: Use static initializer for krc.lock
The per-CPU variable is initialized at runtime in
kfree_rcu_batch_init(). This function is invoked before
'rcu_scheduler_active' is set to 'RCU_SCHEDULER_RUNNING'.
After the initialisation, '->initialized' is to true.

The raw_spin_lock is only acquired if '->initialized' is
set to true. The worqueue item is only used if 'rcu_scheduler_active'
set to RCU_SCHEDULER_RUNNING which happens after initialisation.

Use a static initializer for krc.lock and remove the runtime
initialisation of the lock. Since the lock can now be always
acquired, remove the '->initialized' check.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Uladzislau Rezki (Sony)
952371d6fc rcu/tree: Move kfree_rcu_cpu locking/unlocking to separate functions
Introduce helpers to lock and unlock per-cpu "kfree_rcu_cpu"
structures. That will make kfree_call_rcu() more readable
and prevent programming errors.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Uladzislau Rezki (Sony)
3af8486281 rcu/tree: Simplify KFREE_BULK_MAX_ENTR macro
We can simplify KFREE_BULK_MAX_ENTR macro and get rid of
magic numbers which were used to make the structure to be
exactly one page.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Joel Fernandes (Google)
446044eb9c rcu/tree: Make debug_objects logic independent of rcu_head
kfree_rcu()'s debug_objects logic uses the address of the object's
embedded rcu_head to queue/unqueue. Instead of this, make use of the
object's address itself as preparation for future headless kfree_rcu()
support.

Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Uladzislau Rezki (Sony)
594aa5975b rcu/tree: Repeat the monitor if any free channel is busy
It is possible that one of the channels cannot be detached
because its free channel is busy and previously queued data
has not been processed yet. On the other hand, another
channel can be successfully detached causing the monitor
work to stop.

Prevent that by rescheduling the monitor work if there are
any channels in the pending state after a detach attempt.

Fixes: 34c8817455 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Joel Fernandes (Google)
4d29194118 rcu/tree: Skip entry into the page allocator for PREEMPT_RT
To keep the kfree_rcu() code working in purely atomic sections on RT,
such as non-threaded IRQ handlers and raw spinlock sections, avoid
calling into the page allocator which uses sleeping locks on RT.

In fact, even if the  caller is preemptible, the kfree_rcu() code is
not, as the krcp->lock is a raw spinlock.

Calling into the page allocator is optional and avoiding it should be
Ok, especially with the page pre-allocation support in future patches.
Such pre-allocation would further avoid the a need for a dynamically
allocated page in the first place.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Co-developed-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Joel Fernandes (Google)
8ac88f7177 rcu/tree: Keep kfree_rcu() awake during lock contention
On PREEMPT_RT kernels, the krcp spinlock gets converted to an rt-mutex
and causes kfree_rcu() callers to sleep. This makes it unusable for
callers in purely atomic sections such as non-threaded IRQ handlers and
raw spinlock sections. Fix it by converting the spinlock to a raw
spinlock.

Vetting all code paths, there is no reason to believe that the raw
spinlock will hurt RT latencies as it is not held for a long time.

Cc: bigeasy@linutronix.de
Cc: Uladzislau Rezki <urezki@gmail.com>
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:25 -07:00
Mauro Carvalho Chehab
8e11690d2f rcu: Fix a kernel-doc warnings for "count"
There are some kernel-doc warnings:

	./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu'

This commit therefore moves the comment for "count" to the kernel-doc
markup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:59:24 -07:00
Randy Dunlap
c3cb47a6cc kernel/rcu/tree.c: Fix kernel-doc warnings
Fix kernel-doc warning:

../kernel/rcu/tree.c:959: warning: Excess function parameter 'irq' description in 'rcu_nmi_enter'

Fixes: cf7614e13c ("rcu: Refactor rcu_{nmi,irq}_{enter,exit}()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:51 -07:00
Wei Yang
7a0c2b0940 rcu: grpnum just records group number
The ->grpnum field in the rcu_node structure contains the bit position
in this structure's parent's bitmasks, which is not the CPU number.
This commit therefore adjusts this field's comment accordingly.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:51 -07:00
Wei Yang
a2dae43088 rcu: grplo/grphi just records CPU number
The ->grplo and ->grphi fields store the lowest and highest CPU number
covered by to a rcu_node structure, which is not the group number.
This commit therefore adjusts these fields' comments to match reality.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:51 -07:00
Wei Yang
00943a609d rcu: gp_max is protected by root rcu_node's lock
Because gp_max is protected by root rcu_node's lock, this commit moves
the gp_max definition to the region of the rcu_node structure containing
fields protected by this lock.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:51 -07:00
Peter Enderborg
c6dfd72b7a rcu: Stop shrinker loop
The count and scan can be separated in time, and there is a fair chance
that all work is already done when the scan starts, which might in turn
result in a needless retry.  This commit therefore avoids this retry by
returning SHRINK_STOP.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Peter Enderborg <peter.enderborg@sony.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:51 -07:00
Jules Irenge
e40bb92111 rcu: Replace 1 with true
Coccinelle reports a warning

WARNING: Assignment of 0/1 to bool variable

The root cause is that the variable lastphase is a bool, but is
initialised with integer 1.  This commit therefore replaces the 1 with
a true.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:51 -07:00
Paul E. McKenney
d29e0b26b0 lockdep: Complain only once about RCU in extended quiescent state
Currently, lockdep_rcu_suspicious() complains twice about RCU read-side
critical sections being invoked from within extended quiescent states,
for example:

	RCU used illegally from idle CPU!
	rcu_scheduler_active = 2, debug_locks = 1
	RCU used illegally from extended quiescent state!

This commit therefore saves a couple lines of code and one line of
console-log output by eliminating the first of these two complaints.

Link: https://lore.kernel.org/lkml/87wo4wnpzb.fsf@nanos.tec.linutronix.de
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:51 -07:00
Paul E. McKenney
04b25a495b rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr
The objtool complains about the call to rcu_cleanup_after_idle() from
rcu_nmi_enter(), so this commit adds instrumentation_begin() before that
call and instrumentation_end() after it.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Paul E. McKenney
55fbe86ef3 rcu: Remove initialized but unused rnp from check_slow_task()
This commit removes the variable rnp from check_slow_task(), which
is defined, assigned to, but not otherwise used.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Frederic Weisbecker
3c8920e2db tick/nohz: Narrow down noise while setting current task's tick dependency
Setting a tick dependency on any task, including the case where a task
sets that dependency on itself, triggers an IPI to all CPUs.  That is
of course suboptimal but it had previously not been an issue because it
was only used by POSIX CPU timers on nohz_full, which apparently never
occurs in latency-sensitive workloads in production.  (Or users of such
systems are suffering in silence on the one hand or venting their ire
on the wrong people on the other.)

But RCU now sets a task tick dependency on the current task in order
to fix stall issues that can occur during RCU callback processing.
Thus, RCU callback processing triggers frequent system-wide IPIs from
nohz_full CPUs.  This is quite counter-productive, after all, avoiding
IPIs is what nohz_full is supposed to be all about.

This commit therefore optimizes tasks' self-setting of a task tick
dependency by using tick_nohz_full_kick() to avoid the system-wide IPI.
Instead, only the execution of the one task is disturbed, which is
acceptable given that this disturbance is well down into the noise
compared to the degree to which the RCU callback processing itself
disturbs execution.

Fixes: 6a949b7af8 (rcu: Force on tick when invoking lots of callbacks)
Reported-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@kernel.org
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Lihao Liang
360fbbb489 rcu: Update comment from rsp->rcu_gp_seq to rsp->gp_seq
Signed-off-by: Lihao Liang <lihaoliang@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Paul E. McKenney
68c2f27e01 rcu: Expedited grace-period sleeps to idle priority
This commit converts the schedule_timeout_uninterruptible() call used
by RCU's expedited grace-period processing to schedule_timeout_idle().
This conversion avoids polluting the load-average with RCU-related
sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Paul E. McKenney
f5ca34643b rcu: No-CBs-related sleeps to idle priority
This commit converts the schedule_timeout_interruptible() call used by
RCU's no-CBs grace-period kthreads to schedule_timeout_idle().  This
conversion avoids polluting the load-average with RCU-related sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Paul E. McKenney
a9352f72d6 rcu: Priority-boost-related sleeps to idle priority
This commit converts the long-standing schedule_timeout_interruptible()
call used by RCU's priority-boosting kthreads to schedule_timeout_idle().
This conversion avoids polluting the load-average with RCU-related
sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:50 -07:00
Paul E. McKenney
77865dea25 rcu: Grace-period-kthread related sleeps to idle priority
This commit converts the long-standing schedule_timeout_interruptible()
and schedule_timeout_uninterruptible() calls used by RCU's grace-period
kthread to schedule_timeout_idle().  This conversion avoids polluting
the load-average with RCU-related sleeping.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:49 -07:00
Paul E. McKenney
f8466f9468 rcu: Add comment documenting rcu_callback_map's purpose
The rcu_callback_map lockdep_map structure was added back in 2013, but
its purpose has become obscure.  This commit therefore documments that the
purpose of rcu_callback map is, in the words of commit 24ef659a85 ("rcu:
Provide better diagnostics for blocking in RCU callback functions"),
to help lockdep to tie an "inappropriate voluntary context switch back
to the fact that the function is being invoked from within a callback."

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:49 -07:00
Paul E. McKenney
e816d56fad rcu: Add callbacks-invoked counters
This commit adds a count of the callbacks invoked to the per-CPU rcu_data
structure.  This count is printed by the show_rcu_gp_kthreads() that
is invoked by rcutorture and the RCU CPU stall-warning code.  It is also
intended for use by drgn.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:49 -07:00
Wei Yang
abfce04148 rcu: Simplify the calculation of rcu_state.ncpus
There is only 1 bit set in mask, which means that the only difference
between oldmask and the new one will be at the position where the bit is
set in mask.  This commit therefore updates rcu_state.ncpus by checking
whether the bit in mask is already set in rnp->expmaskinitnext.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:49 -07:00
Wei Yang
7ee880b7bf rcu: Initialize and destroy rcu_synchronize only when necessary
The __wait_rcu_gp() function unconditionally initializes and cleans up
each element of rs_array[], whether used or not.  This is slightly
wasteful and rather confusing, so this commit skips both initialization
and cleanup for duplicate callback functions.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:49 -07:00
Mauro Carvalho Chehab
f2286ab995 docs: RCU: Convert stallwarn.txt to ReST
- Add a SPDX header;
- Adjust document and section titles;
- Fix list markups;
- Some whitespace fixes and new line breaks;
- Mark literal blocks as such;
- Add it to RCU/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:11 -07:00
Mauro Carvalho Chehab
43cb5451df docs: RCU: Convert torture.txt to ReST
- Add a SPDX header;
- Adjust document and section titles;
- Some whitespace fixes and new line breaks;
- Mark literal blocks as such;
- Add it to RCU/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-29 11:58:11 -07:00
Linus Torvalds
2cfa46dadd Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes two race conditions, one in padata and one in af_alg"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  padata: upgrade smp_mb__after_atomic to smp_mb in padata_do_serial
  crypto: af_alg - fix use-after-free in af_alg_accept() due to bh_lock_sock()
2020-06-29 10:06:26 -07:00
Steven Rostedt (VMware)
5da7cd11d0 x86/ftrace: Only have the builtin ftrace_regs_caller call direct hooks
If a direct hook is attached to a function that ftrace also has a function
attached to it, then it is required that the ftrace_ops_list_func() is used
to iterate over the registered ftrace callbacks. This will also include the
direct ftrace_ops helper, that tells ftrace_regs_caller where to return to
(the direct callback and not the function that called it).

As this direct helper is only to handle the case of ftrace callbacks
attached to the same function as the direct callback, the ftrace callback
allocated trampolines (used to only call them), should never be used to
return back to a direct callback.

Only copy the portion of the ftrace_regs_caller that will return back to
what called it, and not the portion that returns back to the direct caller.

The direct ftrace_ops must then pick the ftrace_regs_caller builtin function
as its own trampoline to ensure that it will never have one allocated for
it (which would not include the handling of direct callbacks).

Link: http://lkml.kernel.org/r/20200422162750.495903799@goodmis.org

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29 11:42:47 -04:00
Christoph Hellwig
7582f30cc9 cgroup: unexport cgroup_rstat_updated
cgroup_rstat_updated is only used by core block code, no need to
export it.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-29 09:09:08 -06:00
Steven Rostedt (VMware)
c791cc4b1f tracing: Only allow trace_array_printk() to be used by instances
To prevent default "trace_printks()" from spamming the top level tracing
ring buffer, only allow trace instances to use trace_array_printk() (which
can be used without the trace_printk() start up warning).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-29 09:01:02 -04:00
David Rientjes
71cdec4fab dma-mapping: warn when coherent pool is depleted
When a DMA coherent pool is depleted, allocation failures may or may not
get reported in the kernel log depending on the allocator.

The admin does have a workaround, however, by using coherent_pool= on the
kernel command line.

Provide some guidance on the failure and a recommended minimum size for
the pools (double the size).

Signed-off-by: David Rientjes <rientjes@google.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-29 10:05:21 +02:00
Linus Torvalds
91a9a90d04 Peter Zijlstra says:
The most anticipated fix in this pull request is probably the horrible build
 fix for the RANDSTRUCT fail that didn't make -rc2. Also included is the cleanup
 that removes those BUILD_BUG_ON()s and replaces it with ugly unions.
 
 Also included is the try_to_wake_up() race fix that was first triggered by
 Paul's RCU-torture runs, but was independently hit by Dave Chinner's fstest
 runs as well.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl74tMMACgkQEsHwGGHe
 VUqpAxAAnAiwPetkmCUn53wmv10oGC/vbnxprvNzoIANo9IFJYwKLYuRviT4r4KW
 0tEmpWtsy0CkVdCTpx4yXYUqtGswbjAvxSuwk8vR3bdtottMNJ77PPBKrywL3ymZ
 uQ0tpB/W9CFTOjKx4U/OyaK2Gf4mYzvuJSqhhTbopGf4H9SWflhepLZf0C4rhYa5
 tywch3etazAcNpq+dm31jKIVUkwULyJ4mXH2VDXo+jjl1A5g6h2UliS03e1/BChD
 hX78NRv7ezySdVVpLFhLVKCRdFFj6wIbLsx0yIQjw83dYhmDHK9iqN7m9+p4pZOr
 4qz/+eRYv+zZwWZP8IqOIAE4la1S/LToKEyxAehwl2sfIjhUXx68PvM/feWr8yfd
 z2CHEsI3Dn5XfM8FdPSA+JHE9IHwUyHrDRxcVGU7Nj/9s4L2DfxdrPl6qKGA3Tzm
 F7rK4vR5MNB8Sr7bzcCWV9FOsMNcXh2WThpZcsjfCUgwJza45N3HfocsXO5m4ShC
 FQ8RjE46Msd1WgIoslAkgQT7rFohe/sUKs5xVj4SwT/5i6lz55IGYmiV+hErrxU4
 ArSzUeOys/0EwzJX8PvxiElMq3btFW2XYV65XX5dIABt9IxgRvxHcUGPJDNvQKP7
 WdKVxRIzVXcfRiKUI05vLZU6yzfJuoAjvI1kyTYo64QIbeM7H6g=
 =EGOe
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:
 "The most anticipated fix in this pull request is probably the horrible
  build fix for the RANDSTRUCT fail that didn't make -rc2. Also included
  is the cleanup that removes those BUILD_BUG_ON()s and replaces it with
  ugly unions.

  Also included is the try_to_wake_up() race fix that was first
  triggered by Paul's RCU-torture runs, but was independently hit by
  Dave Chinner's fstest runs as well"

* tag 'sched_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cfs: change initial value of runnable_avg
  smp, irq_work: Continue smp_call_function*() and irq_work*() integration
  sched/core: s/WF_ON_RQ/WQ_ON_CPU/
  sched/core: Fix ttwu() race
  sched/core: Fix PI boosting between RT and DEADLINE tasks
  sched/deadline: Initialize ->dl_boosted
  sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption
  sched/core: Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build fail
2020-06-28 10:37:39 -07:00
Linus Torvalds
c141b30e99 Paul E. McKenney says:
A single commit that uses "arch_" atomic operations to avoid the
 instrumentation that comes with the non-"arch_" versions. In preparation
 for that commit, it also has another commit that makes these "arch_"
 atomic operations available to generic code.
 
 Without these commits, KCSAN uses can see pointless errors.
 
 Both from Peter Zijlstra.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl74qe4ACgkQEsHwGGHe
 VUruxhAApxHnsIX4IFm4cBaSMMsXCGpifM3EOd3S1PPqxWQxLfrDpc/SgW4hvJja
 y144m/HQVvHkO8DAqWaC5lNmILjZhZeR1ToRrtqsFVzedlORaXgFJQzojjOBBCWi
 kwtrqVDb4dw+RBQdj6hrknnsivdAlDVFHYCxQuBpNQ/NN4M9l0nwxPRVpTdcFtw0
 Yv6ttpDeo8/XJ12OwiFINWnQT7F1n6CoyvdH+zQayvP+2qK8sq3sYVN4DiTC2Jyk
 9YpnR9ubl4jGz78+l2IrhhHw0zcHutGy2OVMXMYYvqZVzcp7QCpXFCP7MY00R6Br
 1eyxzMJX3j9rxDcreNTFZQFqQsCSfla3SMJIHFT1PHiw2O1ZVXp4EUaHb6eCy/nb
 IMgRd37mRCQovE267+LmDMNovSbRXGFu/qhu7QPaKQizqfYTbAzGULbttHJr6P7i
 ciQRG6ZfpbqflsezlijmhDTXI/oK/prn5apo8g6IVAxVBINzpu01+xszpuOKdCg0
 CGliJRShIXwPCAPacq0aFtauRt3RVpbEWOXj3GZU4yof/8wnHOAPZ0/HmFeKmO+4
 BIaa7QASvYUfczVv/Fi0FKdU6c0jQGDCUxVi1XJpxNG0XSiayGEPyN4Y0wDNHuWg
 H+9MPAUhGoyDoMPRBjSKIVzNF7bLJ8VMe3GUBrFcJY+BVLhfXUE=
 =amVp
 -----END PGP SIGNATURE-----

Merge tag 'rcu_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU-vs-KCSAN fixes from Borislav Petkov:
 "A single commit that uses "arch_" atomic operations to avoid the
  instrumentation that comes with the non-"arch_" versions.

  In preparation for that commit, it also has another commit that makes
  these "arch_" atomic operations available to generic code.

  Without these commits, KCSAN uses can see pointless errors"

* tag 'rcu_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Fixup noinstr warnings
  locking/atomics: Provide the arch_atomic_ interface to generic code
2020-06-28 10:29:38 -07:00
Vincent Guittot
e21cf43406 sched/cfs: change initial value of runnable_avg
Some performance regression on reaim benchmark have been raised with
  commit 070f5e860e ("sched/fair: Take into account runnable_avg to classify group")

The problem comes from the init value of runnable_avg which is initialized
with max value. This can be a problem if the newly forked task is finally
a short task because the group of CPUs is wrongly set to overloaded and
tasks are pulled less agressively.

Set initial value of runnable_avg equals to util_avg to reflect that there
is no waiting time so far.

Fixes: 070f5e860e ("sched/fair: Take into account runnable_avg to classify group")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200624154422.29166-1-vincent.guittot@linaro.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra
8c4890d1c3 smp, irq_work: Continue smp_call_function*() and irq_work*() integration
Instead of relying on BUG_ON() to ensure the various data structures
line up, use a bunch of horrible unions to make it all automatic.

Much of the union magic is to ensure irq_work and smp_call_function do
not (yet) see the members of their respective data structures change
name.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20200622100825.844455025@infradead.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra
739f70b476 sched/core: s/WF_ON_RQ/WQ_ON_CPU/
Use a better name for this poorly named flag, to avoid confusion...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200622100825.785115830@infradead.org
2020-06-28 17:01:20 +02:00
Peter Zijlstra
b6e13e8582 sched/core: Fix ttwu() race
Paul reported rcutorture occasionally hitting a NULL deref:

  sched_ttwu_pending()
    ttwu_do_wakeup()
      check_preempt_curr() := check_preempt_wakeup()
        find_matching_se()
          is_same_group()
            if (se->cfs_rq == pse->cfs_rq) <-- *BOOM*

Debugging showed that this only appears to happen when we take the new
code-path from commit:

  2ebb177175 ("sched/core: Offload wakee task activation if it the wakee is descheduling")

and only when @cpu == smp_processor_id(). Something which should not
be possible, because p->on_cpu can only be true for remote tasks.
Similarly, without the new code-path from commit:

  c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")

this would've unconditionally hit:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

and if: 'cpu == smp_processor_id() && p->on_cpu' is possible, this
would result in an instant live-lock (with IRQs disabled), something
that hasn't been reported.

The NULL deref can be explained however if the task_cpu(p) load at the
beginning of try_to_wake_up() returns an old value, and this old value
happens to be smp_processor_id(). Further assume that the p->on_cpu
load accurately returns 1, it really is still running, just not here.

Then, when we enqueue the task locally, we can crash in exactly the
observed manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq
is from the wrong CPU, therefore we'll iterate into the non-existant
parents and NULL deref.

The closest semi-plausible scenario I've managed to contrive is
somewhat elaborate (then again, actual reproduction takes many CPU
hours of rcutorture, so it can't be anything obvious):

					X->cpu = 1
					rq(1)->curr = X

	CPU0				CPU1				CPU2

					// switch away from X
					LOCK rq(1)->lock
					smp_mb__after_spinlock
					dequeue_task(X)
					  X->on_rq = 9
					switch_to(Z)
					  X->on_cpu = 0
					UNLOCK rq(1)->lock

									// migrate X to cpu 0
									LOCK rq(1)->lock
									dequeue_task(X)
									set_task_cpu(X, 0)
									  X->cpu = 0
									UNLOCK rq(1)->lock

									LOCK rq(0)->lock
									enqueue_task(X)
									  X->on_rq = 1
									UNLOCK rq(0)->lock

	// switch to X
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	switch_to(X)
	  X->on_cpu = 1
	UNLOCK rq(0)->lock

	// X goes sleep
	X->state = TASK_UNINTERRUPTIBLE
	smp_mb();			// wake X
					ttwu()
					  LOCK X->pi_lock
					  smp_mb__after_spinlock

					  if (p->state)

					  cpu = X->cpu; // =? 1

					  smp_rmb()

	// X calls schedule()
	LOCK rq(0)->lock
	smp_mb__after_spinlock
	dequeue_task(X)
	  X->on_rq = 0

					  if (p->on_rq)

					  smp_rmb();

					  if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

					  smp_cond_load_acquire(&p->on_cpu, !VAL)

					  cpu = select_task_rq(X, X->wake_cpu, ...)
					  if (X->cpu != cpu)
	switch_to(Y)
	  X->on_cpu = 0
	UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible
on x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu
observes ->state != RUNNING, it must also observe ->cpu != 1.

(Most of the previous ttwu() races were found on very large PowerPC)

Nevertheless, this fully explains the observed failure case.

Fix it by ordering the task_cpu(p) load after the p->on_cpu load,
which is easy since nothing actually uses @cpu before this.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200622125649.GC576871@hirez.programming.kicks-ass.net
2020-06-28 17:01:20 +02:00
Juri Lelli
740797ce3a sched/core: Fix PI boosting between RT and DEADLINE tasks
syzbot reported the following warning:

 WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
 enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504

At deadline.c:628 we have:

 623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
 624 {
 625 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
 626 	struct rq *rq = rq_of_dl_rq(dl_rq);
 627
 628 	WARN_ON(dl_se->dl_boosted);
 629 	WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
        [...]
     }

Which means that setup_new_dl_entity() has been called on a task
currently boosted. This shouldn't happen though, as setup_new_dl_entity()
is only called when the 'dynamic' deadline of the new entity
is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this
condition.

Digging through the PI code I noticed that what above might in fact happen
if an RT tasks blocks on an rt_mutex hold by a DEADLINE task. In the
first branch of boosting conditions we check only if a pi_task 'dynamic'
deadline is earlier than mutex holder's and in this case we set mutex
holder to be dl_boosted. However, since RT 'dynamic' deadlines are only
initialized if such tasks get boosted at some point (or if they become
DEADLINE of course), in general RT 'dynamic' deadlines are usually equal
to 0 and this verifies the aforementioned condition.

Fix it by checking that the potential donor task is actually (even if
temporary because in turn boosted) running at DEADLINE priority before
using its 'dynamic' deadline value.

Fixes: 2d3d891d33 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
2020-06-28 17:01:20 +02:00
Juri Lelli
ce9bc3b27f sched/deadline: Initialize ->dl_boosted
syzbot reported the following warning triggered via SYSC_sched_setattr():

  WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 setup_new_dl_entity /kernel/sched/deadline.c:594 [inline]
  WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_dl_entity /kernel/sched/deadline.c:1370 [inline]
  WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_task_dl+0x1c17/0x2ba0 /kernel/sched/deadline.c:1441

This happens because the ->dl_boosted flag is currently not initialized by
__dl_clear_params() (unlike the other flags) and setup_new_dl_entity()
rightfully complains about it.

Initialize dl_boosted to 0.

Fixes: 2d3d891d33 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+5ac8bac25f95e8b221e7@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20200617072919.818409-1-juri.lelli@redhat.com
2020-06-28 17:01:20 +02:00
Scott Wood
fd844ba9ae sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption
This function is concerned with the long-term CPU mask, not the
transitory mask the task might have while migrate disabled.  Before
this patch, if a task was migrate-disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the CPU that the task was running on, then the mask update
would be lost.

Signed-off-by: Scott Wood <swood@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmpin7@linutronix.de
2020-06-28 17:01:20 +02:00
Linus Torvalds
f05baa066d dma-mapping fixes for 5.8:
- fix dma coherent mmap in nommu (me)
  - more AMD SEV fallout (David Rientjes, me)
  - fix alignment in dma_common_*_remap (Eric Auger)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl72+VsLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMVaw//VgQbKUfTsuCZt+ZZqIY5nd6YajexoC+X051yC7/8
 YtdGqAa2RuutoHwUhTcqzvrSsCqthNCeeZ3yBUS/SQwyoQy3szrEwNXnRboNdwgq
 xebuTOra3MIRSWJzFHL+PNQjkaGSoQroSJHEeVZOUdYchE+sNh/pZxQoPU8ImcOe
 iVB+6nDJga+CpbKVi6oaGs8EISHtYkt1yHOeAhTxlqPkmP1tvsOZFgvMQBPCq4Rz
 QlqcVilDb0fPl2pnLy1LTbgAC8yPs7phrf9KBVUqCptfTLAv1nkwI9WpX8zFmkDo
 KapepEr9bkAHcq+gNcUOSiKr3K1bMF41numZ5zi6PnEJ/bHsPEotzwf05GrKY0Ci
 vMNpWL5QIcaMECe8Q8jrelgoDK0614vp8k7U+1CXmgpyF3lf5+zXwJyYLSgcf2PI
 2ryJnnib3jYORe80VVHc76CpX5Z5Ez6IaaDP/3rNsexLW/Ip3mhwqUDEYNCvMN+P
 qYJ8GrmqGAbMrhifvxVRL0ur73kIKE2s4l7xznd7p0Nj6ToAdMYnmrKUZEhMTPD9
 UcpzK9omgT51qAsByEggT97eDYzQSqYfh0OxAUJwML/8AXa7nJVdFo9ipHCVal6x
 tEuWpAMBe9YRBDaPUgu3vf8VNagv7YCzJmLnPFS7KvYJ0siw5r6ZxdXfkE2cG9o2
 DyI=
 =qAJQ
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.8-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix dma coherent mmap in nommu (me)

 - more AMD SEV fallout (David Rientjes, me)

 - fix alignment in dma_common_*_remap (Eric Auger)

* tag 'dma-mapping-5.8-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-remap: align the size in dma_common_*_remap()
  dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATOR
  dma-direct: add missing set_memory_decrypted() for coherent mapping
  dma-direct: check return value when encrypting or decrypting memory
  dma-direct: re-encrypt memory if dma_direct_alloc_pages() fails
  dma-direct: always align allocation size in dma_direct_alloc_pages()
  dma-direct: mark __dma_direct_alloc_pages static
  dma-direct: re-enable mmap for !CONFIG_MMU
2020-06-27 13:06:22 -07:00
Linus Torvalds
6116dea80d kgdb patches for 5.8-rc3
The main change here is a fix for a number of unsafe interactions
 between kdb and the console system. The fixes are specific to kdb (pure
 kgdb debugging does not use the console system at all). On systems with
 an NMI then kdb, if it is enabled, must get messages to the user despite
 potentially running from some "difficult" calling contexts. These fixes
 avoid using the console system where we have been provided an
 alternative (safer) way to interact with the user and, if using the
 console system in unavoidable, use oops_in_progress for deadlock
 avoidance. These fixes also ensure kdb honours the console enable flag.
 
 Also included is a fix that wraps kgdb trap handling in an RCU read lock
 to avoids triggering diagnostic warnings. This is a wide lock scope but
 this is OK because kgdb is a stop-the-world debugger. When we stop the
 world we put all the CPUs into holding pens and this inhibits RCU update
 anyway.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl72C5AACgkQfOMlXTn3
 iKFfmA//SCJU7zJsrKTsr6+HJY+gIuwHm70aGCNIr3EjBgTZQHQYflG6msmMHTAX
 d4qnGSkfKzC8jYJrHPpX4eU3bnqYci6GnaT/N5p9YkTGHun+kYYTz3wLzZiWxKRg
 iE4QLEwjU/dGAYyRz0CKCTRNTLTG+R79HWLL2Wi5OQiNhYiPuFAgS/NSUjpnJIuf
 fmj8jSPP/7T/m0cEUWXbLwTfolEZLIa1heqtaJq4fAftPsAk5a5TZ0NugaxUPoo4
 YS06eASIZoVcDQiehVy+gH05FyEjJGXnkFtTkAoRL/yOERKLy0WMzFZAAh6NT4St
 16Hx3Nnw+7ds7Iq8jEIpM/XJo1d3haYvAQdzy6HakAOwp7vrD/CjF45wwju78woY
 Jq54Vjvaxjaw1vlJCVrAAjdj3bAHdufBeWrBGmYO8F1HSn9eNeLS7wWbq6lEhxNd
 ObXRUFwebzYpOT6DI2TdnDg/2+xAn2oXpzk4UK9I/Vbxew8R4lOPQm4vC0V3CTME
 cHXFGV3ncjXlVRKdMAmnYcN7pMY4NCdX5vGqC/djQRwKRV1Ve8jwUCFVKRAd4zio
 wHpCFziwSaz9giZJ5I831EKsvSj9DVoPPJFgoEXIzIWF3OS0qzP6UqO2HwJNbA+e
 W4laVRzdBcMuVVa+7XWYzdAhof0hNX0Ov78dyDMcX1MkOS02O7o=
 =ovnT
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb fixes from Daniel Thompson:
 "The main change here is a fix for a number of unsafe interactions
  between kdb and the console system. The fixes are specific to kdb
  (pure kgdb debugging does not use the console system at all). On
  systems with an NMI then kdb, if it is enabled, must get messages to
  the user despite potentially running from some "difficult" calling
  contexts. These fixes avoid using the console system where we have
  been provided an alternative (safer) way to interact with the user
  and, if using the console system in unavoidable, use oops_in_progress
  for deadlock avoidance. These fixes also ensure kdb honours the
  console enable flag.

  Also included is a fix that wraps kgdb trap handling in an RCU read
  lock to avoids triggering diagnostic warnings. This is a wide lock
  scope but this is OK because kgdb is a stop-the-world debugger. When
  we stop the world we put all the CPUs into holding pens and this
  inhibits RCU update anyway"

* tag 'kgdb-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb: Avoid suspicious RCU usage warning
  kdb: Switch to use safer dbg_io_ops over console APIs
  kdb: Make kdb_printf() console handling more robust
  kdb: Check status of console prior to invoking handlers
  kdb: Re-factor kdb_printf() message write code
2020-06-27 08:53:49 -07:00
Linus Torvalds
ed3e00e7d6 Power management fixes for 5.8-rc3
- Make sure that the _TIF_POLLING_NRFLAG is clear before entering
    the last phase of suspend-to-idle to avoid wakeup issues on some
    x86 systems (Chen Yu, Rafael Wysocki).
 
  - Cover one more case in which the intel_pstate driver should let
    the platform firmware control the CPU frequency and refuse to
    load (Srinivas Pandruvada).
 
  - Add __init annotations to 2 functions in the power management
    core (Christophe JAILLET).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl72FGESHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxt44QAKk+ojYFCLVHz+mwniaNBRROegZrDPFS
 u9QiH4Qth5QdG6TFXJF2nhcZmkvyh/W3AXFj8px1XWanBfG8nhPb0c+wRHGIbACV
 /R7ykcQuj/h7ZlCjqevFAaUbCSw4Lt5oldIk+YyVUGrZH+uulyNOABm+cxFT6XaD
 KKnJ4hbbgnriQaMxmee+jYNphDsaTBhWWCkGj/V5x4DEyvWihkjf9skDEJ4/aP4O
 09Ug8FB1P0AxMTdaoKNfrIae27oPAb74HQOe92UbOMA/Q/k1snRr7vxGVak2/VyH
 /KHxCElM+F5F3kM6LMp4C9jtVMrcJq+1ABUmh0z2jBAw5MxGcdvjKihw6+W8Z9gC
 j7r4gDvIkXDGHlLSaWQC8otARs8gMmGdZJu+Tl8yzo7ZK7InFMAsNMBH/iOPzycQ
 9UMOpIkbDrOP2zeukULAFIuh1ow/dQopnY0Hyi0Gfezb1lTXe3rFzjPAnTRantyo
 z+c+o01pkbiiP+fBG3aUYjzkUv3wpm3cyBAlzr17wIUs2cfIo7WGMQuGQ8ZqoL0T
 QMpY5OxtzJJlvTvn9UF3oxlYbNFGq68fNnrdQ4u1zHm+K1P2mh2Dko6viHrceOyA
 DOMHuUP1Yrc7sneqGYkNbfhAfjdGeS4FTDJOpx9fgasTG1IYmk+hLOuKTi8Vttjj
 Fx6nxbo1sECm
 =Reud
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a recent regression that broke suspend-to-idle on some x86
  systems, fix the intel_pstate driver to correctly let the platform
  firmware control CPU performance in some cases and add __init
  annotations to a couple of functions.

  Specifics:

   - Make sure that the _TIF_POLLING_NRFLAG is clear before entering the
     last phase of suspend-to-idle to avoid wakeup issues on some x86
     systems (Chen Yu, Rafael Wysocki).

   - Cover one more case in which the intel_pstate driver should let the
     platform firmware control the CPU frequency and refuse to load
     (Srinivas Pandruvada).

   - Add __init annotations to 2 functions in the power management core
     (Christophe JAILLET)"

* tag 'pm-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle: Rearrange s2idle-specific idle state entry code
  PM: sleep: core: mark 2 functions as __init to save some memory
  cpufreq: intel_pstate: Add one more OOB control bit
  PM: s2idle: Clear _TIF_POLLING_NRFLAG before suspend to idle
2020-06-26 12:32:11 -07:00
Linus Torvalds
7c902e2730 Merge branch 'akpm' (patches from Andrew)
Merge misx fixes from Andrew Morton:
 "31 patches.

  Subsystems affected by this patch series: hotfixes, mm/pagealloc,
  kexec, ocfs2, lib, mm/slab, mm/slab, mm/slub, mm/swap, mm/pagemap,
  mm/vmalloc, mm/memcg, mm/gup, mm/thp, mm/vmscan, x86,
  mm/memory-hotplug, MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
  MAINTAINERS: update info for sparse
  mm/memory_hotplug.c: fix false softlockup during pfn range removal
  mm: remove vmalloc_exec
  arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page
  x86/hyperv: allocate the hypercall page with only read and execute bits
  mm/memory: fix IO cost for anonymous page
  mm/swap: fix for "mm: workingset: age nonresident information alongside anonymous pages"
  mm: workingset: age nonresident information alongside anonymous pages
  doc: THP CoW fault no longer allocate THP
  docs: mm/gup: minor documentation update
  mm/memcontrol.c: prevent missed memory.low load tears
  mm/memcontrol.c: add missed css_put()
  mm: memcontrol: handle div0 crash race condition in memory.low
  mm/vmalloc.c: fix a warning while make xmldocs
  media: omap3isp: remove cacheflush.h
  make asm-generic/cacheflush.h more standalone
  mm/debug_vm_pgtable: fix build failure with powerpc 8xx
  mm/memory.c: properly pte_offset_map_lock/unlock in vm_insert_pages()
  mm: fix swap cache node allocation mask
  slub: cure list_slab_objects() from double fix
  ...
2020-06-26 12:19:36 -07:00
Mauro Carvalho Chehab
985098a05e docs: fix references for DMA*.txt files
As we moved those files to core-api, fix references to point
to their newer locations.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/37b2fd159fbc7655dbf33b3eb1215396a25f6344.1592895969.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-06-26 10:01:32 -06:00
Douglas Anderson
440ab9e10e kgdb: Avoid suspicious RCU usage warning
At times when I'm using kgdb I see a splat on my console about
suspicious RCU usage.  I managed to come up with a case that could
reproduce this that looked like this:

  WARNING: suspicious RCU usage
  5.7.0-rc4+ #609 Not tainted
  -----------------------------
  kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!

  other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
  3 locks held by swapper/0/1:
   #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
   #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
   #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac

  stack backtrace:
  CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
  Hardware name: Google Cheza (rev3+) (DT)
  Call trace:
   dump_backtrace+0x0/0x1b8
   show_stack+0x1c/0x24
   dump_stack+0xd4/0x134
   lockdep_rcu_suspicious+0xf0/0x100
   find_task_by_pid_ns+0x5c/0x80
   getthread+0x8c/0xb0
   gdb_serial_stub+0x9d4/0xd04
   kgdb_cpu_enter+0x284/0x7ac
   kgdb_handle_exception+0x174/0x20c
   kgdb_brk_fn+0x24/0x30
   call_break_hook+0x6c/0x7c
   brk_handler+0x20/0x5c
   do_debug_exception+0x1c8/0x22c
   el1_sync_handler+0x3c/0xe4
   el1_sync+0x7c/0x100
   rpmh_rsc_probe+0x38/0x420
   platform_drv_probe+0x94/0xb4
   really_probe+0x134/0x300
   driver_probe_device+0x68/0x100
   __device_attach_driver+0x90/0xa8
   bus_for_each_drv+0x84/0xcc
   __device_attach+0xb4/0x13c
   device_initial_probe+0x18/0x20
   bus_probe_device+0x38/0x98
   device_add+0x38c/0x420

If I understand properly we should just be able to blanket kgdb under
one big RCU read lock and the problem should go away.  We'll add it to
the beast-of-a-function known as kgdb_cpu_enter().

With this I no longer get any splats and things seem to work fine.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200602154729.v2.1.I70e0d4fd46d5ed2aaf0c98a355e8e1b7a5bb7e4e@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-26 15:41:40 +01:00
Sumit Garg
5946d1f5b3 kdb: Switch to use safer dbg_io_ops over console APIs
In kgdb context, calling console handlers aren't safe due to locks used
in those handlers which could in turn lead to a deadlock. Although, using
oops_in_progress increases the chance to bypass locks in most console
handlers but it might not be sufficient enough in case a console uses
more locks (VT/TTY is good example).

Currently when a driver provides both polling I/O and a console then kdb
will output using the console. We can increase robustness by using the
currently active polling I/O driver (which should be lockless) instead
of the corresponding console. For several common cases (e.g. an
embedded system with a single serial port that is used both for console
output and debugger I/O) this will result in no console handler being
used.

In order to achieve this we need to reverse the order of preference to
use dbg_io_ops (uses polling I/O mode) over console APIs. So we just
store "struct console" that represents debugger I/O in dbg_io_ops and
while emitting kdb messages, skip console that matches dbg_io_ops
console in order to avoid duplicate messages. After this change,
"is_console" param becomes redundant and hence removed.

Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/1591264879-25920-5-git-send-email-sumit.garg@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-26 15:40:16 +01:00
Christoph Hellwig
7a0e27b2a0 mm: remove vmalloc_exec
Merge vmalloc_exec into its only caller.  Note that for !CONFIG_MMU
__vmalloc_node_range maps to __vmalloc, which directly clears the
__GFP_HIGHMEM added by the vmalloc_exec stub anyway.

Link: http://lkml.kernel.org/r/20200618064307.32739-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:38 -07:00
Lianbo Jiang
fd7af71be5 kexec: do not verify the signature without the lockdown or mandatory signature
Signature verification is an important security feature, to protect
system from being attacked with a kernel of unknown origin.  Kexec
rebooting is a way to replace the running kernel, hence need be secured
carefully.

In the current code of handling signature verification of kexec kernel,
the logic is very twisted.  It mixes signature verification, IMA
signature appraising and kexec lockdown.

If there is no KEXEC_SIG_FORCE, kexec kernel image doesn't have one of
signature, the supported crypto, and key, we don't think this is wrong,
Unless kexec lockdown is executed.  IMA is considered as another kind of
signature appraising method.

If kexec kernel image has signature/crypto/key, it has to go through the
signature verification and pass.  Otherwise it's seen as verification
failure, and won't be loaded.

Seems kexec kernel image with an unqualified signature is even worse
than those w/o signature at all, this sounds very unreasonable.  E.g.
If people get a unsigned kernel to load, or a kernel signed with expired
key, which one is more dangerous?

So, here, let's simplify the logic to improve code readability.  If the
KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature
verification is mandated.  Otherwise, we lift the bar for any kernel
image.

Link: http://lkml.kernel.org/r/20200602045952.27487-1-lijiang@redhat.com
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Reviewed-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Matthew Garrett <mjg59@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:36 -07:00
Jan Kara
f3bdc62fd8 blktrace: Provide event for request merging
Currently blk-mq does not report any event when two requests get merged
in the elevator. This then results in difficult to understand sequence
of events like:

...
  8,0   34     1579     0.608765271  2718  I  WS 215023504 + 40 [dbench]
  8,0   34     1584     0.609184613  2719  A  WS 215023544 + 56 <- (8,4) 2160568
  8,0   34     1585     0.609184850  2719  Q  WS 215023544 + 56 [dbench]
  8,0   34     1586     0.609188524  2719  G  WS 215023544 + 56 [dbench]
  8,0    3      602     0.609684162   773  D  WS 215023504 + 96 [kworker/3:1H]
  8,0   34     1591     0.609843593     0  C  WS 215023504 + 96 [0]

and you can only guess (after quite some headscratching since the above
excerpt is intermixed with a lot of other IO) that request 215023544+56
got merged to request 215023504+40. Provide proper event for request
merging like we used to do in the legacy block layer.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 21:06:11 -06:00
Linus Torvalds
4a21185cda Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Don't insert ESP trailer twice in IPSEC code, from Huy Nguyen.

 2) The default crypto algorithm selection in Kconfig for IPSEC is out
    of touch with modern reality, fix this up. From Eric Biggers.

 3) bpftool is missing an entry for BPF_MAP_TYPE_RINGBUF, from Andrii
    Nakryiko.

 4) Missing init of ->frame_sz in xdp_convert_zc_to_xdp_frame(), from
    Hangbin Liu.

 5) Adjust packet alignment handling in ax88179_178a driver to match
    what the hardware actually does. From Jeremy Kerr.

 6) register_netdevice can leak in the case one of the notifiers fail,
    from Yang Yingliang.

 7) Use after free in ip_tunnel_lookup(), from Taehee Yoo.

 8) VLAN checks in sja1105 DSA driver need adjustments, from Vladimir
    Oltean.

 9) tg3 driver can sleep forever when we get enough EEH errors, fix from
    David Christensen.

10) Missing {READ,WRITE}_ONCE() annotations in various Intel ethernet
    drivers, from Ciara Loftus.

11) Fix scanning loop break condition in of_mdiobus_register(), from
    Florian Fainelli.

12) MTU limit is incorrect in ibmveth driver, from Thomas Falcon.

13) Endianness fix in mlxsw, from Ido Schimmel.

14) Use after free in smsc95xx usbnet driver, from Tuomas Tynkkynen.

15) Missing bridge mrp configuration validation, from Horatiu Vultur.

16) Fix circular netns references in wireguard, from Jason A. Donenfeld.

17) PTP initialization on recovery is not done properly in qed driver,
    from Alexander Lobakin.

18) Endian conversion of L4 ports in filters of cxgb4 driver is wrong,
    from Rahul Lakkireddy.

19) Don't clear bound device TX queue of socket prematurely otherwise we
    get problems with ktls hw offloading, from Tariq Toukan.

20) ipset can do atomics on unaligned memory, fix from Russell King.

21) Align ethernet addresses properly in bridging code, from Thomas
    Martitz.

22) Don't advertise ipv4 addresses on SCTP sockets having ipv6only set,
    from Marcelo Ricardo Leitner.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (149 commits)
  rds: transport module should be auto loaded when transport is set
  sch_cake: fix a few style nits
  sch_cake: don't call diffserv parsing code when it is not needed
  sch_cake: don't try to reallocate or unshare skb unconditionally
  ethtool: fix error handling in linkstate_prepare_data()
  wil6210: account for napi_gro_receive never returning GRO_DROP
  hns: do not cast return value of napi_gro_receive to null
  socionext: account for napi_gro_receive never returning GRO_DROP
  wireguard: receive: account for napi_gro_receive never returning GRO_DROP
  vxlan: fix last fdb index during dump of fdb with nhid
  sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket
  tc-testing: avoid action cookies with odd length.
  bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  net: dsa: sja1105: fix tc-gate schedule with single element
  net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules
  net: dsa: sja1105: unconditionally free old gating config
  net: dsa: sja1105: move sja1105_compose_gating_subschedule at the top
  net: macb: free resources on failure path of at91ether_open()
  net: macb: call pm_runtime_put_sync on failure path
  ...
2020-06-25 18:27:40 -07:00
Linus Torvalds
42e9c85f5c tracing: Four small fixes
- Fixed a ringbuffer bug for nested events having time go backwards
  - Fix a config dependency for boot time tracing to depend on synthetic
    events instead of histograms.
  - Fix trigger format parsing to handle multiple spaces
  - Fix bootconfig to handle failures in multiple events
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXvUjBBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmLiAQD47/1T01ilYeXqJ+EG235aeQssvRa7
 RSmIAoMP+V6kHQD9G2RjnWkb3BcrdNk9zoi0LpnuMl95m5OuaMzE4PPO+ws=
 =Zbx8
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Four small fixes:

   - Fix a ringbuffer bug for nested events having time go backwards

   - Fix a config dependency for boot time tracing to depend on
     synthetic events instead of histograms.

   - Fix trigger format parsing to handle multiple spaces

   - Fix bootconfig to handle failures in multiple events"

* tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/boottime: Fix kprobe multiple events
  tracing: Fix event trigger to accept redundant spaces
  tracing/boot: Fix config dependency for synthedic event
  ring-buffer: Zero out time extend if it is nested and not absolute
2020-06-25 16:16:49 -07:00
Weilong Chen
c17d1a3a8e fork: annotate data race in copy_process()
KCSAN reported data race reading and writing nr_threads and max_threads.
The data race is intentional and benign. This is obvious from the comment
above it and based on general consensus when discussing this issue. So
there's no need for any heavy atomic or *_ONCE() machinery here.

In accordance with the newly introduced data_race() annotation consensus,
mark the offending line with data_race(). Here it's actually useful not
just to silence KCSAN but to also clearly communicate that the race is
intentional. This is especially helpful since nr_threads is otherwise
protected by tasklist_lock.

BUG: KCSAN: data-race in copy_process / copy_process

write to 0xffffffff86205cf8 of 4 bytes by task 14779 on cpu 1:
  copy_process+0x2eba/0x3c40 kernel/fork.c:2273
  _do_fork+0xfe/0x7a0 kernel/fork.c:2421
  __do_sys_clone kernel/fork.c:2576 [inline]
  __se_sys_clone kernel/fork.c:2557 [inline]
  __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
  do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

read to 0xffffffff86205cf8 of 4 bytes by task 6944 on cpu 0:
  copy_process+0x94d/0x3c40 kernel/fork.c:1954
  _do_fork+0xfe/0x7a0 kernel/fork.c:2421
  __do_sys_clone kernel/fork.c:2576 [inline]
  __se_sys_clone kernel/fork.c:2557 [inline]
  __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
  do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Link: https://groups.google.com/forum/#!msg/syzkaller-upstream-mo
deration/thvp7AHs5Ew/aPdYLXfYBQAJ

Reported-by: syzbot+52fced2d288f8ecd2b20@syzkaller.appspotmail.com
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Marco Elver <elver@google.com>
[christian.brauner@ubuntu.com: rewrite commit message]
Link: https://lore.kernel.org/r/20200623041240.154294-1-chenweilong@huawei.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-26 01:05:29 +02:00
Peter Zijlstra
b58e733fd7 rcu: Fixup noinstr warnings
A KCSAN build revealed we have explicit annoations through atomic_*()
usage, switch to arch_atomic_*() for the respective functions.

vmlinux.o: warning: objtool: rcu_nmi_exit()+0x4d: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x25: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x4f: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x2a: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: __rcu_is_watching()+0x25: call to __kcsan_check_access() leaves .noinstr.text section

Additionally, without the NOP in instrumentation_begin(), objtool would
not detect the lack of the 'else instrumentation_begin();' branch in
rcu_nmi_enter().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-25 08:24:32 -07:00
John Fastabend
a9b59159d3 bpf: Do not allow btf_ctx_access with __int128 types
To ensure btf_ctx_access() is safe the verifier checks that the BTF
arg type is an int, enum, or pointer. When the function does the
BTF arg lookup it uses the calculation 'arg = off / 8'  using the
fact that registers are 8B. This requires that the first arg is
in the first reg, the second in the second, and so on. However,
for __int128 the arg will consume two registers by default LLVM
implementation. So this will cause the arg layout assumed by the
'arg = off / 8' calculation to be incorrect.

Because __int128 is uncommon this patch applies the easiest fix and
will force int types to be sizeof(u64) or smaller so that they will
fit in a single register.

v2: remove unneeded parens per Andrii's feedback

Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/159303723962.11287.13309537171132420717.stgit@john-Precision-5820-Tower
2020-06-25 16:17:05 +02:00
Andy Shevchenko
504603767a console: Fix trivia typo 'change' -> 'chance'
I bet the word 'chance' has to be used in 'had a chance to be called',
but, alas, I'm not native speaker...

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200618164751.56828-7-andriy.shevchenko@linux.intel.com
2020-06-25 14:24:54 +02:00
Andy Shevchenko
bba18a1af3 console: Propagate error code from console ->setup()
Since console ->setup() hook returns meaningful error codes,
propagate it to the caller of try_enable_new_console().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200618164751.56828-6-andriy.shevchenko@linux.intel.com
2020-06-25 14:24:07 +02:00
Rafael J. Wysocki
10e8b11eb3 cpuidle: Rearrange s2idle-specific idle state entry code
Implement call_cpuidle_s2idle() in analogy with call_cpuidle()
for the s2idle-specific idle state entry and invoke it from
cpuidle_idle_call() to make the s2idle-specific idle entry code
path look more similar to the "regular" idle entry one.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Chen Yu <yu.c.chen@intel.com>
2020-06-25 13:52:53 +02:00
Peng Wang
423d02e146 sched/fair: Optimize dequeue_task_fair()
While looking at enqueue_task_fair and dequeue_task_fair, it occurred
to me that dequeue_task_fair can also be optimized as Vincent described
in commit 7d148be69e ("sched/fair: Optimize enqueue_task_fair()").

When encountering throttled cfs_rq, dequeue_throttle label can ensure
se not to be NULL, and rq->nr_running remains unchanged, so we can also
skip the early balance check.

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/701eef9a40de93dcf5fe7063fd607bca5db38e05.1592287263.git.rocking@linux.alibaba.com
2020-06-25 13:45:44 +02:00
Kirill Tkhai
aa93cd53bc sched: Micro optimization in pick_next_task() and in check_preempt_curr()
This introduces an optimization based on xxx_sched_class addresses
in two hot scheduler functions: pick_next_task() and check_preempt_curr().

It is possible to compare pointers to sched classes to check, which
of them has a higher priority, instead of current iterations using
for_each_class().

One more result of the patch is that size of object file becomes a little
less (excluding added BUG_ON(), which goes in __init section):

$size kernel/sched/core.o
         text     data      bss	    dec	    hex	filename
before:  66446    18957	    676	  86079	  1503f	kernel/sched/core.o
after:   66398    18957	    676	  86031	  1500f	kernel/sched/core.o

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/711a9c4b-ff32-1136-b848-17c622d548f3@yandex.ru
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware)
a87e749e8f sched: Remove struct sched_class::next field
Now that the sched_class descriptors are defined in order via the linker
script vmlinux.lds.h, there's no reason to have a "next" pointer to the
previous priroity structure. The order of the sturctures can be aligned as
an array, and used to index and find the next sched_class descriptor.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191219214558.845353593@goodmis.org
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware)
c3a340f7e7 sched: Have sched_class_highest define by vmlinux.lds.h
Now that the sched_class descriptors are defined by the linker script, and
this needs to be aware of the existance of stop_sched_class when SMP is
enabled or not, as it is used as the "highest" priority when defined. Move
the declaration of sched_class_highest to the same location in the linker
script that inserts stop_sched_class, and this will also make it easier to
see what should be defined as the highest class, as this linker script
location defines the priorities as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191219214558.682913590@goodmis.org
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware)
590d697963 sched: Force the address order of each sched class descriptor
In order to make a micro optimization in pick_next_task(), the order of the
sched class descriptor address must be in the same order as their priority
to each other. That is:

 &idle_sched_class < &fair_sched_class < &rt_sched_class <
 &dl_sched_class < &stop_sched_class

In order to guarantee this order of the sched class descriptors, add each
one into their own data section and force the order in the linker script.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/157675913272.349305.8936736338884044103.stgit@localhost.localdomain
2020-06-25 13:45:43 +02:00
Sumit Garg
2a78b85b70 kdb: Make kdb_printf() console handling more robust
While rounding up CPUs via NMIs, its possible that a rounded up CPU
maybe holding a console port lock leading to kgdb master CPU stuck in
a deadlock during invocation of console write operations. A similar
deadlock could also be possible while using synchronous breakpoints.

So in order to avoid such a deadlock, set oops_in_progress to encourage
the console drivers to disregard their internal spin locks: in the
current calling context the risk of deadlock is a bigger problem than
risks due to re-entering the console driver. We operate directly on
oops_in_progress rather than using bust_spinlocks() because the calls
bust_spinlocks() makes on exit are not appropriate for this calling
context.

Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-4-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:30 +01:00
Sumit Garg
e8857288bb kdb: Check status of console prior to invoking handlers
Check if a console is enabled prior to invoking corresponding write
handler.

Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-3-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:29 +01:00
Sumit Garg
9d71b344f8 kdb: Re-factor kdb_printf() message write code
Re-factor kdb_printf() message write code in order to avoid duplication
of code and thereby increase readability.

Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-2-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:29 +01:00
Yonghong Song
0d4fad3e57 bpf: Add bpf_skc_to_udp6_sock() helper
The helper is used in tracing programs to cast a socket
pointer to a udp6_sock pointer.
The return value could be NULL if the casting is illegal.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200623230815.3988481-1-yhs@fb.com
2020-06-24 18:37:59 -07:00
Yonghong Song
478cfbdf5f bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers
Three more helpers are added to cast a sock_common pointer to
an tcp_sock, tcp_timewait_sock or a tcp_request_sock for
tracing programs.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230811.3988277-1-yhs@fb.com
2020-06-24 18:37:59 -07:00
Yonghong Song
af7ec13833 bpf: Add bpf_skc_to_tcp6_sock() helper
The helper is used in tracing programs to cast a socket
pointer to a tcp6_sock pointer.
The return value could be NULL if the casting is illegal.

A new helper return type RET_PTR_TO_BTF_ID_OR_NULL is added
so the verifier is able to deduce proper return types for the helper.

Different from the previous BTF_ID based helpers,
the bpf_skc_to_tcp6_sock() argument can be several possible
btf_ids. More specifically, all possible socket data structures
with sock_common appearing in the first in the memory layout.
This patch only added socket types related to tcp and udp.

All possible argument btf_id and return value btf_id
for helper bpf_skc_to_tcp6_sock() are pre-calculcated and
cached. In the future, it is even possible to precompute
these btf_id's at kernel build time.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230809.3988195-1-yhs@fb.com
2020-06-24 18:37:59 -07:00
Yonghong Song
72e2b2b66f bpf: Allow tracing programs to use bpf_jiffies64() helper
/proc/net/tcp{4,6} uses jiffies for various computations.
Let us add bpf_jiffies64() helper to tracing program
so bpf_iter and other programs can use it.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230808.3988073-1-yhs@fb.com
2020-06-24 18:37:58 -07:00
Yonghong Song
c06b022957 bpf: Support 'X' in bpf_seq_printf() helper
'X' tells kernel to print hex with upper case letters.
/proc/net/tcp{4,6} seq_file show() used this, and
supports it in bpf_seq_printf() helper too.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230807.3988014-1-yhs@fb.com
2020-06-24 18:37:58 -07:00
Linus Torvalds
fbb58011fd for-linus-2020-06-24
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXvNBJgAKCRCRxhvAZXjc
 oulGAPoCPfCguA8TPcy4tq4byGPoThyO4XnWR6XcUDOEzhbzzAEA+s5S7iRV8W92
 p2gzbI4Kncq4dQNEtUvfPHQZDAEwTA0=
 =eZDz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fix from Christian Brauner:
 "This fixes a regression introduced with 303cc571d1 ("nsproxy: attach
  to namespaces via pidfds").

  The LTP testsuite reported a regression where users would now see
  EBADF returned instead of EINVAL when an fd was passed that referred
  to an open file but the file was not a namespace file.

  Fix this by continuing to report EINVAL and add a regression test"

* tag 'for-linus-2020-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: test for setns() EINVAL regression
  nsproxy: restore EINVAL for non-namespace file descriptor
2020-06-24 14:19:45 -07:00
Lukasz Luba
f0b5694791 PM / EM: change name of em_pd_energy to em_cpu_energy
Energy Model framework now supports other devices than CPUs. Refactor some
of the functions in order to prevent wrong usage. The old function
em_pd_energy has to generic name. It must not be used without proper
cpumask pointer, which is possible only for CPU devices. Thus, rename it
and add proper description to warn of potential wrong usage for other
devices.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:16:42 +02:00
Lukasz Luba
07891f15d9 PM / EM: remove em_register_perf_domain
Remove old function em_register_perf_domain which is no longer needed.
There is em_dev_register_perf_domain that covers old use cases and new as
well.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:16:42 +02:00
Lukasz Luba
1bc138c622 PM / EM: add support for other devices than CPUs in Energy Model
Add support for other devices than CPUs. The registration function
does not require a valid cpumask pointer and is ready to handle new
devices. Some of the internal structures has been reorganized in order to
keep consistent view (like removing per_cpu pd pointers).

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:16:27 +02:00
Luis Chamberlain
85e0cbbb8a block: create the request_queue debugfs_dir on registration
We were only creating the request_queue debugfs_dir only
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files on that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
on blktrace. Other than this being an eye-sore, this exposes
request-based block drivers to the same debugfs fragile
race that used to exist with make_request block drivers
where if we start adding files onto that directory we can later
run a race with a double removal of dentries on the directory
if we don't deal with this carefully on blktrace.

Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Rename the mutex also to
reflect the fact that this is used outside of the blktrace context.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain
b431ef837e blktrace: ensure our debugfs dir exists
We make an assumption that a debugfs directory exists, but since
this can fail ensure it exists before allowing blktrace setup to
complete. Otherwise we end up stuffing blktrace files on the debugfs
root directory. In the worst case scenario this *in theory* can create
an eventual panic *iff* in the future a similarly named file is created
prior on the debugfs root directory. This theoretical crash can happen
due to a recursive removal followed by a specific dentry removal.

This doesn't fix any known crash, however I have seen the files
go into the main debugfs root directory in cases where the debugfs
directory was not created due to other internal bugs with blktrace
now fixed.

blktrace is also completely useless without this directory, so
this ensures to userspace we only setup blktrace if the kernel
can stuff files where they are supposed to go into.

debugfs directory creations typically aren't checked for, and we have
maintainers doing sweep removals of these checks, but since we need this
check to ensure proper userspace blktrace functionality we make sure
to annotate the justification for the check.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain
bad8e64fb1 blktrace: fix debugfs use after free
On commit 6ac93117ab ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace while racing to remove / add a device.

We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce, its however just a bit more difficult.

We essentially run two instances of break-blktrace which add/remove
a loop device, and setup a blktrace and just never tear the blktrace
down. We do this twice in parallel. This is easily reproduced with the
script run_0004.sh from break-blktrace [0].

We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:

[  252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
[  252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  252.436592] #PF: supervisor write access in kernel mode
[  252.439822] #PF: error_code(0x0002) - not-present page
[  252.442967] PGD 0 P4D 0
[  252.444656] Oops: 0002 [#1] SMP NOPTI
[  252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  252.456343] RIP: 0010:down_write+0x15/0x40
[  252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
[  252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
[  252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
[  252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
[  252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
[  252.473346] FS:  00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
[  252.475225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
[  252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  252.479866] Call Trace:
[  252.480322]  simple_recursive_removal+0x4e/0x2e0
[  252.481078]  ? debugfs_remove+0x60/0x60
[  252.481725]  ? relay_destroy_buf+0x77/0xb0
[  252.482662]  debugfs_remove+0x40/0x60
[  252.483518]  blk_remove_buf_file_callback+0x5/0x10
[  252.484328]  relay_close_buf+0x2e/0x60
[  252.484930]  relay_open+0x1ce/0x2c0
[  252.485520]  do_blk_trace_setup+0x14f/0x2b0
[  252.486187]  __blk_trace_setup+0x54/0xb0
[  252.486803]  blk_trace_ioctl+0x90/0x140
[  252.487423]  ? do_sys_openat2+0x1ab/0x2d0
[  252.488053]  blkdev_ioctl+0x4d/0x260
[  252.488636]  block_ioctl+0x39/0x40
[  252.489139]  ksys_ioctl+0x87/0xc0
[  252.489675]  __x64_sys_ioctl+0x16/0x20
[  252.490380]  do_syscall_64+0x52/0x180
[  252.491032]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And the other on the device removal:

[  128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
[  128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  128.619537] #PF: supervisor write access in kernel mode
[  128.622700] #PF: error_code(0x0002) - not-present page
[  128.625842] PGD 0 P4D 0
[  128.627585] Oops: 0002 [#1] SMP NOPTI
[  128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  128.640471] RIP: 0010:down_write+0x15/0x40
[  128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
[  128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
[  128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
[  128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
[  128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
[  128.660500] FS:  00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
[  128.662204] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
[  128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  128.667282] Call Trace:
[  128.667801]  simple_recursive_removal+0x4e/0x2e0
[  128.668663]  ? debugfs_remove+0x60/0x60
[  128.669368]  debugfs_remove+0x40/0x60
[  128.669985]  blk_trace_free+0xd/0x50
[  128.670593]  __blk_trace_remove+0x27/0x40
[  128.671274]  blk_trace_shutdown+0x30/0x40
[  128.671935]  blk_release_queue+0x95/0xf0
[  128.672589]  kobject_put+0xa5/0x1b0
[  128.673188]  disk_release+0xa2/0xc0
[  128.673786]  device_release+0x28/0x80
[  128.674376]  kobject_put+0xa5/0x1b0
[  128.674915]  loop_remove+0x39/0x50 [loop]
[  128.675511]  loop_control_ioctl+0x113/0x130 [loop]
[  128.676199]  ksys_ioctl+0x87/0xc0
[  128.676708]  __x64_sys_ioctl+0x16/0x20
[  128.677274]  do_syscall_64+0x52/0x180
[  128.677823]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The common theme here is:

debugfs: Directory 'loop0' with parent 'block' already present

This crash happens because of how blktrace uses the debugfs directory
where it places its files. Upon init we always create the same directory
which would be needed by blktrace but we only do this for make_request
drivers (multiqueue) block drivers. When you race a removal of these
devices with a blktrace setup you end up in a situation where the
make_request recursive debugfs removal will sweep away the blktrace
files and then later blktrace will also try to remove individual
dentries which are already NULL. The inverse is also possible and hence
the two types of use after frees.

We don't create the block debugfs directory on init for these types of
block devices:

  * request-based block driver block devices
  * every possible partition
  * scsi-generic

And so, this race should in theory only be possible with make_request
drivers.

We can fix the UAF by simply re-using the debugfs directory for
make_request drivers (multiqueue) and only creating the ephemeral
directory for the other type of block devices. The new clarifications
on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace
*prior* to processing a blktrace ensures the debugfs directories are
only created if no possible directory name clashes are possible.

This goes tested with:

  o nvme partitions
  o ISCSI with tgt, and blktracing against scsi-generic with:
    o block
    o tape
    o cdrom
    o media changer
  o blktests

This patch is part of the work which disputes the severity of
CVE-2019-19770 which shows this issue is not a core debugfs issue, but
a misuse of debugfs within blktace.

Fixes: 6ac93117ab ("blktrace: use existing disk debugfs directory")
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain
a67549c8e5 blktrace: annotate required lock on do_blk_trace_setup()
Ensure it is clear which lock is required on do_blk_trace_setup().

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Lukasz Luba
d0351cc3b0 PM / EM: update callback structure and add device pointer
The Energy Model framework is going to support devices other that CPUs. In
order to make this happen change the callback function and add pointer to
a device as an argument.

Update the related users to use new function and new callback from the
Energy Model.

Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:14:07 +02:00
Lukasz Luba
7d9895c7fb PM / EM: introduce em_dev_register_perf_domain function
Add now function in the Energy Model framework which is going to support
new devices. This function will help in transition and make it smoother.
For now it still checks if the cpumask is a valid pointer, which will be
removed later when the new structures and infrastructure will be ready.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:14:07 +02:00
Lukasz Luba
521b512b15 PM / EM: change naming convention from 'capacity' to 'performance'
The Energy Model uses concept of performance domain and capacity states in
order to calculate power used by CPUs. Change naming convention from
capacity to performance state would enable wider usage in future, e.g.
upcoming support for other devices other than CPUs.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:14:07 +02:00
Alexander Popov
feee1b8c49 gcc-plugins/stackleak: Use asm instrumentation to avoid useless register saving
The kernel code instrumentation in stackleak gcc plugin works in two stages.
At first, stack tracking is added to GIMPLE representation of every function
(except some special cases). And later, when stack frame size info is
available, stack tracking is removed from the RTL representation of the
functions with small stack frame. There is an unwanted side-effect for these
functions: some of them do useless work with caller-saved registers.

As an example of such case, proc_sys_write without() instrumentation:
    55                      push   %rbp
    41 b8 01 00 00 00       mov    $0x1,%r8d
    48 89 e5                mov    %rsp,%rbp
    e8 11 ff ff ff          callq  ffffffff81284610 <proc_sys_call_handler>
    5d                      pop    %rbp
    c3                      retq
    0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    00 00 00

proc_sys_write() with instrumentation:
    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    41 56                   push   %r14
    41 55                   push   %r13
    41 54                   push   %r12
    53                      push   %rbx
    49 89 f4                mov    %rsi,%r12
    48 89 fb                mov    %rdi,%rbx
    49 89 d5                mov    %rdx,%r13
    49 89 ce                mov    %rcx,%r14
    4c 89 f1                mov    %r14,%rcx
    4c 89 ea                mov    %r13,%rdx
    4c 89 e6                mov    %r12,%rsi
    48 89 df                mov    %rbx,%rdi
    41 b8 01 00 00 00       mov    $0x1,%r8d
    e8 f2 fe ff ff          callq  ffffffff81298e80 <proc_sys_call_handler>
    5b                      pop    %rbx
    41 5c                   pop    %r12
    41 5d                   pop    %r13
    41 5e                   pop    %r14
    5d                      pop    %rbp
    c3                      retq
    66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
    00 00

Let's improve the instrumentation to avoid this:

1. Make stackleak_track_stack() save all register that it works with.
Use no_caller_saved_registers attribute for that function. This attribute
is available for x86_64 and i386 starting from gcc-7.

2. Insert calling stackleak_track_stack() in asm:
  asm volatile("call stackleak_track_stack" :: "r" (current_stack_pointer))
Here we use ASM_CALL_CONSTRAINT trick from arch/x86/include/asm/asm.h.
The input constraint is taken into account during gcc shrink-wrapping
optimization. It is needed to be sure that stackleak_track_stack() call is
inserted after the prologue of the containing function, when the stack
frame is prepared.

This work is a deep reengineering of the idea described on grsecurity blog
  https://grsecurity.net/resolving_an_unfortunate_stackleak_interaction

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Link: https://lore.kernel.org/r/20200624123330.83226-5-alex.popov@linux.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-24 07:48:28 -07:00
Alexander Popov
005e696df6 gcc-plugins/stackleak: Don't instrument itself
There is no need to try instrumenting functions in kernel/stackleak.c.
Otherwise that can cause issues if the cleanup pass of stackleak gcc plugin
is disabled.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Link: https://lore.kernel.org/r/20200624123330.83226-2-alex.popov@linux.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-24 07:48:13 -07:00
Sascha Ortmann
20dc3847cc tracing/boottime: Fix kprobe multiple events
Fix boottime kprobe events to report and abort after each failure when
adding probes.

As an example, when we try to set multiprobe kprobe events in
bootconfig like this:

ftrace.event.kprobes.vfsevents {
        probes = "vfs_read $arg1 $arg2,,
                 !error! not reported;?", // leads to error
                 "vfs_write $arg1 $arg2"
}

This will not work as expected. After
commit da0f1f4167 ("tracing/boottime: Fix kprobe event API usage"),
the function trace_boot_add_kprobe_event will not produce any error
message when adding a probe fails at kprobe_event_gen_cmd_start.
Furthermore, we continue to add probes when kprobe_event_gen_cmd_end fails
(and kprobe_event_gen_cmd_start did not fail). In this case the function
even returns successfully when the last call to kprobe_event_gen_cmd_end
is successful.

The behaviour of reporting and aborting after failures is not
consistent.

The function trace_boot_add_kprobe_event now reports each failure and
stops adding probes immediately.

Link: https://lkml.kernel.org/r/20200618163301.25854-1-sascha.ortmann@stud.uni-hannover.de

Cc: stable@vger.kernel.org
Cc: linux-kernel@i4.cs.fau.de
Co-developed-by: Maximilian Werner <maximilian.werner96@gmail.com>
Fixes: da0f1f4167 ("tracing/boottime: Fix kprobe event API usage")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Maximilian Werner <maximilian.werner96@gmail.com>
Signed-off-by: Sascha Ortmann <sascha.ortmann@stud.uni-hannover.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 21:51:50 -04:00
Masami Hiramatsu
6784beada6 tracing: Fix event trigger to accept redundant spaces
Fix the event trigger to accept redundant spaces in
the trigger input.

For example, these return -EINVAL

echo " traceon" > events/ftrace/print/trigger
echo "traceon  if common_pid == 0" > events/ftrace/print/trigger
echo "disable_event:kmem:kmalloc " > events/ftrace/print/trigger

But these are hard to find what is wrong.

To fix this issue, use skip_spaces() to remove spaces
in front of actual tokens, and set NULL if there is no
token.

Link: http://lkml.kernel.org/r/159262476352.185015.5261566783045364186.stgit@devnote2

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 85f2b08268 ("tracing: Add basic event trigger framework")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 21:51:40 -04:00
Masami Hiramatsu
6c95503c29 tracing/boot: Fix config dependency for synthedic event
Since commit 726721a518 ("tracing: Move synthetic events to
a separate file") decoupled synthetic event from histogram,
boot-time tracing also has to check CONFIG_SYNTH_EVENT instead
of CONFIG_HIST_TRIGGERS.

Link: http://lkml.kernel.org/r/159262475441.185015.5300725180746017555.stgit@devnote2

Fixes: 726721a518 ("tracing: Move synthetic events to a separate file")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 21:51:22 -04:00
Maciej Żenczykowski
b338cb921e bpf: Restore behaviour of CAP_SYS_ADMIN allowing the loading of networking bpf programs
This is a fix for a regression in commit 2c78ee898d ("bpf: Implement CAP_BPF").
Before the above commit it was possible to load network bpf programs
with just the CAP_SYS_ADMIN privilege.

The Android bpfloader happens to run in such a configuration (it has
SYS_ADMIN but not NET_ADMIN) and creates maps and loads bpf programs
for later use by Android's netd (which has NET_ADMIN but not SYS_ADMIN).

Fixes: 2c78ee898d ("bpf: Implement CAP_BPF")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Link: https://lore.kernel.org/bpf/20200620212616.93894-1-zenczykowski@gmail.com
2020-06-23 17:45:42 -07:00
Yonghong Song
c4c0bdc0d2 bpf: Set the number of exception entries properly for subprograms
Currently, if a bpf program has more than one subprograms, each program will be
jitted separately. For programs with bpf-to-bpf calls the
prog->aux->num_exentries is not setup properly. For example, with
bpf_iter_netlink.c modified to force one function to be not inlined and with
CONFIG_BPF_JIT_ALWAYS_ON the following error is seen:
   $ ./test_progs -n 3/3
   ...
   libbpf: failed to load program 'iter/netlink'
   libbpf: failed to load object 'bpf_iter_netlink'
   libbpf: failed to load BPF skeleton 'bpf_iter_netlink': -4007
   test_netlink:FAIL:bpf_iter_netlink__open_and_load skeleton open_and_load failed
   #3/3 netlink:FAIL
The dmesg shows the following errors:
   ex gen bug
which is triggered by the following code in arch/x86/net/bpf_jit_comp.c:
   if (excnt >= bpf_prog->aux->num_exentries) {
     pr_err("ex gen bug\n");
     return -EFAULT;
   }

This patch fixes the issue by computing proper num_exentries for each
subprogram before calling JIT.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-23 17:27:37 -07:00
Richard Guy Briggs
8e6cf365e1 audit: log nftables configuration change events
iptables, ip6tables, arptables and ebtables table registration,
replacement and unregistration configuration events are logged for the
native (legacy) iptables setsockopt api, but not for the
nftables netlink api which is used by the nft-variant of iptables in
addition to nftables itself.

Add calls to log the configuration actions in the nftables netlink api.

This uses the same NETFILTER_CFG record format but overloads the table
field.

  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.878:162) : table=?:0;?:0 family=unspecified entries=2 op=nft_register_gen pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.878:162) : table=firewalld:1;?:0 family=inet entries=0 op=nft_register_table pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;filter_FORWARD:85 family=inet entries=8 op=nft_register_chain pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;filter_FORWARD:85 family=inet entries=101 op=nft_register_rule pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;__set0:87 family=inet entries=87 op=nft_register_setelem pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;__set0:87 family=inet entries=0 op=nft_register_set pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld

For further information please see issue
https://github.com/linux-audit/audit-kernel/issues/124

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-06-23 20:25:16 -04:00
Alexei Starovoitov
4e608675e7 Merge up to bpf_probe_read_kernel_str() fix into bpf-next 2020-06-23 15:33:41 -07:00
Steven Rostedt (VMware)
097350d1c6 ring-buffer: Zero out time extend if it is nested and not absolute
Currently the ring buffer makes events that happen in interrupts that preempt
another event have a delta of zero. (Hopefully we can change this soon). But
this is to deal with the races of updating a global counter with lockless
and nesting functions updating deltas.

With the addition of absolute time stamps, the time extend didn't follow
this rule. A time extend can happen if two events happen longer than 2^27
nanoseconds appart, as the delta time field in each event is only 27 bits.
If that happens, then a time extend is injected with 2^59 bits of
nanoseconds to use (18 years). But if the 2^27 nanoseconds happen between
two events, and as it is writing the event, an interrupt triggers, it will
see the 2^27 difference as well and inject a time extend of its own. But a
recent change made the time extend logic not take into account the nesting,
and this can cause two time extend deltas to happen moving the time stamp
much further ahead than the current time. This gets all reset when the ring
buffer moves to the next page, but that can cause time to appear to go
backwards.

This was observed in a trace-cmd recording, and since the data is saved in a
file, with trace-cmd report --debug, it was possible to see that this indeed
did happen!

  bash-52501   110d... 81778.908247: sched_switch:         bash:52501 [120] S ==> swapper/110:0 [120] [12770284:0x2e8:64]
  <idle>-0     110d... 81778.908757: sched_switch:         swapper/110:0 [120] R ==> bash:52501 [120] [509947:0x32c:64]
 TIME EXTEND: delta:306454770 length:0
  bash-52501   110.... 81779.215212: sched_swap_numa:      src_pid=52501 src_tgid=52388 src_ngid=52501 src_cpu=110 src_nid=2 dst_pid=52509 dst_tgid=52388 dst_ngid=52501 dst_cpu=49 dst_nid=1 [0:0x378:48]
 TIME EXTEND: delta:306458165 length:0
  bash-52501   110dNh. 81779.521670: sched_wakeup:         migration/110:565 [0] success=1 CPU:110 [0:0x3b4:40]

and at the next page, caused the time to go backwards:

  bash-52504   110d... 81779.685411: sched_switch:         bash:52504 [120] S ==> swapper/110:0 [120] [8347057:0xfb4:64]
CPU:110 [SUBBUFFER START] [81779379165886:0x1320000]
  <idle>-0     110dN.. 81779.379166: sched_wakeup:         bash:52504 [120] success=1 CPU:110 [0:0x10:40]
  <idle>-0     110d... 81779.379167: sched_switch:         swapper/110:0 [120] R ==> bash:52504 [120] [1168:0x3c:64]

Link: https://lkml.kernel.org/r/20200622151815.345d1bf5@oasis.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: dc4e2801d4 ("ring-buffer: Redefine the unimplemented RINGBUF_TYPE_TIME_STAMP")
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 11:18:42 -04:00
Eric Auger
8e36baf97b dma-remap: align the size in dma_common_*_remap()
Running a guest with a virtio-iommu protecting virtio devices
is broken since commit 515e5b6d90 ("dma-mapping: use vmap insted
of reimplementing it"). Before the conversion, the size was
page aligned in __get_vm_area_node(). Doing so fixes the
regression.

Fixes: 515e5b6d90 ("dma-mapping: use vmap insted of reimplementing it")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-23 14:14:41 +02:00
Christoph Hellwig
d07ae4c486 dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATOR
The dma coherent pool code needs genalloc.  Move the select over
from DMA_REMAP, which doesn't actually need it.

Fixes: dbed452a07 ("dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Rientjes <rientjes@google.com>
2020-06-23 14:13:58 +02:00
David Rientjes
1a2b3357e8 dma-direct: add missing set_memory_decrypted() for coherent mapping
When a coherent mapping is created in dma_direct_alloc_pages(), it needs
to be decrypted if the device requires unencrypted DMA before returning.

Fixes: 3acac06550 ("dma-mapping: merge the generic remapping helpers into dma-direct")
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-23 14:13:54 +02:00
Andrey Ignatov
2872e9ac33 bpf: Set map_btf_{name, id} for all map types
Set map_btf_name and map_btf_id for all map types so that map fields can
be accessed by bpf programs.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
2020-06-22 22:22:58 +02:00
Andrey Ignatov
41c48f3a98 bpf: Support access to bpf map fields
There are multiple use-cases when it's convenient to have access to bpf
map fields, both `struct bpf_map` and map type specific struct-s such as
`struct bpf_array`, `struct bpf_htab`, etc.

For example while working with sock arrays it can be necessary to
calculate the key based on map->max_entries (some_hash % max_entries).
Currently this is solved by communicating max_entries via "out-of-band"
channel, e.g. via additional map with known key to get info about target
map. That works, but is not very convenient and error-prone while
working with many maps.

In other cases necessary data is dynamic (i.e. unknown at loading time)
and it's impossible to get it at all. For example while working with a
hash table it can be convenient to know how much capacity is already
used (bpf_htab.count.counter for BPF_F_NO_PREALLOC case).

At the same time kernel knows this info and can provide it to bpf
program.

Fill this gap by adding support to access bpf map fields from bpf
program for both `struct bpf_map` and map type specific fields.

Support is implemented via btf_struct_access() so that a user can define
their own `struct bpf_map` or map type specific struct in their program
with only necessary fields and preserve_access_index attribute, cast a
map to this struct and use a field.

For example:

	struct bpf_map {
		__u32 max_entries;
	} __attribute__((preserve_access_index));

	struct bpf_array {
		struct bpf_map map;
		__u32 elem_size;
	} __attribute__((preserve_access_index));

	struct {
		__uint(type, BPF_MAP_TYPE_ARRAY);
		__uint(max_entries, 4);
		__type(key, __u32);
		__type(value, __u32);
	} m_array SEC(".maps");

	SEC("cgroup_skb/egress")
	int cg_skb(void *ctx)
	{
		struct bpf_array *array = (struct bpf_array *)&m_array;
		struct bpf_map *map = (struct bpf_map *)&m_array;

		/* .. use map->max_entries or array->map.max_entries .. */
	}

Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock
in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of
corresponding struct. Only reading from map fields is supported.

For btf_struct_access() to work there should be a way to know btf id of
a struct that corresponds to a map type. To get btf id there should be a
way to get a stringified name of map-specific struct, such as
"bpf_array", "bpf_htab", etc for a map type. Two new fields are added to
`struct bpf_map_ops` to handle it:
* .map_btf_name keeps a btf name of a struct returned by map_alloc();
* .map_btf_id is used to cache btf id of that struct.

To make btf ids calculation cheaper they're calculated once while
preparing btf_vmlinux and cached same way as it's done for btf_id field
of `struct bpf_func_proto`

While calculating btf ids, struct names are NOT checked for collision.
Collisions will be checked as a part of the work to prepare btf ids used
in verifier in compile time that should land soon. The only known
collision for `struct bpf_htab` (kernel/bpf/hashtab.c vs
net/core/sock_map.c) was fixed earlier.

Both new fields .map_btf_name and .map_btf_id must be set for a map type
for the feature to work. If neither is set for a map type, verifier will
return ENOTSUPP on a try to access map_ptr of corresponding type. If
just one of them set, it's verifier misconfiguration.

Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for
BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be
supported separately.

The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by
perfmon_capable() so that unpriv programs won't have access to bpf map
fields.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com
2020-06-22 22:22:58 +02:00
Andrey Ignatov
a2d0d62f4d bpf: Switch btf_parse_vmlinux to btf_find_by_name_kind
btf_parse_vmlinux() implements manual search for struct bpf_ctx_convert
since at the time of implementing btf_find_by_name_kind() was not
available.

Later btf_find_by_name_kind() was introduced in 27ae7997a6 ("bpf:
Introduce BPF_PROG_TYPE_STRUCT_OPS"). It provides similar search
functionality and can be leveraged in btf_parse_vmlinux(). Do it.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/6e12d5c3e8a3d552925913ef73a695dd1bb27800.1592600985.git.rdna@fb.com
2020-06-22 22:22:58 +02:00
Christian Brauner
3af8588c77
fork: fold legacy_clone_args_valid() into _do_fork()
This separate helper only existed to guarantee the mutual exclusivity of
CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD
abuses the parent_tid field to return the pidfd. But we can actually handle
this uniformely thus removing the helper. For legacy clone we can detect
that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID
because they will share the same memory which is invalid and for clone3()
setting the separate pidfd and parent_tid fields to the same memory is
bogus as well. So fold that helper directly into _do_fork() by detecting
this case.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: linux-m68k@lists.linux-m68k.org
Cc: x86@kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-22 14:38:38 +02:00
Jason A. Donenfeld
625d344978 Revert "kernel/printk: add kmsg SEEK_CUR handling"
This reverts commit 8ece3b3eb5.

This commit broke userspace. Bash uses ESPIPE to determine whether or
not the file should be read using "unbuffered I/O", which means reading
1 byte at a time instead of 128 bytes at a time. I used to use bash to
read through kmsg in a really quite nasty way:

    while read -t 0.1 -r line 2>/dev/null || [[ $? -ne 142 ]]; do
       echo "SARU $line"
    done < /dev/kmsg

This will show all lines that can fit into the 128 byte buffer, and skip
lines that don't. That's pretty awful, but at least it worked.

With this change, bash now tries to do 1-byte reads, which means it
skips all the lines, which is worse than before.

Now, I don't really care very much about this, and I'm already look for
a workaround. But I did just spend an hour trying to figure out why my
scripts were broken. Either way, it makes no difference to me personally
whether this is reverted, but it might be something to consider. If you
declare that "trying to read /dev/kmsg with bash is terminally stupid
anyway," I might be inclined to agree with you. But do note that bash
uses lseek(fd, 0, SEEK_CUR)==>ESPIPE to determine whether or not it's
reading from a pipe.

Cc: Bruno Meneguele <bmeneg@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-21 20:47:20 -07:00
Linus Torvalds
8b6ddd10d6 A few fixes and small cleanups for tracing:
- Have recordmcount work with > 64K sections (to support LTO)
  - kprobe RCU fixes
  - Correct a kprobe critical section with missing mutex
  - Remove redundant arch_disarm_kprobe() call
  - Fix lockup when kretprobe triggers within kprobe_flush_task()
  - Fix memory leak in fetch_op_data operations
  - Fix sleep in atomic in ftrace trace array sample code
  - Free up memory on failure in sample trace array code
  - Fix incorrect reporting of function_graph fields in format file
  - Fix quote within quote parsing in bootconfig
  - Fix return value of bootconfig tool
  - Add testcases for bootconfig tool
  - Fix maybe uninitialized warning in ftrace pid file code
  - Remove unused variable in tracing_iter_reset()
  - Fix some typos
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXu1jrRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qoCMAP91nOccE3X+Nvc3zET3isDWnl1tWJxk
 icsBgN/JwBRuTAD/dnWTHIWM2/5lTiagvyVsmINdJHP6JLr8T7dpN9tlxAQ=
 =Cuo7
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Have recordmcount work with > 64K sections (to support LTO)

 - kprobe RCU fixes

 - Correct a kprobe critical section with missing mutex

 - Remove redundant arch_disarm_kprobe() call

 - Fix lockup when kretprobe triggers within kprobe_flush_task()

 - Fix memory leak in fetch_op_data operations

 - Fix sleep in atomic in ftrace trace array sample code

 - Free up memory on failure in sample trace array code

 - Fix incorrect reporting of function_graph fields in format file

 - Fix quote within quote parsing in bootconfig

 - Fix return value of bootconfig tool

 - Add testcases for bootconfig tool

 - Fix maybe uninitialized warning in ftrace pid file code

 - Remove unused variable in tracing_iter_reset()

 - Fix some typos

* tag 'trace-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix maybe-uninitialized compiler warning
  tools/bootconfig: Add testcase for show-command and quotes test
  tools/bootconfig: Fix to return 0 if succeeded to show the bootconfig
  tools/bootconfig: Fix to use correct quotes for value
  proc/bootconfig: Fix to use correct quotes for value
  tracing: Remove unused event variable in tracing_iter_reset
  tracing/probe: Fix memleak in fetch_op_data operations
  trace: Fix typo in allocate_ftrace_ops()'s comment
  tracing: Make ftrace packed events have align of 1
  sample-trace-array: Remove trace_array 'sample-instance'
  sample-trace-array: Fix sleeping function called from invalid context
  kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
  kprobes: Remove redundant arch_disarm_kprobe() call
  kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
  kprobes: Use non RCU traversal APIs on kprobe_tables if possible
  kprobes: Suppress the suspicious RCU warning on kprobes
  recordmcount: support >64k sections
2020-06-20 13:17:47 -07:00
Yonghong Song
6c6935419e bpf: Avoid verifier failure for 32bit pointer arithmetic
When do experiments with llvm (disabling instcombine and
simplifyCFG), I hit the following error with test_seg6_loop.o.

  ; R1=pkt(id=0,off=0,r=48,imm=0), R7=pkt(id=0,off=40,r=48,imm=0)
  w2 = w7
  ; R2_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  w2 -= w1
  R2 32-bit pointer arithmetic prohibited

The corresponding source code is:
  uint32_t srh_off
  // srh and skb->data are all packet pointers
  srh_off = (char *)srh - (char *)(long)skb->data;

The verifier does not support 32-bit pointer/scalar arithmetic.

Without my llvm change, the code looks like

  ; R3=pkt(id=0,off=40,r=48,imm=0), R8=pkt(id=0,off=0,r=48,imm=0)
  w3 -= w8
  ; R3_w=inv(id=0)

This is explicitly allowed in verifier if both registers are
pointers and the opcode is BPF_SUB.

To fix this problem, I changed the verifier to allow
32-bit pointer/scaler BPF_SUB operations.

At the source level, the issue could be workarounded with
inline asm or changing "uint32_t srh_off" to "uint64_t srh_off".
But I feel that verifier change might be the right thing to do.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200618234631.3321118-1-yhs@fb.com
2020-06-19 23:34:42 +02:00
Linus Torvalds
d2b1c81f5f block-5.8-2020-06-19
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7s0SAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp+YEACVqFvsfzxKCqa61IzyuOaPfnj9awyP+MY2
 7V6y9sDDHL8sp6aPDbHvqFnqz0O7E+7nHVZD2rf2qc6tKKMvJYNO/BFZSXPvWTZV
 KQ4cBChf/LDwqAKOnI4ZhmF5UcSyyob1yMy4uJ+U0gQiXXrRMbwJ3N1K24a9dr4c
 epkzGavR0Q+PJ9BbUgjACjbRdT+vrP4bOu0cuyCGkIpD9eCerKJ6mFaUAj0FDthD
 bg4BJj+c8Ij6LO0V++Wga6OxccmL43KeP0ky8B3x07PfAl+tDWqsbHSlU2YPtdcq
 5nKgMMTW16mVnZeO2/W0JB7tn89VubsmyvIFcm2KNeeRqSnEZyW9HI8n4kq994Ju
 xMH24lgbsU4trNeYkgOmzPoJJZ+LShkn+rnldyI1U/fhpEYub7DqfVySuT7ti9in
 uFpQdeRUmPsdw92F3+o6h8OYAflpcQQ7CblkzxPEeV4OyzOZasb+S9tMNPe59KBh
 0MtHv9IfzgtDihR6HuXifitXaP+GtH4x3D2z0dzEdooHKHC/+P3WycS5daG+3WKQ
 xV5lJruvpTuxhXKLFAH0wRrxnVlB0VUvhQ21T3WgHrwF0btbdmQMHFc83XOxBIB4
 jHWJMHGc4xp1ZdpWFBC8Cj79OmJh1w/ao8+/cf8SUoTB0LzFce1B8LvwnxgpcpUk
 VjIOrl7zhQ==
 =LeLd
 -----END PGP SIGNATURE-----

Merge tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Use import_uuid() where appropriate (Andy)

 - bcache fixes (Coly, Mauricio, Zhiqiang)

 - blktrace sparse warnings fix (Jan)

 - blktrace concurrent setup fix (Luis)

 - blkdev_get use-after-free fix (Jason)

 - Ensure all blk-mq maps are updated (Weiping)

 - Loop invalidate bdev fix (Zheng)

* tag 'block-5.8-2020-06-19' of git://git.kernel.dk/linux-block:
  block: make function 'kill_bdev' static
  loop: replace kill_bdev with invalidate_bdev
  partitions/ldm: Replace uuid_copy() with import_uuid() where it makes sense
  block: update hctx map when use multiple maps
  blktrace: Avoid sparse warnings when assigning q->blk_trace
  blktrace: break out of blktrace setup on concurrent calls
  block: Fix use-after-free in blkdev_get()
  trace/events/block.h: drop kernel-doc for dropped function parameter
  blk-mq: Remove redundant 'return' statement
  bcache: pr_info() format clean up in bcache_device_init()
  bcache: use delayed kworker fo asynchronous devices registration
  bcache: check and adjust logical block size for backing devices
  bcache: fix potential deadlock problem in btree_gc_coalesce
2020-06-19 13:11:26 -07:00
Linus Torvalds
5e857ce6ea Merge branch 'hch' (maccess patches from Christoph Hellwig)
Merge non-faulting memory access cleanups from Christoph Hellwig:
 "Andrew and I decided to drop the patches implementing your suggested
  rename of the probe_kernel_* and probe_user_* helpers from -mm as
  there were way to many conflicts.

  After -rc1 might be a good time for this as all the conflicts are
  resolved now"

This also adds a type safety checking patch on top of the renaming
series to make the subtle behavioral difference between 'get_user()' and
'get_kernel_nofault()' less potentially dangerous and surprising.

* emailed patches from Christoph Hellwig <hch@lst.de>:
  maccess: make get_kernel_nofault() check for minimal type compatibility
  maccess: rename probe_kernel_address to get_kernel_nofault
  maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault
  maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
2020-06-18 12:35:51 -07:00
Daniel Jordan
e04ec0de61 padata: upgrade smp_mb__after_atomic to smp_mb in padata_do_serial
A 5.7 kernel hangs during a tcrypt test of padata that waits for an AEAD
request to finish.  This is only seen on large machines running many
concurrent requests.

The issue is that padata never serializes the request.  The removal of
the reorder_objects atomic missed that the memory barrier in
padata_do_serial() depends on it.

Upgrade the barrier from smp_mb__after_atomic to smp_mb to get correct
ordering again.

Fixes: 3facced7ae ("padata: remove reorder_objects")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-06-18 17:09:54 +10:00
Kaitao Cheng
026bb845b0 ftrace: Fix maybe-uninitialized compiler warning
During build compiler reports some 'false positive' warnings about
variables {'seq_ops', 'filtered_pids', 'other_pids'} may be used
uninitialized. This patch silences these warnings.
Also delete some useless spaces

Link: https://lkml.kernel.org/r/20200529141214.37648-1-pilgrimtao@gmail.com

Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-17 17:13:18 -04:00
Gustavo A. R. Silva
bbccc11bc8 audit: Use struct_size() helper in alloc_chunk
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct audit_chunk {
	...
        struct node {
                struct list_head list;
                struct audit_tree *owner;
                unsigned index;         /* index; upper bit indicates 'will prune' */
        } owners[];
};

Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.

So, replace the following form:

offsetof(struct audit_chunk, owners) + count * sizeof(struct node);

with:

struct_size(chunk, owners, count)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-06-17 16:43:11 -04:00
David S. Miller
b9d37bbb55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-06-17

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 2 day(s) which contain
a total of 14 files changed, 158 insertions(+), 59 deletions(-).

The main changes are:

1) Important fix for bpf_probe_read_kernel_str() return value, from Andrii.

2) [gs]etsockopt fix for large optlen, from Stanislav.

3) devmap allocation fix, from Toke.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-17 13:26:55 -07:00
Linus Torvalds
1b50440210 dma-mapping fixes for 5.8
- fixes for the SEV atomic pool (Geert Uytterhoeven and David Rientjes)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl7pxQwLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYN7kg/9F/S+fE587iJhn+6LbJHyNiAWtASi+cogBoO+Qdx/
 GDby8YP+s2+6hxIskavfdCSzY6W3UbCslgwxeZNKD60semb8BacAtf55N2PEK0/H
 UE8J9d/EInbe1ehrJvd7xCcuWxkesHGa/zcFbMBPvXxphX6SNfgPV0fdgg3UuWM9
 ZxRPENA14hXOF2ehKnbQNHcgodOm6SGM22PEk3g8GqASc2zCL+E+CVcRVFgoYmzw
 BgRgv5CaaHCHGvmWWtN4wWNmOm5YCqYmMuEAElbuWhY6VhcPVirMVtLju+3RfSmY
 1QEQQ0jQRtlQe9SuqhUPiegtReFMIvwC0Aoip7FaCSVMMean6uSzk3ubapkigCza
 r5dwG6RiLzVpRJyoYbDhCHh7gOUsXTMXqUzy33Jr5bTbGSJcelbycehL7gP9Qzag
 fFQ9Yep+BLDYESf7H5KzhDv9siZqGX2kXj3Z/gJGGMjkCUeAKNviRsi9t/IhQuzt
 cVAJCcU9vLJk2MJRuQ0P/7lCDvUIR4yGalN9Jl9J1ZTsVR7go330RVvonPhlTcXX
 9HrroqzSkqnLfaUFB3ml3LHj4SqygfUGtjbJ6qxkXxrChMfySe/VWbBrmOVUlta3
 SfKfXRcEYeHlgS7TtkxHkstaXfVA+fKN7V//PoycP9+rvoX74e2h++ujUY9kf8xJ
 QPQ=
 =uKpw
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.8-3' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Fixes for the SEV atomic pool (Geert Uytterhoeven and David Rientjes)"

* tag 'dma-mapping-5.8-3' of git://git.infradead.org/users/hch/dma-mapping:
  dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
  dma-pool: fix too large DMA pools on medium memory size systems
2020-06-17 11:29:37 -07:00
Christoph Hellwig
c0ee37e85e maccess: rename probe_user_{read,write} to copy_{from,to}_user_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
Christoph Hellwig
fe557319aa maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault
Better describe what these functions do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-17 10:57:41 -07:00
Stanislav Fomichev
d8fe449a9c bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE
Attaching to these hooks can break iptables because its optval is
usually quite big, or at least bigger than the current PAGE_SIZE limit.
David also mentioned some SCTP options can be big (around 256k).

For such optvals we expose only the first PAGE_SIZE bytes to
the BPF program. BPF program has two options:
1. Set ctx->optlen to 0 to indicate that the BPF's optval
   should be ignored and the kernel should use original userspace
   value.
2. Set ctx->optlen to something that's smaller than the PAGE_SIZE.

v5:
* use ctx->optlen == 0 with trimmed buffer (Alexei Starovoitov)
* update the docs accordingly

v4:
* use temporary buffer to avoid optval == optval_end == NULL;
  this removes the corner case in the verifier that might assume
  non-zero PTR_TO_PACKET/PTR_TO_PACKET_END.

v3:
* don't increase the limit, bypass the argument

v2:
* proper comments formatting (Jakub Kicinski)

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Link: https://lore.kernel.org/bpf/20200617010416.93086-1-sdf@google.com
2020-06-17 10:54:05 -07:00
Toke Høiland-Jørgensen
99c51064fb devmap: Use bpf_map_area_alloc() for allocating hash buckets
Syzkaller discovered that creating a hash of type devmap_hash with a large
number of entries can hit the memory allocator limit for allocating
contiguous memory regions. There's really no reason to use kmalloc_array()
directly in the devmap code, so just switch it to the existing
bpf_map_area_alloc() function that is used elsewhere.

Fixes: 6f9d451ab1 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200616142829.114173-1-toke@redhat.com
2020-06-17 10:01:19 -07:00
Andrii Nakryiko
02553b91da bpf: bpf_probe_read_kernel_str() has to return amount of data read on success
During recent refactorings, bpf_probe_read_kernel_str() started returning 0 on
success, instead of amount of data successfully read. This majorly breaks
applications relying on bpf_probe_read_kernel_str() and bpf_probe_read_str()
and their results. Fix this by returning actual number of bytes read.

Fixes: 8d92db5c04 ("bpf: rework the compat kernel probe handling")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200616050432.1902042-1-andriin@fb.com
2020-06-17 17:50:02 +02:00
Jan Kara
c3dbe541ef blktrace: Avoid sparse warnings when assigning q->blk_trace
Mostly for historical reasons, q->blk_trace is assigned through xchg()
and cmpxchg() atomic operations. Although this is correct, sparse
complains about this because it violates rcu annotations since commit
c780e86dd4 ("blktrace: Protect q->blk_trace with RCU") which started
to use rcu for accessing q->blk_trace. Furthermore there's no real need
for atomic operations anymore since all changes to q->blk_trace happen
under q->blk_trace_mutex and since it also makes more sense to check if
q->blk_trace is set with the mutex held earlier.

So let's just replace xchg() with rcu_replace_pointer() and cmpxchg()
with explicit check and rcu_assign_pointer(). This makes the code more
efficient and sparse happy.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 09:07:11 -06:00
Luis Chamberlain
1b0b283648 blktrace: break out of blktrace setup on concurrent calls
We use one blktrace per request_queue, that means one per the entire
disk.  So we cannot run one blktrace on say /dev/vda and then /dev/vda1,
or just two calls on /dev/vda.

We check for concurrent setup only at the very end of the blktrace setup though.

If we try to run two concurrent blktraces on the same block device the
second one will fail, and the first one seems to go on. However when
one tries to kill the first one one will see things like this:

The kernel will show these:

```
debugfs: File 'dropped' in directory 'nvme1n1' already present!
debugfs: File 'msg' in directory 'nvme1n1' already present!
debugfs: File 'trace0' in directory 'nvme1n1' already present!
``

And userspace just sees this error message for the second call:

```
blktrace /dev/nvme1n1
BLKTRACESETUP(2) /dev/nvme1n1 failed: 5/Input/output error
```

The first userspace process #1 will also claim that the files
were taken underneath their nose as well. The files are taken
away form the first process given that when the second blktrace
fails, it will follow up with a BLKTRACESTOP and BLKTRACETEARDOWN.
This means that even if go-happy process #1 is waiting for blktrace
data, we *have* been asked to take teardown the blktrace.

This can easily be reproduced with break-blktrace [0] run_0005.sh test.

Just break out early if we know we're already going to fail, this will
prevent trying to create the files all over again, which we know still
exist.

[0] https://github.com/mcgrof/break-blktrace

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-17 09:07:11 -06:00
David Rientjes
56fccf21d1 dma-direct: check return value when encrypting or decrypting memory
__change_page_attr() can fail which will cause set_memory_encrypted() and
set_memory_decrypted() to return non-zero.

If the device requires unencrypted DMA memory and decryption fails, simply
free the memory and fail.

If attempting to re-encrypt in the failure path and that encryption fails,
there is no alternative other than to leak the memory.

Fixes: c10f07aa27 ("dma/direct: Handle force decryption for DMA coherent buffers in common code")
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-17 09:29:38 +02:00
David Rientjes
96a539fa3b dma-direct: re-encrypt memory if dma_direct_alloc_pages() fails
If arch_dma_set_uncached() fails after memory has been decrypted, it needs
to be re-encrypted before freeing.

Fixes: fa7e2247c5 ("dma-direct: make uncached_kernel_address more general")
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-17 09:29:38 +02:00
David Rientjes
633d5fce78 dma-direct: always align allocation size in dma_direct_alloc_pages()
dma_alloc_contiguous() does size >> PAGE_SHIFT and set_memory_decrypted()
works at page granularity.  It's necessary to page align the allocation
size in dma_direct_alloc_pages() for consistent behavior.

This also fixes an issue when arch_dma_prep_coherent() is called on an
unaligned allocation size for dma_alloc_need_uncached() when
CONFIG_DMA_DIRECT_REMAP is disabled but CONFIG_ARCH_HAS_DMA_SET_UNCACHED
is enabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-17 09:29:38 +02:00
Christoph Hellwig
26749b3201 dma-direct: mark __dma_direct_alloc_pages static
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-17 09:29:37 +02:00
Christoph Hellwig
1fbf57d053 dma-direct: re-enable mmap for !CONFIG_MMU
nommu configfs can trivially map the coherent allocations to user space,
as no actual page table setup is required and the kernel and the user
space programs share the same address space.

Fixes: 62fcee9a3b ("dma-mapping: remove CONFIG_ARCH_NO_COHERENT_DMA_MMAP")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: dillon min <dillon.minfei@gmail.com>
Reviewed-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: dillon min <dillon.minfei@gmail.com>
2020-06-17 09:29:31 +02:00
YangHui
69243720c0 tracing: Remove unused event variable in tracing_iter_reset
We do not use the event variable, just remove it.

Signed-off-by: YangHui <yanghui.def@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:03 -04:00
Vamshi K Sthambamkadi
3aa8fdc37d tracing/probe: Fix memleak in fetch_op_data operations
kmemleak report:
    [<57dcc2ca>] __kmalloc_track_caller+0x139/0x2b0
    [<f1c45d0f>] kstrndup+0x37/0x80
    [<f9761eb0>] parse_probe_arg.isra.7+0x3cc/0x630
    [<055bf2ba>] traceprobe_parse_probe_arg+0x2f5/0x810
    [<655a7766>] trace_kprobe_create+0x2ca/0x950
    [<4fc6a02a>] create_or_delete_trace_kprobe+0xf/0x30
    [<6d1c8a52>] trace_run_command+0x67/0x80
    [<be812cc0>] trace_parse_run_command+0xa7/0x140
    [<aecfe401>] probes_write+0x10/0x20
    [<2027641c>] __vfs_write+0x30/0x1e0
    [<6a4aeee1>] vfs_write+0x96/0x1b0
    [<3517fb7d>] ksys_write+0x53/0xc0
    [<dad91db7>] __ia32_sys_write+0x15/0x20
    [<da347f64>] do_syscall_32_irqs_on+0x3d/0x260
    [<fd0b7e7d>] do_fast_syscall_32+0x39/0xb0
    [<ea5ae810>] entry_SYSENTER_32+0xaf/0x102

Post parse_probe_arg(), the FETCH_OP_DATA operation type is overwritten
to FETCH_OP_ST_STRING, as a result memory is never freed since
traceprobe_free_probe_arg() iterates only over SYMBOL and DATA op types

Setup fetch string operation correctly after fetch_op_data operation.

Link: https://lkml.kernel.org/r/20200615143034.GA1734@cosmos

Cc: stable@vger.kernel.org
Fixes: a42e3c4de9 ("tracing/probe: Add immediate string parameter support")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:02 -04:00
Wei Yang
48a42f5d13 trace: Fix typo in allocate_ftrace_ops()'s comment
No functional change, just correct the word.

Link: https://lkml.kernel.org/r/20200610033251.31713-1-richard.weiyang@linux.alibaba.com

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:02 -04:00
Steven Rostedt (VMware)
4649079b9d tracing: Make ftrace packed events have align of 1
When using trace-cmd on 5.6-rt for the function graph tracer, the output was
corrupted. It gave output like this:

 funcgraph_entry:       func=0xffffffff depth=38982
 funcgraph_entry:       func=0x1ffffffff depth=16044
 funcgraph_exit:        func=0xffffffff overrun=0x92539aaf00000000 calltime=0x92539c9900000072 rettime=0x100000072 depth=11084
 funcgraph_exit:        func=0xffffffff overrun=0x9253946e00000000 calltime=0x92539e2100000072 rettime=0x72 depth=26033702
 funcgraph_entry:       func=0xffffffff depth=85798
 funcgraph_entry:       func=0x1ffffffff depth=12044

The reason was because the tracefs/events/ftrace/funcgraph_entry/exit format
file was incorrect. The -rt kernel adds more common fields to the trace
events. Namely, common_migrate_disable and common_preempt_lazy_count. Each
is one byte in size. This changes the alignment of the normal payload. Most
events are aligned normally, but the function and function graph events are
defined with a "PACKED" macro, that packs their payload. As the offsets
displayed in the format files are now calculated by an aligned field, the
aligned field for function and function graph events should be 1, not their
normal alignment.

With aligning of the funcgraph_entry event, the format file has:

        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;
        field:unsigned char common_migrate_disable;     offset:8;       size:1; signed:0;
        field:unsigned char common_preempt_lazy_count;  offset:9;       size:1; signed:0;

        field:unsigned long func;       offset:16;      size:8; signed:0;
        field:int depth;        offset:24;      size:4; signed:1;

But the actual alignment is:

	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;
	field:unsigned char common_migrate_disable;	offset:8;	size:1;	signed:0;
	field:unsigned char common_preempt_lazy_count;	offset:9;	size:1;	signed:0;

	field:unsigned long func;	offset:12;	size:8;	signed:0;
	field:int depth;	offset:20;	size:4;	signed:1;

Link: https://lkml.kernel.org/r/20200609220041.2a3b527f@oasis.local.home

Cc: stable@vger.kernel.org
Fixes: 04ae87a520 ("ftrace: Rework event_create_dir()")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:02 -04:00
Jiri Olsa
9b38cc704e kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
Ziqian reported lockup when adding retprobe on _raw_spin_lock_irqsave.
My test was also able to trigger lockdep output:

 ============================================
 WARNING: possible recursive locking detected
 5.6.0-rc6+ #6 Not tainted
 --------------------------------------------
 sched-messaging/2767 is trying to acquire lock:
 ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0

 but task is already holding lock:
 ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(kretprobe_table_locks[i].lock));
   lock(&(kretprobe_table_locks[i].lock));

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by sched-messaging/2767:
  #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 stack backtrace:
 CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6
 Call Trace:
  dump_stack+0x96/0xe0
  __lock_acquire.cold.57+0x173/0x2b7
  ? native_queued_spin_lock_slowpath+0x42b/0x9e0
  ? lockdep_hardirqs_on+0x590/0x590
  ? __lock_acquire+0xf63/0x4030
  lock_acquire+0x15a/0x3d0
  ? kretprobe_hash_lock+0x52/0xa0
  _raw_spin_lock_irqsave+0x36/0x70
  ? kretprobe_hash_lock+0x52/0xa0
  kretprobe_hash_lock+0x52/0xa0
  trampoline_handler+0xf8/0x940
  ? kprobe_fault_handler+0x380/0x380
  ? find_held_lock+0x3a/0x1c0
  kretprobe_trampoline+0x25/0x50
  ? lock_acquired+0x392/0xbc0
  ? _raw_spin_lock_irqsave+0x50/0x70
  ? __get_valid_kprobe+0x1f0/0x1f0
  ? _raw_spin_unlock_irqrestore+0x3b/0x40
  ? finish_task_switch+0x4b9/0x6d0
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70

The code within the kretprobe handler checks for probe reentrancy,
so we won't trigger any _raw_spin_lock_irqsave probe in there.

The problem is in outside kprobe_flush_task, where we call:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave

where _raw_spin_lock_irqsave triggers the kretprobe and installs
kretprobe_trampoline handler on _raw_spin_lock_irqsave return.

The kretprobe_trampoline handler is then executed with already
locked kretprobe_table_locks, and first thing it does is to
lock kretprobe_table_locks ;-) the whole lockup path like:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed

        ---> kretprobe_table_locks locked

        kretprobe_trampoline
          trampoline_handler
            kretprobe_hash_lock(current, &head, &flags);  <--- deadlock

Adding kprobe_busy_begin/end helpers that mark code with fake
probe installed to prevent triggering of another kprobe within
this code.

Using these helpers in kprobe_flush_task, so the probe recursion
protection check is hit and the probe is never set to prevent
above lockup.

Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2

Fixes: ef53d9c5e4 ("kprobes: improve kretprobe scalability with hashed locking")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:01 -04:00
Masami Hiramatsu
75ddf64dd2 kprobes: Remove redundant arch_disarm_kprobe() call
Fix to remove redundant arch_disarm_kprobe() call in
force_unoptimize_kprobe(). This arch_disarm_kprobe()
will be invoked if the kprobe is optimized but disabled,
but that means the kprobe (optprobe) is unused (and
unoptimized) state.

In that case, unoptimize_kprobe() puts it in freeing_list
and kprobe_optimizer (do_unoptimize_kprobes()) automatically
disarm it. Thus this arch_disarm_kprobe() is redundant.

Link: http://lkml.kernel.org/r/158927058719.27680.17183632908465341189.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:01 -04:00
Masami Hiramatsu
1a0aa991a6 kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
In kprobe_optimizer() kick_kprobe_optimizer() is called
without kprobe_mutex, but this can race with other caller
which is protected by kprobe_mutex.

To fix that, expand kprobe_mutex protected area to protect
kick_kprobe_optimizer() call.

Link: http://lkml.kernel.org/r/158927057586.27680.5036330063955940456.stgit@devnote2

Fixes: cd7ebe2298 ("kprobes: Use text_poke_smp_batch for optimizing")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ziqian SUN <zsun@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:01 -04:00
Masami Hiramatsu
7e6a71d8e6 kprobes: Use non RCU traversal APIs on kprobe_tables if possible
Current kprobes uses RCU traversal APIs on kprobe_tables
even if it is safe because kprobe_mutex is locked.

Make those traversals to non-RCU APIs where the kprobe_mutex
is locked.

Link: http://lkml.kernel.org/r/158927056452.27680.9710575332163005121.stgit@devnote2

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:01 -04:00
Masami Hiramatsu
6743ad432e kprobes: Suppress the suspicious RCU warning on kprobes
Anders reported that the lockdep warns that suspicious
RCU list usage in register_kprobe() (detected by
CONFIG_PROVE_RCU_LIST.) This is because get_kprobe()
access kprobe_table[] by hlist_for_each_entry_rcu()
without rcu_read_lock.

If we call get_kprobe() from the breakpoint handler context,
it is run with preempt disabled, so this is not a problem.
But in other cases, instead of rcu_read_lock(), we locks
kprobe_mutex so that the kprobe_table[] is not updated.
So, current code is safe, but still not good from the view
point of RCU.

Joel suggested that we can silent that warning by passing
lockdep_is_held() to the last argument of
hlist_for_each_entry_rcu().

Add lockdep_is_held(&kprobe_mutex) at the end of the
hlist_for_each_entry_rcu() to suppress the warning.

Link: http://lkml.kernel.org/r/158927055350.27680.10261450713467997503.stgit@devnote2

Reported-by: Anders Roxell <anders.roxell@linaro.org>
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-16 21:21:01 -04:00
Christian Brauner
e571d4ee33
nsproxy: restore EINVAL for non-namespace file descriptor
The LTP testsuite reported a regression where users would now see EBADF
returned instead of EINVAL when an fd was passed that referred to an open
file but the file was not a nsfd. Fix this by continuing to report EINVAL.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
Link: https://lore.kernel.org/lkml/20200615085836.GR12456@shao2-debian
Fixes: 303cc571d1 ("nsproxy: attach to namespaces via pidfds")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-17 00:33:12 +02:00
Christian Brauner
60997c3d45
close_range: add CLOSE_RANGE_UNSHARE
One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:

unshare(CLONE_FILES);
close_range(3, ~0U);

as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.

The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-17 00:07:38 +02:00
Gustavo A. R. Silva
7fac96f2be tracing/probe: Replace zero-length array with flexible-array
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://github.com/KSPP/linux/issues/21

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15 23:08:32 -05:00
Peter Zijlstra
8b700983de sched: Remove sched_set_*() return value
Ingo suggested that since the new sched_set_*() functions are
implemented using the 'nocheck' variants, they really shouldn't ever
fail, so remove the return value.

Cc: axboe@kernel.dk
Cc: daniel.lezcano@linaro.org
Cc: sudeep.holla@arm.com
Cc: airlied@redhat.com
Cc: broonie@kernel.org
Cc: paulmck@kernel.org
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2020-06-15 14:10:26 +02:00
Peter Zijlstra
616d91b68c sched: Remove sched_setscheduler*() EXPORTs
Now that nothing (modular) still uses sched_setscheduler(), remove the
exports.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2020-06-15 14:10:25 +02:00
Peter Zijlstra
2cca5426b9 sched,psi: Convert to sched_set_fifo_low()
Because SCHED_FIFO is a broken scheduler model (see previous patches)
take away the priority field, the kernel can't possibly make an
informed decision.

Effectively no change.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2020-06-15 14:10:25 +02:00
Peter Zijlstra
ce1c8afd3f sched,rcutorture: Convert to sched_set_fifo_low()
Because SCHED_FIFO is a broken scheduler model (see previous patches)
take away the priority field, the kernel can't possibly make an
informed decision.

Effectively no change.

Cc: paulmck@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-15 14:10:25 +02:00
Peter Zijlstra
b1433395c4 sched,rcuperf: Convert to sched_set_fifo_low()
Because SCHED_FIFO is a broken scheduler model (see previous patches)
take away the priority field, the kernel can't possibly make an
informed decision.

Effectively no change.

Cc: paulmck@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-15 14:10:24 +02:00
Peter Zijlstra
93db9129fa sched,locktorture: Convert to sched_set_fifo()
Because SCHED_FIFO is a broken scheduler model (see previous patches)
take away the priority field, the kernel can't possibly make an
informed decision.

Effectively changes prio from 99 to 50.

Cc: paulmck@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-15 14:10:24 +02:00
Peter Zijlstra
7a40798c71 sched,irq: Convert to sched_set_fifo()
Because SCHED_FIFO is a broken scheduler model (see previous patches)
take away the priority field, the kernel can't possibly make an
informed decision.

Effectively no change.

Cc: tglx@linutronix.de
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2020-06-15 14:10:24 +02:00
Peter Zijlstra
7318d4cc14 sched: Provide sched_set_fifo()
SCHED_FIFO (or any static priority scheduler) is a broken scheduler
model; it is fundamentally incapable of resource management, the one
thing an OS is actually supposed to do.

It is impossible to compose static priority workloads. One cannot take
two well designed and functional static priority workloads and mash
them together and still expect them to work.

Therefore it doesn't make sense to expose the priority field; the
kernel is fundamentally incapable of setting a sensible value, it
needs systems knowledge that it doesn't have.

Take away sched_setschedule() / sched_setattr() from modules and
replace them with:

  - sched_set_fifo(p); create a FIFO task (at prio 50)
  - sched_set_fifo_low(p); create a task higher than NORMAL,
	which ends up being a FIFO task at prio 1.
  - sched_set_normal(p, nice); (re)set the task to normal

This stops the proliferation of randomly chosen, and irrelevant, FIFO
priorities that dont't really mean anything anyway.

The system administrator/integrator, whoever has insight into the
actual system design and requirements (userspace) can set-up
appropriate priorities if and when needed.

Cc: airlied@redhat.com
Cc: alexander.deucher@amd.com
Cc: awalls@md.metrocast.net
Cc: axboe@kernel.dk
Cc: broonie@kernel.org
Cc: daniel.lezcano@linaro.org
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: herbert@gondor.apana.org.au
Cc: hverkuil@xs4all.nl
Cc: john.stultz@linaro.org
Cc: nico@fluxnic.net
Cc: paulmck@kernel.org
Cc: rafael.j.wysocki@intel.com
Cc: rmk+kernel@arm.linux.org.uk
Cc: sudeep.holla@arm.com
Cc: tglx@linutronix.de
Cc: ulf.hansson@linaro.org
Cc: wim@linux-watchdog.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-15 14:10:20 +02:00
Vincent Guittot
87e867b426 sched/pelt: Cleanup PELT divider
Factorize in a single place the calculation of the divider to be used to
to compute *_avg from *_sum value

Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200612154703.23555-1-vincent.guittot@linaro.org
2020-06-15 14:10:06 +02:00
Christophe JAILLET
c49694173d sched/deadline: Fix a typo in a comment
s/deadine/deadline/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200602195002.677448-1-christophe.jaillet@wanadoo.fr
2020-06-15 14:10:06 +02:00
Luca Abeni
23e71d8ba4 sched/deadline: Implement fallback mechanism for !fit case
When a task has a runtime that cannot be served within the scheduling
deadline by any of the idle CPU (later_mask) the task is doomed to miss
its deadline.

This can happen since the SCHED_DEADLINE admission control guarantees
only bounded tardiness and not the hard respect of all deadlines.
In this case try to select the idle CPU with the largest CPU capacity
to minimize tardiness.

Favor task_cpu(p) if it has max capacity of !fitting CPUs so that
find_later_rq() can potentially still return it (most likely cache-hot)
early.

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200520134243.19352-6-dietmar.eggemann@arm.com
2020-06-15 14:10:05 +02:00
Luca Abeni
b4118988fd sched/deadline: Make DL capacity-aware
The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
algorithm w/o considering CPU capacity or task utilization.
This works well on homogeneous systems where DL tasks are guaranteed
to have a bounded tardiness but presents issues on heterogeneous
systems.

A DL task can migrate to a CPU which does not have enough CPU capacity
to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
period on a CPU w/ 512 capacity).

Add the DL fitness function dl_task_fits_capacity() for DL admission
control on heterogeneous systems. A task fits onto a CPU if:

    CPU original capacity / 1024 >= task runtime / task deadline

Use this function on heterogeneous systems to try to find a CPU which
meets this criterion during task wakeup, push and offline migration.

On homogeneous systems the original behavior of the DL admission
control should be retained.

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200520134243.19352-5-dietmar.eggemann@arm.com
2020-06-15 14:10:05 +02:00
Luca Abeni
60ffd5edc5 sched/deadline: Improve admission control for asymmetric CPU capacities
The current SCHED_DEADLINE (DL) admission control ensures that

    sum of reserved CPU bandwidth < x * M

where

    x = /proc/sys/kernel/sched_rt_{runtime,period}_us
    M = # CPUs in root domain.

DL admission control works well for homogeneous systems where the
capacity of all CPUs are equal (1024). I.e. bounded tardiness for DL
and non-starvation of non-DL tasks is guaranteed.

But on heterogeneous systems where capacity of CPUs are different it
could fail by over-allocating CPU time on smaller capacity CPUs.

On an Arm big.LITTLE/DynamIQ system DL tasks can easily starve other
tasks making it unusable.

Fix this by explicitly considering the CPU capacity in the DL admission
test by replacing M with the root domain CPU capacity sum.

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200520134243.19352-4-dietmar.eggemann@arm.com
2020-06-15 14:10:05 +02:00
Dietmar Eggemann
fc9dc69847 sched/deadline: Add dl_bw_capacity()
Capacity-aware SCHED_DEADLINE Admission Control (AC) needs root domain
(rd) CPU capacity sum.

Introduce dl_bw_capacity() which for a symmetric rd w/ a CPU capacity
of SCHED_CAPACITY_SCALE simply relies on dl_bw_cpus() to return #CPUs
multiplied by SCHED_CAPACITY_SCALE.

For an asymmetric rd or a CPU capacity < SCHED_CAPACITY_SCALE it
computes the CPU capacity sum over rd span and cpu_active_mask.

A 'XXX Fix:' comment was added to highlight that if 'rq->rd ==
def_root_domain' AC should be performed against the capacity of the
CPU the task is running on rather the rd CPU capacity sum. This
issue already exists w/o capacity awareness.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200520134243.19352-3-dietmar.eggemann@arm.com
2020-06-15 14:10:05 +02:00
Dietmar Eggemann
c81b893299 sched/deadline: Optimize dl_bw_cpus()
Return the weight of the root domain (rd) span in case it is a subset
of the cpu_active_mask.

Continue to compute the number of CPUs over rd span and cpu_active_mask
when in hotplug.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20200520134243.19352-2-dietmar.eggemann@arm.com
2020-06-15 14:10:04 +02:00
Peng Liu
9b1b234bb8 sched: correct SD_flags returned by tl->sd_flags()
During sched domain init, we check whether non-topological SD_flags are
returned by tl->sd_flags(), if found, fire a waning and correct the
violation, but the code failed to correct the violation. Correct this.

Fixes: 143e1e28cb ("sched: Rework sched_domain topology definition")
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200609150936.GA13060@iZj6chx1xj0e0buvshuecpZ
2020-06-15 14:10:04 +02:00
Vincent Guittot
3ea2f097b1 sched/fair: Fix NOHZ next idle balance
With commit:
  'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
rebalance_domains of the local cfs_rq happens before others idle cpus have
updated nohz.next_balance and its value is overwritten.

Move the update of nohz.next_balance for other idles cpus before balancing
and updating the next_balance of local cfs_rq.

Also, the nohz.next_balance is now updated only if all idle cpus got a
chance to rebalance their domains and the idle balance has not been aborted
because of new activities on the CPU. In case of need_resched, the idle
load balance will be kick the next jiffie in order to address remaining
ilb.

Fixes: b7031a02ec ("sched/fair: Add NOHZ_STATS_KICK")
Reported-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200609123748.18636-1-vincent.guittot@linaro.org
2020-06-15 14:10:04 +02:00
Peter Zijlstra
b4098bfc5e sched/deadline: Impose global limits on sched_attr::sched_period
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190726161357.397880775@infradead.org
2020-06-15 14:10:04 +02:00
Marcelo Tosatti
9cc5b86568 isolcpus: Affine unbound kernel threads to housekeeping cpus
This is a kernel enhancement that configures the cpu affinity of kernel
threads via kernel boot option nohz_full=.

When this option is specified, the cpumask is immediately applied upon
kthread launch. This does not affect kernel threads that specify cpu
and node.

This allows CPU isolation (that is not allowing certain threads
to execute on certain CPUs) without using the isolcpus=domain parameter,
making it possible to enable load balancing on such CPUs
during runtime (see kernel-parameters.txt).

Note-1: this is based off on Wind River's patch at
https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std/centos/patches/affine-compute-kernel-threads.patch

Difference being that this patch is limited to modifying kernel thread
cpumask. Behaviour of other threads can be controlled via cgroups or
sched_setaffinity.

Note-2: Wind River's patch was based off Christoph Lameter's patch at
https://lwn.net/Articles/565932/ with the only difference being
the kernel parameter changed from kthread to kthread_cpus.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200527142909.23372-3-frederic@kernel.org
2020-06-15 14:10:03 +02:00
Marcelo Tosatti
043eb8e105 kthread: Switch to cpu_possible_mask
Next patch will switch unbound kernel threads mask to
housekeeping_cpumask(), a subset of cpu_possible_mask. So in order to
ease bisection, lets first switch kthreads default affinity from
cpu_all_mask to cpu_possible_mask.

It looks safe to do so as cpu_possible_mask seem to be initialized
at setup_arch() time, way before kthreadd is created.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200527142909.23372-2-frederic@kernel.org
2020-06-15 14:10:03 +02:00
Suren Baghdasaryan
461daba06b psi: eliminate kthread_worker from psi trigger scheduling mechanism
Each psi group requires a dedicated kthread_delayed_work and
kthread_worker. Since no other work can be performed using psi_group's
kthread_worker, the same result can be obtained using a task_struct and
a timer directly. This makes psi triggering simpler by removing lists
and locks involved with kthread_worker usage and eliminates the need for
poll_scheduled atomic use in the hot path.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200528195442.190116-1-surenb@google.com
2020-06-15 14:10:03 +02:00
Vincent Donnefort
4581bea8b4 sched/debug: Add new tracepoints to track util_est
The util_est signals are key elements for EAS task placement and
frequency selection. Having tracepoints to track these signals enables
load-tracking and schedutil testing and/or debugging by a toolkit.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/1590597554-370150-1-git-send-email-vincent.donnefort@arm.com
2020-06-15 14:10:02 +02:00
Dietmar Eggemann
1ca2034ed7 sched/fair: Remove unused 'sd' parameter from scale_rt_capacity()
Since commit 8ec59c0f5f ("sched/topology: Remove unused 'sd'
parameter from arch_scale_cpu_capacity()") it is no longer needed.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200603080304.16548-5-dietmar.eggemann@arm.com
2020-06-15 14:10:01 +02:00
Dietmar Eggemann
e3e76a6a04 sched/idle,stop: Remove .get_rr_interval from sched_class
The idle task and stop task sched_classes return 0 in this function.

The single call site in sched_rr_get_interval() calls
p->sched_class->get_rr_interval() only conditional in case it is
defined. Otherwise time_slice=0 will be used.

The deadline sched class does not define it. Commit a57beec5d4
("sched: Make sched_class::get_rr_interval() optional") introduced
the default time-slice=0 for sched classes which do not provide this
function.

So .get_rr_interval for idle and stop sched_class can be removed to
shrink the code a little.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200603080304.16548-4-dietmar.eggemann@arm.com
2020-06-15 14:10:01 +02:00
Dietmar Eggemann
0900acf2d8 sched/core: Remove redundant 'preempt' param from sched_class->yield_to_task()
Commit 6d1cafd8b5 ("sched: Resched proper CPU on yield_to()") moved
the code to resched the CPU from yield_to_task_fair() to yield_to()
making the preempt parameter in sched_class->yield_to_task()
unnecessary. Remove it. No other sched_class implements yield_to_task().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200603080304.16548-3-dietmar.eggemann@arm.com
2020-06-15 14:10:01 +02:00
Dietmar Eggemann
844eb6458f sched/pelt: Remove redundant cap_scale() definition
Besides in PELT cap_scale() is used in the Deadline scheduler class for
scale-invariant bandwidth enforcement.
Remove the cap_scale() definition in kernel/sched/pelt.c and keep the
one in kernel/sched/sched.h.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200603080304.16548-2-dietmar.eggemann@arm.com
2020-06-15 14:10:01 +02:00
Oleg Nesterov
3dc167ba57 sched/cputime: Improve cputime_adjust()
People report that utime and stime from /proc/<pid>/stat become very
wrong when the numbers are big enough, especially if you watch these
counters incrementally.

Specifically, the current implementation of: stime*rtime/total,
results in a saw-tooth function on top of the desired line, where the
teeth grow in size the larger the values become. IOW, it has a
relative error.

The result is that, when watching incrementally as time progresses
(for large values), we'll see periods of pure stime or utime increase,
irrespective of the actual ratio we're striving for.

Replace scale_stime() with a math64.h helper: mul_u64_u64_div_u64()
that is far more accurate. This also allows architectures to override
the implementation -- for instance they can opt for the old algorithm
if this new one turns out to be too expensive for them.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200519172506.GA317395@hirez.programming.kicks-ass.net
2020-06-15 14:10:00 +02:00
Adrian Hunter
548e1f6c76 ftrace: Add perf text poke events for ftrace trampolines
Add perf text poke events for ftrace trampolines when created and when
freed.

There can be 3 text_poke events for ftrace trampolines:

1. NULL -> trampoline
   By ftrace_update_trampoline() when !ops->trampoline
   Trampoline created

2. [e.g. on x86] CALL rel32 -> CALL rel32
   By arch_ftrace_update_trampoline() when ops->trampoline and
                        ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP
   [e.g. on x86] via text_poke_bp() which generates text poke events
   Trampoline-called function target updated

3. trampoline -> NULL
   By ftrace_trampoline_free() when ops->trampoline and
                 ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP
   Trampoline freed

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-9-adrian.hunter@intel.com
2020-06-15 14:09:50 +02:00
Adrian Hunter
dd9ddf466a ftrace: Add perf ksymbol events for ftrace trampolines
Symbols are needed for tools to describe instruction addresses. Pages
allocated for ftrace's purposes need symbols to be created for them.
Add such symbols to be visible via perf ksymbol events.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-8-adrian.hunter@intel.com
2020-06-15 14:09:49 +02:00
Adrian Hunter
fc0ea795f5 ftrace: Add symbols for ftrace trampolines
Symbols are needed for tools to describe instruction addresses. Pages
allocated for ftrace's purposes need symbols to be created for them.
Add such symbols to be visible via /proc/kallsyms.

Example on x86 with CONFIG_DYNAMIC_FTRACE=y

	# echo function > /sys/kernel/debug/tracing/current_tracer
	# cat /proc/kallsyms | grep '\[__builtin__ftrace\]'
	ffffffffc0238000 t ftrace_trampoline    [__builtin__ftrace]

Note: This patch adds "__builtin__ftrace" as a module name in /proc/kallsyms for
symbols for pages allocated for ftrace's purposes, even though "__builtin__ftrace"
is not a module.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-7-adrian.hunter@intel.com
2020-06-15 14:09:49 +02:00
Adrian Hunter
69e4908869 kprobes: Add perf ksymbol events for kprobe insn pages
Symbols are needed for tools to describe instruction addresses. Pages
allocated for kprobe's purposes need symbols to be created for them.
Add such symbols to be visible via perf ksymbol events.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-5-adrian.hunter@intel.com
2020-06-15 14:09:49 +02:00
Adrian Hunter
d002b8bc6d kprobes: Add symbols for kprobe insn pages
Symbols are needed for tools to describe instruction addresses. Pages
allocated for kprobe's purposes need symbols to be created for them.
Add such symbols to be visible via /proc/kallsyms.

Note: kprobe insn pages are not used if ftrace is configured. To see the
effect of this patch, the kernel must be configured with:

	# CONFIG_FUNCTION_TRACER is not set
	CONFIG_KPROBES=y

and for optimised kprobes:

	CONFIG_OPTPROBES=y

Example on x86:

	# perf probe __schedule
	Added new event:
	  probe:__schedule     (on __schedule)
	# cat /proc/kallsyms | grep '\[__builtin__kprobes\]'
	ffffffffc00d4000 t kprobe_insn_page     [__builtin__kprobes]
	ffffffffc00d6000 t kprobe_optinsn_page  [__builtin__kprobes]

Note: This patch adds "__builtin__kprobes" as a module name in
/proc/kallsyms for symbols for pages allocated for kprobes' purposes, even
though "__builtin__kprobes" is not a module.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200528080058.20230-1-adrian.hunter@intel.com
2020-06-15 14:09:48 +02:00
Adrian Hunter
e17d43b93e perf: Add perf text poke event
Record (single instruction) changes to the kernel text (i.e.
self-modifying code) in order to support tracers like Intel PT and
ARM CoreSight.

A copy of the running kernel code is needed as a reference point (e.g.
from /proc/kcore). The text poke event records the old bytes and the
new bytes so that the event can be processed forwards or backwards.

The basic problem is recording the modified instruction in an
unambiguous manner given SMP instruction cache (in)coherence. That is,
when modifying an instruction concurrently any solution with one or
multiple timestamps is not sufficient:

	CPU0				CPU1
 0
 1	write insn A
 2					execute insn A
 3	sync-I$
 4

Due to I$, CPU1 might execute either the old or new A. No matter where
we record tracepoints on CPU0, one simply cannot tell what CPU1 will
have observed, except that at 0 it must be the old one and at 4 it
must be the new one.

To solve this, take inspiration from x86 text poking, which has to
solve this exact problem due to variable length instruction encoding
and I-fetch windows.

 1) overwrite the instruction with a breakpoint and sync I$

This guarantees that that code flow will never hit the target
instruction anymore, on any CPU (or rather, it will cause an
exception).

 2) issue the TEXT_POKE event

 3) overwrite the breakpoint with the new instruction and sync I$

Now we know that any execution after the TEXT_POKE event will either
observe the breakpoint (and hit the exception) or the new instruction.

So by guarding the TEXT_POKE event with an exception on either side;
we can now tell, without doubt, which instruction another CPU will
have observed.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512121922.8997-2-adrian.hunter@intel.com
2020-06-15 14:09:48 +02:00
David Rientjes
dbed452a07 dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
DMA_REMAP is an unnecessary requirement for AMD SEV, which requires
DMA_COHERENT_POOL, so avoid selecting it when it is otherwise unnecessary.

The only other requirement for DMA coherent pools is DMA_DIRECT_REMAP, so
ensure that properly selects the config option when needed.

Fixes: 82fef0ad81 ("x86/mm: unencrypted non-blocking DMA allocations use coherent pools")
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: David Rientjes <rientjes@google.com>
Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-15 08:35:30 +02:00
Linus Torvalds
4a87b197c1 Add additional LSM hooks for SafeSetID
SafeSetID is capable of making allow/deny decisions for set*uid calls
 on a system, and we want to add similar functionality for set*gid
 calls. The work to do that is not yet complete, so probably won't make
 it in for v5.8, but we are looking to get this simple patch in for
 v5.8 since we have it ready. We are planning on the rest of the work
 for extending the SafeSetID LSM being merged during the v5.9 merge
 window.
 
 This patch was sent to the security mailing list and there were no objections.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEgvWslnM+qUy+sgVg5n2WYw6TPBAFAl7mZCoACgkQ5n2WYw6T
 PBAk1RAAl8t3/m3lELf8qIir4OAd4nK0kc4e+7W8WkznX2ljUl2IetlNxDCBmEXr
 T5qoW6uPsr6kl5AKnbl9Ii7WpW/halsslpKSUNQCs6zbecoVdxekJ8ISW7xHuboZ
 SvS1bqm+t++PM0c0nWSFEr7eXYmPH8OGbCqu6/+nnbxPZf2rJX03e5LnHkEFDFnZ
 0D/rsKgzMt01pdBJQXeoKk79etHO5MjuAkkYVEKJKCR1fM16lk7ECaCp0KJv1Mmx
 I88VncbLvI+um4t82d1Z8qDr2iLgogjJrMZC4WKfxDTmlmxox2Fz9ZJo+8sIWk6k
 T3a95x0s/mYCO4gWtpCVICt9+71Z3ie9T2iaI+CIe/kJvI/ysb+7LSkF+PD33bdz
 0yv6Y9+VMRdzb3pW69R28IoP4wdYQOJRomsY49z6ypH0RgBWcBvyE6e4v+WJGRNK
 E164Imevf6rrZeqJ0kGSBS1nL9WmQHMaXabAwxg1jK1KRZD+YZj3EKC9S/+PAkaT
 1qXUgvGuXHGjQrwU0hclQjgc6BAudWfAGdfrVr7IWwNKJmjgBf6C35my/azrkOg9
 wHCEpUWVmZZLIZLM69/6QXdmMA+iR+rPz5qlVnWhWTfjRYJUXM455Zk+aNo+Qnwi
 +saCcdU+9xqreLeDIoYoebcV/ctHeW0XCQi/+ebjexXVlyeSfYs=
 =I+0L
 -----END PGP SIGNATURE-----

Merge tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux

Pull SafeSetID update from Micah Morton:
 "Add additional LSM hooks for SafeSetID

  SafeSetID is capable of making allow/deny decisions for set*uid calls
  on a system, and we want to add similar functionality for set*gid
  calls.

  The work to do that is not yet complete, so probably won't make it in
  for v5.8, but we are looking to get this simple patch in for v5.8
  since we have it ready.

  We are planning on the rest of the work for extending the SafeSetID
  LSM being merged during the v5.9 merge window"

* tag 'LSM-add-setgid-hook-5.8-author-fix' of git://github.com/micah-morton/linux:
  security: Add LSM hooks to set*gid syscalls
2020-06-14 11:39:31 -07:00
Thomas Cedeno
39030e1351 security: Add LSM hooks to set*gid syscalls
The SafeSetID LSM uses the security_task_fix_setuid hook to filter
set*uid() syscalls according to its configured security policy. In
preparation for adding analagous support in the LSM for set*gid()
syscalls, we add the requisite hook here. Tested by putting print
statements in the security_task_fix_setgid hook and seeing them get hit
during kernel boot.

Signed-off-by: Thomas Cedeno <thomascedeno@google.com>
Signed-off-by: Micah Morton <mortonm@chromium.org>
2020-06-14 10:52:02 -07:00
Linus Torvalds
96144c58ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix cfg80211 deadlock, from Johannes Berg.

 2) RXRPC fails to send norigications, from David Howells.

 3) MPTCP RM_ADDR parsing has an off by one pointer error, fix from
    Geliang Tang.

 4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

 5) The ucc_geth driver needs __netdev_watchdog_up exported, from
    Valentin Longchamp.

 6) Fix hashtable memory leak in dccp, from Wang Hai.

 7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

 8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

 9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

10) Fix link speed reporting in iavf driver, from Brett Creeley.

11) When a channel is used for XSK and then reused again later for XSK,
    we forget to clear out the relevant data structures in mlx5 which
    causes all kinds of problems. Fix from Maxim Mikityanskiy.

12) Fix memory leak in genetlink, from Cong Wang.

13) Disallow sockmap attachments to UDP sockets, it simply won't work.
    From Lorenz Bauer.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
  net: ethernet: ti: ale: fix allmulti for nu type ale
  net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
  net: atm: Remove the error message according to the atomic context
  bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
  libbpf: Support pre-initializing .bss global variables
  tools/bpftool: Fix skeleton codegen
  bpf: Fix memlock accounting for sock_hash
  bpf: sockmap: Don't attach programs to UDP sockets
  bpf: tcp: Recv() should return 0 when the peer socket is closed
  ibmvnic: Flush existing work items before device removal
  genetlink: clean up family attributes allocations
  net: ipa: header pad field only valid for AP->modem endpoint
  net: ipa: program upper nibbles of sequencer type
  net: ipa: fix modem LAN RX endpoint id
  net: ipa: program metadata mask differently
  ionic: add pcie_print_link_status
  rxrpc: Fix race between incoming ACK parser and retransmitter
  net/mlx5: E-Switch, Fix some error pointer dereferences
  net/mlx5: Don't fail driver on failure to create debugfs
  net/mlx5e: CT: Fix ipv6 nat header rewrite actions
  ...
2020-06-13 16:27:13 -07:00
David S. Miller
fa7566a0d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-06-12

The following pull-request contains BPF updates for your *net* tree.

We've added 26 non-merge commits during the last 10 day(s) which contain
a total of 27 files changed, 348 insertions(+), 93 deletions(-).

The main changes are:

1) sock_hash accounting fix, from Andrey.

2) libbpf fix and probe_mem sanitizing, from Andrii.

3) sock_hash fixes, from Jakub.

4) devmap_val fix, from Jesper.

5) load_bytes_relative fix, from YiFei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-13 15:28:08 -07:00
Linus Torvalds
6adc19fd13 Kbuild updates for v5.8 (2nd)
- fix build rules in binderfs sample
 
  - fix build errors when Kbuild recurses to the top Makefile
 
  - covert '---help---' in Kconfig to 'help'
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7lBuYVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGHvIP/3iErjPshpg/phwH8NTCS4SFkiti
 BZRM+2lupSn7Qs53BTpVzIkXoHBJQZlJxlQ5HY8ScO+fiz28rKZr+b40us+je1Q+
 SkvSPfwZzxjEg7lAZutznG4KgItJLWJKmDyh9T8Y8TAuG4f8WO0hKnXoAp3YorS2
 zppEIxso8O5spZPjp+fF/fPbxPjIsabGK7Jp2LpSVFR5pVDHI/ycTlKQS+MFpMEx
 6JIpdFRw7TkvKew1dr5uAWT5btWHatEqjSR3JeyVHv3EICTGQwHmcHK67cJzGInK
 T51+DT7/CpKtmRgGMiTEu/INfMzzoQAKl6Fcu+vMaShTN97Hk9DpdtQyvA6P/h3L
 8GA4UBct05J7fjjIB7iUD+GYQ0EZbaFujzRXLYk+dQqEJRbhcCwvdzggGp0WvGRs
 1f8/AIpgnQv8JSL/bOMgGMS5uL2dSLsgbzTdr6RzWf1jlYdI1i4u7AZ/nBrwWP+Z
 iOBkKsVceEoJrTbaynl3eoYqFLtWyDau+//oBc2gUvmhn8ioM5dfqBRiJjxJnPG9
 /giRj6xRIqMMEw8Gg8PCG7WebfWxWyaIQwlWBbPok7DwISURK5mvOyakZL+Q25/y
 6MBr2H8NEJsf35q0GTINpfZnot7NX4JXrrndJH8NIRC7HEhwd29S041xlQJdP0rs
 E76xsOr3hrAmBu4P
 =1NIT
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - fix build rules in binderfs sample

 - fix build errors when Kbuild recurses to the top Makefile

 - covert '---help---' in Kconfig to 'help'

* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  treewide: replace '---help---' in Kconfig files with 'help'
  kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
  samples: binderfs: really compile this sample and fix build issues
2020-06-13 13:29:16 -07:00
Linus Torvalds
076f14be7f The X86 entry, exception and interrupt code rework
This all started about 6 month ago with the attempt to move the Posix CPU
 timer heavy lifting out of the timer interrupt code and just have lockless
 quick checks in that code path. Trivial 5 patches.
 
 This unearthed an inconsistency in the KVM handling of task work and the
 review requested to move all of this into generic code so other
 architectures can share.
 
 Valid request and solved with another 25 patches but those unearthed
 inconsistencies vs. RCU and instrumentation.
 
 Digging into this made it obvious that there are quite some inconsistencies
 vs. instrumentation in general. The int3 text poke handling in particular
 was completely unprotected and with the batched update of trace events even
 more likely to expose to endless int3 recursion.
 
 In parallel the RCU implications of instrumenting fragile entry code came
 up in several discussions.
 
 The conclusion of the X86 maintainer team was to go all the way and make
 the protection against any form of instrumentation of fragile and dangerous
 code pathes enforcable and verifiable by tooling.
 
 A first batch of preparatory work hit mainline with commit d5f744f9a2.
 
 The (almost) full solution introduced a new code section '.noinstr.text'
 into which all code which needs to be protected from instrumentation of all
 sorts goes into. Any call into instrumentable code out of this section has
 to be annotated. objtool has support to validate this. Kprobes now excludes
 this section fully which also prevents BPF from fiddling with it and all
 'noinstr' annotated functions also keep ftrace off. The section, kprobes
 and objtool changes are already merged.
 
 The major changes coming with this are:
 
     - Preparatory cleanups
 
     - Annotating of relevant functions to move them into the noinstr.text
       section or enforcing inlining by marking them __always_inline so the
       compiler cannot misplace or instrument them.
 
     - Splitting and simplifying the idtentry macro maze so that it is now
       clearly separated into simple exception entries and the more
       interesting ones which use interrupt stacks and have the paranoid
       handling vs. CR3 and GS.
 
     - Move quite some of the low level ASM functionality into C code:
 
        - enter_from and exit to user space handling. The ASM code now calls
          into C after doing the really necessary ASM handling and the return
 	 path goes back out without bells and whistels in ASM.
 
        - exception entry/exit got the equivivalent treatment
 
        - move all IRQ tracepoints from ASM to C so they can be placed as
          appropriate which is especially important for the int3 recursion
          issue.
 
     - Consolidate the declaration and definition of entry points between 32
       and 64 bit. They share a common header and macros now.
 
     - Remove the extra device interrupt entry maze and just use the regular
       exception entry code.
 
     - All ASM entry points except NMI are now generated from the shared header
       file and the corresponding macros in the 32 and 64 bit entry ASM.
 
     - The C code entry points are consolidated as well with the help of
       DEFINE_IDTENTRY*() macros. This allows to ensure at one central point
       that all corresponding entry points share the same semantics. The
       actual function body for most entry points is in an instrumentable
       and sane state.
 
       There are special macros for the more sensitive entry points,
       e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
       They allow to put the whole entry instrumentation and RCU handling
       into safe places instead of the previous pray that it is correct
       approach.
 
     - The INT3 text poke handling is now completely isolated and the
       recursion issue banned. Aside of the entry rework this required other
       isolation work, e.g. the ability to force inline bsearch.
 
     - Prevent #DB on fragile entry code, entry relevant memory and disable
       it on NMI, #MC entry, which allowed to get rid of the nested #DB IST
       stack shifting hackery.
 
     - A few other cleanups and enhancements which have been made possible
       through this and already merged changes, e.g. consolidating and
       further restricting the IDT code so the IDT table becomes RO after
       init which removes yet another popular attack vector
 
     - About 680 lines of ASM maze are gone.
 
 There are a few open issues:
 
    - An escape out of the noinstr section in the MCE handler which needs
      some more thought but under the aspect that MCE is a complete
      trainwreck by design and the propability to survive it is low, this was
      not high on the priority list.
 
    - Paravirtualization
 
      When PV is enabled then objtool complains about a bunch of indirect
      calls out of the noinstr section. There are a few straight forward
      ways to fix this, but the other issues vs. general correctness were
      more pressing than parawitz.
 
    - KVM
 
      KVM is inconsistent as well. Patches have been posted, but they have
      not yet been commented on or picked up by the KVM folks.
 
    - IDLE
 
      Pretty much the same problems can be found in the low level idle code
      especially the parts where RCU stopped watching. This was beyond the
      scope of the more obvious and exposable problems and is on the todo
      list.
 
 The lesson learned from this brain melting exercise to morph the evolved
 code base into something which can be validated and understood is that once
 again the violation of the most important engineering principle
 "correctness first" has caused quite a few people to spend valuable time on
 problems which could have been avoided in the first place. The "features
 first" tinkering mindset really has to stop.
 
 With that I want to say thanks to everyone involved in contributing to this
 effort. Special thanks go to the following people (alphabetical order):
 
    Alexandre Chartre
    Andy Lutomirski
    Borislav Petkov
    Brian Gerst
    Frederic Weisbecker
    Josh Poimboeuf
    Juergen Gross
    Lai Jiangshan
    Macro Elver
    Paolo Bonzini
    Paul McKenney
    Peter Zijlstra
    Vitaly Kuznetsov
    Will Deacon
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7j510THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoU2WD/4refvaNm08fG7aiVYem3JJzr0+Pq5O
 /opwnI/1D973ApApj5W/Nd53sN5tVqOiXncSKgywRBWZxRCAGjVYypl9rjpvXu4l
 HlMjhEKBmWkDryxxrM98Vr7hl3hnId5laR56oFfH+G4LUsItaV6Uak/HfXZ4Mq1k
 iYVbEtl2CN+KJjvSgZ6Y1l853Ab5mmGvmeGNHHWCj8ZyjF3cOLoelDTQNnsb0wXM
 crKXBcXJSsCWKYyJ5PTvB82crQCET7Su+LgwK06w/ZbW1//2hVIjSCiN5o/V+aRJ
 06BZNMj8v9tfglkN8LEQvRIjTlnEQ2sq3GxbrVtA53zxkzbBCBJQ96w8yYzQX0ux
 yhqQ/aIZJ1wTYEjJzSkftwLNMRHpaOUnKvJndXRKAYi+eGI7syF61qcZSYGKuAQ/
 bK3b/CzU6QWr1235oTADxh4isEwxA0Pg5wtJCfDDOG0MJ9ALMSOGUkhoiz5EqpkU
 mzFAwfG/Uj7hRjlkms7Yj2OjZfnU7iypj63GgpXghLjr5ksRFKEOMw8e1GXltVHs
 zzwghUjqp2EPq0VOOQn3lp9lol5Prc3xfFHczKpO+CJW6Rpa4YVdqJmejBqJy/on
 Hh/T/ST3wa2qBeAw89vZIeWiUJZZCsQ0f//+2hAbzJY45Y6DuR9vbTAPb9agRgOM
 xg+YaCfpQqFc1A==
 =llba
 -----END PGP SIGNATURE-----

Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 entry updates from Thomas Gleixner:
 "The x86 entry, exception and interrupt code rework

  This all started about 6 month ago with the attempt to move the Posix
  CPU timer heavy lifting out of the timer interrupt code and just have
  lockless quick checks in that code path. Trivial 5 patches.

  This unearthed an inconsistency in the KVM handling of task work and
  the review requested to move all of this into generic code so other
  architectures can share.

  Valid request and solved with another 25 patches but those unearthed
  inconsistencies vs. RCU and instrumentation.

  Digging into this made it obvious that there are quite some
  inconsistencies vs. instrumentation in general. The int3 text poke
  handling in particular was completely unprotected and with the batched
  update of trace events even more likely to expose to endless int3
  recursion.

  In parallel the RCU implications of instrumenting fragile entry code
  came up in several discussions.

  The conclusion of the x86 maintainer team was to go all the way and
  make the protection against any form of instrumentation of fragile and
  dangerous code pathes enforcable and verifiable by tooling.

  A first batch of preparatory work hit mainline with commit
  d5f744f9a2 ("Pull x86 entry code updates from Thomas Gleixner")

  That (almost) full solution introduced a new code section
  '.noinstr.text' into which all code which needs to be protected from
  instrumentation of all sorts goes into. Any call into instrumentable
  code out of this section has to be annotated. objtool has support to
  validate this.

  Kprobes now excludes this section fully which also prevents BPF from
  fiddling with it and all 'noinstr' annotated functions also keep
  ftrace off. The section, kprobes and objtool changes are already
  merged.

  The major changes coming with this are:

    - Preparatory cleanups

    - Annotating of relevant functions to move them into the
      noinstr.text section or enforcing inlining by marking them
      __always_inline so the compiler cannot misplace or instrument
      them.

    - Splitting and simplifying the idtentry macro maze so that it is
      now clearly separated into simple exception entries and the more
      interesting ones which use interrupt stacks and have the paranoid
      handling vs. CR3 and GS.

    - Move quite some of the low level ASM functionality into C code:

       - enter_from and exit to user space handling. The ASM code now
         calls into C after doing the really necessary ASM handling and
         the return path goes back out without bells and whistels in
         ASM.

       - exception entry/exit got the equivivalent treatment

       - move all IRQ tracepoints from ASM to C so they can be placed as
         appropriate which is especially important for the int3
         recursion issue.

    - Consolidate the declaration and definition of entry points between
      32 and 64 bit. They share a common header and macros now.

    - Remove the extra device interrupt entry maze and just use the
      regular exception entry code.

    - All ASM entry points except NMI are now generated from the shared
      header file and the corresponding macros in the 32 and 64 bit
      entry ASM.

    - The C code entry points are consolidated as well with the help of
      DEFINE_IDTENTRY*() macros. This allows to ensure at one central
      point that all corresponding entry points share the same
      semantics. The actual function body for most entry points is in an
      instrumentable and sane state.

      There are special macros for the more sensitive entry points, e.g.
      INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
      They allow to put the whole entry instrumentation and RCU handling
      into safe places instead of the previous pray that it is correct
      approach.

    - The INT3 text poke handling is now completely isolated and the
      recursion issue banned. Aside of the entry rework this required
      other isolation work, e.g. the ability to force inline bsearch.

    - Prevent #DB on fragile entry code, entry relevant memory and
      disable it on NMI, #MC entry, which allowed to get rid of the
      nested #DB IST stack shifting hackery.

    - A few other cleanups and enhancements which have been made
      possible through this and already merged changes, e.g.
      consolidating and further restricting the IDT code so the IDT
      table becomes RO after init which removes yet another popular
      attack vector

    - About 680 lines of ASM maze are gone.

  There are a few open issues:

   - An escape out of the noinstr section in the MCE handler which needs
     some more thought but under the aspect that MCE is a complete
     trainwreck by design and the propability to survive it is low, this
     was not high on the priority list.

   - Paravirtualization

     When PV is enabled then objtool complains about a bunch of indirect
     calls out of the noinstr section. There are a few straight forward
     ways to fix this, but the other issues vs. general correctness were
     more pressing than parawitz.

   - KVM

     KVM is inconsistent as well. Patches have been posted, but they
     have not yet been commented on or picked up by the KVM folks.

   - IDLE

     Pretty much the same problems can be found in the low level idle
     code especially the parts where RCU stopped watching. This was
     beyond the scope of the more obvious and exposable problems and is
     on the todo list.

  The lesson learned from this brain melting exercise to morph the
  evolved code base into something which can be validated and understood
  is that once again the violation of the most important engineering
  principle "correctness first" has caused quite a few people to spend
  valuable time on problems which could have been avoided in the first
  place. The "features first" tinkering mindset really has to stop.

  With that I want to say thanks to everyone involved in contributing to
  this effort. Special thanks go to the following people (alphabetical
  order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian
  Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai
  Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra,
  Vitaly Kuznetsov, and Will Deacon"

* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
  x86/entry: Force rcu_irq_enter() when in idle task
  x86/entry: Make NMI use IDTENTRY_RAW
  x86/entry: Treat BUG/WARN as NMI-like entries
  x86/entry: Unbreak __irqentry_text_start/end magic
  x86/entry: __always_inline CR2 for noinstr
  lockdep: __always_inline more for noinstr
  x86/entry: Re-order #DB handler to avoid *SAN instrumentation
  x86/entry: __always_inline arch_atomic_* for noinstr
  x86/entry: __always_inline irqflags for noinstr
  x86/entry: __always_inline debugreg for noinstr
  x86/idt: Consolidate idt functionality
  x86/idt: Cleanup trap_init()
  x86/idt: Use proper constants for table size
  x86/idt: Add comments about early #PF handling
  x86/idt: Mark init only functions __init
  x86/entry: Rename trace_hardirqs_off_prepare()
  x86/entry: Clarify irq_{enter,exit}_rcu()
  x86/entry: Remove DBn stacks
  x86/entry: Remove debug IDT frobbing
  x86/entry: Optimize local_db_save() for virt
  ...
2020-06-13 10:05:47 -07:00
Masahiro Yamada
a7f7f6248d treewide: replace '---help---' in Kconfig files with 'help'
Since commit 84af7a6194 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following commend:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-14 01:57:21 +09:00
Linus Torvalds
6c32978414 Notifications over pipes + Keyring notifications
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL
 C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw
 f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ
 JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0
 Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK
 KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP
 E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12
 72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK
 C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7
 GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq
 eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU
 hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops=
 =YTGf
 -----END PGP SIGNATURE-----

Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull notification queue from David Howells:
 "This adds a general notification queue concept and adds an event
  source for keys/keyrings, such as linking and unlinking keys and
  changing their attributes.

  Thanks to Debarshi Ray, we do have a pull request to use this to fix a
  problem with gnome-online-accounts - as mentioned last time:

     https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

  Without this, g-o-a has to constantly poll a keyring-based kerberos
  cache to find out if kinit has changed anything.

  [ There are other notification pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

  LSM hooks are included:

   - A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set. Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable. The
     LSM should use current's credentials. [Wanted by SELinux & Smack]

   - A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue. This is
     given the credentials from the event generator (which may be the
     system) and the watch setter. [Wanted by Smack]

  I've provided SELinux and Smack with implementations of some of these
  hooks.

  WHY
  ===

  Key/keyring notifications are desirable because if you have your
  kerberos tickets in a file/directory, your Gnome desktop will monitor
  that using something like fanotify and tell you if your credentials
  cache changes.

  However, we also have the ability to cache your kerberos tickets in
  the session, user or persistent keyring so that it isn't left around
  on disk across a reboot or logout. Keyrings, however, cannot currently
  be monitored asynchronously, so the desktop has to poll for it - not
  so good on a laptop. This facility will allow the desktop to avoid the
  need to poll.

  DESIGN DECISIONS
  ================

   - The notification queue is built on top of a standard pipe. Messages
     are effectively spliced in. The pipe is opened with a special flag:

        pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem
     like it will ever be applicable in this context)[?]. It is given up
     front to make it a lot easier to prohibit splice&co from accessing
     the pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured::

        ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
        ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

   - It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the
     kernel has to be able to insert messages into the pipe *without*
     holding pipe->mutex and the code to make this work needs careful
     auditing.

   - sendfile(), splice() and vmsplice() are disabled on notification
     pipes because of the pipe->mutex issue and also because they
     sometimes want to revert what they just did - but one or more
     notification messages might've been interleaved in the ring.

   - The kernel inserts messages with the wait queue spinlock held. This
     means that pipe_read() and pipe_write() have to take the spinlock
     to update the queue pointers.

   - Records in the buffer are binary, typed and have a length so that
     they can be of varying size.

     This allows multiple heterogeneous sources to share a common
     buffer; there are 16 million types available, of which I've used
     just a few, so there is scope for others to be used. Tags may be
     specified when a watchpoint is created to help distinguish the
     sources.

   - Records are filterable as types have up to 256 subtypes that can be
     individually filtered. Other filtration is also available.

   - Notification pipes don't interfere with each other; each may be
     bound to a different set of watches. Any particular notification
     will be copied to all the queues that are currently watching for it
     - and only those that are watching for it.

   - When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's
     insufficient space. read() will fabricate a loss notification
     message at an appropriate point later.

   - The notification pipe is created and then watchpoints are attached
     to it, using one of:

        keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
        watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
        watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in both cases, fd indicates the queue and the number after is
     a tag between 0 and 255.

   - Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed. In the latter case, a message will
     be generated indicating the enforced watch removal.

  Things I want to avoid:

   - Introducing features that make the core VFS dependent on the
     network stack or networking namespaces (ie. usage of netlink).

   - Dumping all this stuff into dmesg and having a daemon that sits
     there parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky. Further, dmesg might not exist or might be
     inaccessible inside a container.

   - Letting users see events they shouldn't be able to see.

  TESTING AND MANPAGES
  ====================

   - The keyutils tree has a pipe-watch branch that has keyctl commands
     for making use of notifications. Proposed manual pages can also be
     found on this branch, though a couple of them really need to go to
     the main manpages repository instead.

     If the kernel supports the watching of keys, then running "make
     test" on that branch will cause the testing infrastructure to spawn
     a monitoring process on the side that monitors a notifications pipe
     for all the key/keyring changes induced by the tests and they'll
     all be checked off to make sure they happened.

        https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

   - A test program is provided (samples/watch_queue/watch_test) that
     can be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout"

* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  smack: Implement the watch_key and post_notification hooks
  selinux: Implement the watch_key security hook
  keys: Make the KEY_NEED_* perms an enum rather than a mask
  pipe: Add notification lossage handling
  pipe: Allow buffers to be marked read-whole-or-error for notifications
  Add sample notification program
  watch_queue: Add a key/keyring notification facility
  security: Add hooks to rule on setting a watch
  pipe: Add general notification queue support
  pipe: Add O_NOTIFICATION_PIPE
  security: Add a hook for the point of notification insertion
  uapi: General notification queue definitions
2020-06-13 09:56:21 -07:00
Andrii Nakryiko
29fcb05bbf bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
BPF_PROBE_MEM is kernel-internal implmementation details. When dumping BPF
instructions to user-space, it needs to be replaced back with BPF_MEM mode.

Fixes: 2a02759ef5 ("bpf: Add support for BTF pointers to interpreter")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200613002115.1632142-1-andriin@fb.com
2020-06-12 17:35:38 -07:00
Linus Torvalds
5c2fb57af0 One more printk change for 5.8
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl7jLTcACgkQUqAMR0iA
 lPJiaw/9FWnHlGHq1RMJ2cQTdgDStVDP12+eXUrSXBiXedGNoMfsfRWHHNmiqkSS
 4fmbjNu+//yz44QNzZPF783zix3rdY6IOaNOd95Pi1kjZ2wrcW3ioL7fk6Q0/vr0
 +pC1zeC+G2JzYdXdInvAM7HI0W5R7D0YBUaIORf3bTD4nW1CPbpDSknX6TkfjCRz
 fM9MZxWz1r788uk2HpwhLtjk6qoiNXihTzB/pRbiK9z9MlwQd5331W93RuARYIKy
 8gCM/MUWZ/iD7a4Tvn7+vdtqu1gEo+c2wfgY7ilK56JJsyqL/u1DkOxDQme73ffe
 PAtDgceEKz51N86Lsagp3UdRnP4K78BS7TKOUJ6mWaXb6uPHhB5o5qUN2tmQiEE9
 G5b+0wxiD4iwVK5XaP2g2bfaEBKcaj8CJmPP8u44urFQCEVDKGa3nLanUGvVkzaQ
 JUAZWLhgblfVV4IAJ4WM3JO10Cwt1Js5gUEgqJ8TE6wIKz4zCvFsPYqLl6EUdX/o
 BCn6NI4rX79WvqQVB6KjJ9fdIAJqs6Wgps0ceW/U+Xs0HWfyQL/Ow9ShSiG/ZJvf
 dF3o2wyrnQQq7jgc/2JpQG+mBFHrnRgwD24i8LzaymsadlwNxmmoJxljO/KG4ZSJ
 KniL33BrN92BSBEYMC6YHYAHXCyGYZxTGWugXIeS1N2iRt4A3Gw=
 =+1iq
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.8-kdb-nmi' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fix from Petr Mladek:
 "One more printk change for 5.8: make sure that messages printed from
  KDB context are redirected to KDB console handlers. It did not work
  when KDB interrupted NMI or printk_safe contexts.

  Arm people started hitting this problem more often recently. I forgot
  to add the fix into the previous pull request by mistake"

* tag 'printk-for-5.8-kdb-nmi' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk/kdb: Redirect printk messages into kdb in any context
2020-06-12 12:13:36 -07:00
Linus Torvalds
b791d1bdf9 The Kernel Concurrency Sanitizer (KCSAN)
KCSAN is a dynamic race detector, which relies on compile-time
 instrumentation, and uses a watchpoint-based sampling approach to detect
 races.
 
 The feature was under development for quite some time and has already found
 legitimate bugs.
 
 Unfortunately it comes with a limitation, which was only understood late in
 the development cycle:
 
   It requires an up to date CLANG-11 compiler
 
 CLANG-11 is not yet released (scheduled for June), but it's the only
 compiler today which handles the kernel requirements and especially the
 annotations of functions to exclude them from KCSAN instrumentation
 correctly.
 
 These annotations really need to work so that low level entry code and
 especially int3 text poke handling can be completely isolated.
 
 A detailed discussion of the requirements and compiler issues can be found
 here:
 
   https://lore.kernel.org/lkml/CANpmjNMTsY_8241bS7=XAfqvZHFLrVEkv_uM4aDUWE_kh3Rvbw@mail.gmail.com/
 
 We came to the conclusion that trying to work around compiler limitations
 and bugs again would end up in a major trainwreck, so requiring a working
 compiler seemed to be the best choice.
 
 For Continous Integration purposes the compiler restriction is manageable
 and that's where most xxSAN reports come from.
 
 For a change this limitation might make GCC people actually look at their
 bugs. Some issues with CSAN in GCC are 7 years old and one has been 'fixed'
 3 years ago with a half baken solution which 'solved' the reported issue
 but not the underlying problem.
 
 The KCSAN developers also ponder to use a GCC plugin to become independent,
 but that's not something which will show up in a few days.
 
 Blocking KCSAN until wide spread compiler support is available is not a
 really good alternative because the continuous growth of lockless
 optimizations in the kernel demands proper tooling support.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7im98THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQ3xD/9+q87OmwnyoRTs6O3GDDbWZYoJGolh
 rctDOAYW8RSS73Fiw23z8hKlLl9tJCya6/X8Q9qoonB1YeIEPPRVj5HJWAMUNEIs
 YgjlZJFmh+mnbP/KQFctm3AWpoX8kqt3ncqj6zG72oQ9qKui691BY/2NmGVSLxUV
 DqtUYSKmi51XEQtZuXRuHEf3zBxoyeD43DaSCdJAXd6f5O2X7tmrWDuazHVeKzHV
 lhijvkyBvGMWvPg0IBrXkkLmeOvS0++MTGm3o+L72XF6nWpzTkcV7N0E9GEDFg45
 zwcidRVKD5d/1DoU5Tos96rCJpBEGh/wimlu0z14mcZpNiJgRQH5rzVEO9Y14UcP
 KL9FgRrb5dFw7yfX2zRQ070OFJ4AEDBMK0o5Lbu/QO5KLkvFkqnuWlQfmmtZJWCW
 DTRw/FgUgU7lvyPjRrao6HBvwy+yTb0u9K5seCOTRkuepR9nPJs0710pFiBsNCfV
 RY3cyggNBipAzgBOgLxixnq9+rHt70ton6S8Gijxpvt0dGGfO8k0wuEhFtA4zKrQ
 6HGK+pidxnoVdEgyQZhS+qzMMkyiUL0FXdaGJ2IX+/DC+Ij1UrUPjZBn7v25M0hQ
 ESkvxWKCn7snH4/NJsNxqCV1zyEc3zAW/WvLJUc9I7H8zPwtVvKWPrKEMzrJJ5bA
 aneySilbRxBFUg==
 =iplm
 -----END PGP SIGNATURE-----

Merge tag 'locking-kcsan-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull the Kernel Concurrency Sanitizer from Thomas Gleixner:
 "The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector,
  which relies on compile-time instrumentation, and uses a
  watchpoint-based sampling approach to detect races.

  The feature was under development for quite some time and has already
  found legitimate bugs.

  Unfortunately it comes with a limitation, which was only understood
  late in the development cycle:

     It requires an up to date CLANG-11 compiler

  CLANG-11 is not yet released (scheduled for June), but it's the only
  compiler today which handles the kernel requirements and especially
  the annotations of functions to exclude them from KCSAN
  instrumentation correctly.

  These annotations really need to work so that low level entry code and
  especially int3 text poke handling can be completely isolated.

  A detailed discussion of the requirements and compiler issues can be
  found here:

    https://lore.kernel.org/lkml/CANpmjNMTsY_8241bS7=XAfqvZHFLrVEkv_uM4aDUWE_kh3Rvbw@mail.gmail.com/

  We came to the conclusion that trying to work around compiler
  limitations and bugs again would end up in a major trainwreck, so
  requiring a working compiler seemed to be the best choice.

  For Continous Integration purposes the compiler restriction is
  manageable and that's where most xxSAN reports come from.

  For a change this limitation might make GCC people actually look at
  their bugs. Some issues with CSAN in GCC are 7 years old and one has
  been 'fixed' 3 years ago with a half baken solution which 'solved' the
  reported issue but not the underlying problem.

  The KCSAN developers also ponder to use a GCC plugin to become
  independent, but that's not something which will show up in a few
  days.

  Blocking KCSAN until wide spread compiler support is available is not
  a really good alternative because the continuous growth of lockless
  optimizations in the kernel demands proper tooling support"

* tag 'locking-kcsan-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (76 commits)
  compiler_types.h, kasan: Use __SANITIZE_ADDRESS__ instead of CONFIG_KASAN to decide inlining
  compiler.h: Move function attributes to compiler_types.h
  compiler.h: Avoid nested statement expression in data_race()
  compiler.h: Remove data_race() and unnecessary checks from {READ,WRITE}_ONCE()
  kcsan: Update Documentation to change supported compilers
  kcsan: Remove 'noinline' from __no_kcsan_or_inline
  kcsan: Pass option tsan-instrument-read-before-write to Clang
  kcsan: Support distinguishing volatile accesses
  kcsan: Restrict supported compilers
  kcsan: Avoid inserting __tsan_func_entry/exit if possible
  ubsan, kcsan: Don't combine sanitizer with kcov on clang
  objtool, kcsan: Add kcsan_disable_current() and kcsan_enable_current_nowarn()
  kcsan: Add __kcsan_{enable,disable}_current() variants
  checkpatch: Warn about data_race() without comment
  kcsan: Use GFP_ATOMIC under spin lock
  Improve KCSAN documentation a bit
  kcsan: Make reporting aware of KCSAN tests
  kcsan: Fix function matching in report
  kcsan: Change data_race() to no longer require marking racing accesses
  kcsan: Move kcsan_{disable,enable}_current() to kcsan-checks.h
  ...
2020-06-11 18:55:43 -07:00
Linus Torvalds
a58dfea297 block-5.8-2020-06-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7ioawQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvbJD/wNLN/H4yIQ7tU5XDdvxvpx/u9FC1t2Pep0
 w/olj6wnrsHw/WsgJIlw7efTq9QATfszG/dJKJiBGdiJoCKE1TW/CM6RNfDJb4Z3
 TUa9ghYYzcfI2NRdV94Ol9qRThjB6OG6Cdw4k3oKbx44EJOzgatBI6xIA3nU+f/L
 XO+xl2z3+t28guMvcgUkdJsR8GvSrwcXCvw3X/3uqbtAv5hhMbR7jyqxcHDLX72t
 I+y3/dWfKaienujEmcLKeW+f2RFyjYIvDbQ5b/JDqLah7Fn1A2wYf+mx7iZuQZSi
 5nwGcPuj++8GXS6G8JegAl+s5L3AyBNdz5nrxdAlRjDTMgIUstFgueLnCaW64QNF
 93kWK5gDwhq+26AFl3mGJ3m+qhh1AhGWaVniBiFA3OUeWcOgVGlRf6jtmWazQaEI
 v15WTiAXTsQujnV+t5KYKQnm9vJLIcc/njiSss1JXnqrxR6fH+QCHQ96ckTCqx66
 0GbN5RkuC2J/RHYEyYnYIJlNZGDsCVoBC3QR10WNlng82cxMyrahS011xUTn9VN+
 0Gnz1ilNFc+bx1jUO+pl6EdIsEBbFkKioyoZsgba5mvM+Nn3nGbvqQPJc+18fSV2
 BW1x2yuoc6yjwuol9NMV+cy13Z9u+uA4c0mFIetjuyjE3rZb77iuIiIKVWMRh6Av
 Ip6GuPEA2A==
 =TOc1
 -----END PGP SIGNATURE-----

Merge tag 'block-5.8-2020-06-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Some followup fixes for this merge window. In particular:

   - Seqcount write missing preemption disable for stats (Ahmed)

   - blktrace fixes (Chaitanya)

   - Redundant initializations (Colin)

   - Various small NVMe fixes (Chaitanya, Christoph, Daniel, Max,
     Niklas, Rikard)

   - loop flag bug regression fix (Martijn)

   - blk-mq tagging fixes (Christoph, Ming)"

* tag 'block-5.8-2020-06-11' of git://git.kernel.dk/linux-block:
  umem: remove redundant initialization of variable ret
  pktcdvd: remove redundant initialization of variable ret
  nvmet: fail outstanding host posted AEN req
  nvme-pci: use simple suspend when a HMB is enabled
  nvme-fc: don't call nvme_cleanup_cmd() for AENs
  nvmet-tcp: constify nvmet_tcp_ops
  nvme-tcp: constify nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops
  nvme: do not call del_gendisk() on a disk that was never added
  blk-mq: fix blk_mq_all_tag_iter
  blk-mq: split out a __blk_mq_get_driver_tag helper
  blktrace: fix endianness for blk_log_remap()
  blktrace: fix endianness in get_pdu_int()
  blktrace: use errno instead of bi_status
  block: nr_sects_write(): Disable preemption on seqcount write
  block: remove the error argument to the block_bio_complete tracepoint
  loop: Fix wrong masking of status flags
  block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failed
2020-06-11 16:07:33 -07:00
Linus Torvalds
6a45a65888 A set of fixes and updates for x86:
- Unbreak paravirt VDSO clocks. While the VDSO code was moved into lib
     for sharing a subtle check for the validity of paravirt clocks got
     replaced. While the replacement works perfectly fine for bare metal as
     the update of the VDSO clock mode is synchronous, it fails for paravirt
     clocks because the hypervisor can invalidate them asynchronous. Bring
     it back as an optional function so it does not inflict this on
     architectures which are free of PV damage.
 
   - Fix the jiffies to jiffies64 mapping on 64bit so it does not trigger
     an ODR violation on newer compilers
 
   - Three fixes for the SSBD and *IB* speculation mitigation maze to ensure
     consistency, not disabling of some *IB* variants wrongly and to prevent
     a rogue cross process shutdown of SSBD. All marked for stable.
 
   - Add yet more CPU models to the splitlock detection capable list !@#%$!
 
   - Bring the pr_info() back which tells that TSC deadline timer is enabled.
 
   - Reboot quirk for MacBook6,1
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7ie1oTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofXrEACDD0mNBU2c4vQiR+n4d41PqW1p15DM
 /wG7dYqYt2RdR6qOAspmNL5ilUP+L+eoT/86U9y0g4j3FtTREqyy6mpWE4MQzqaQ
 eKWVoeYt7l9QbR1kP4eks1CN94OyVBUPo3P78UPruWMB11iyKjyrkEdsDmRSLOdr
 6doqMFGHgowrQRwsLPFUt7b2lls6ssOSYgM/ChHi2Iga431ZuYYcRe2mNVsvqx3n
 0N7QZlJ/LivXdCmdpe3viMBsDaomiXAloKUo+HqgrCLYFXefLtfOq09U7FpddYqH
 ztxbGW/7gFn2HEbmdeaiufux263MdHtnjvdPhQZKHuyQmZzzxDNBFgOILSrBJb5y
 qLYJGhMa0sEwMBM9MMItomNgZnOITQ3WGYAdSCg3mG3jK4EXzr6aQm/Qz5SI+Cte
 bQKB2dgR53Gw/1uc7F5qMGQ2NzeUbKycT0ZbF3vkUPVh1kdU3juIntsovv2lFeBe
 Rog/rZliT1xdHrGAHRbubb2/3v66CSodMoYz0eQtr241Oz0LGwnyFqLN3qcZVLDt
 OtxHQ3bbaxevDEetJXfSh3CfHKNYMToAcszmGDse3MJxC7DL5AA51OegMa/GYOX6
 r5J99MUsEzZQoQYyXFf1MjwgxH4CQK1xBBUXYaVG65AcmhT21YbNWnCbxgf7hW+V
 hqaaUSig4V3NLw==
 =VlBk
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull more x86 updates from Thomas Gleixner:
 "A set of fixes and updates for x86:

   - Unbreak paravirt VDSO clocks.

     While the VDSO code was moved into lib for sharing a subtle check
     for the validity of paravirt clocks got replaced. While the
     replacement works perfectly fine for bare metal as the update of
     the VDSO clock mode is synchronous, it fails for paravirt clocks
     because the hypervisor can invalidate them asynchronously.

     Bring it back as an optional function so it does not inflict this
     on architectures which are free of PV damage.

   - Fix the jiffies to jiffies64 mapping on 64bit so it does not
     trigger an ODR violation on newer compilers

   - Three fixes for the SSBD and *IB* speculation mitigation maze to
     ensure consistency, not disabling of some *IB* variants wrongly and
     to prevent a rogue cross process shutdown of SSBD. All marked for
     stable.

   - Add yet more CPU models to the splitlock detection capable list
     !@#%$!

   - Bring the pr_info() back which tells that TSC deadline timer is
     enabled.

   - Reboot quirk for MacBook6,1"

* tag 'x86-urgent-2020-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Unbreak paravirt VDSO clocks
  lib/vdso: Provide sanity check for cycles (again)
  clocksource: Remove obsolete ifdef
  x86_64: Fix jiffies ODR violation
  x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
  x86/speculation: Prevent rogue cross-process SSBD shutdown
  x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
  x86/cpu: Add Sapphire Rapids CPU model number
  x86/split_lock: Add Icelake microserver and Tigerlake CPU models
  x86/apic: Make TSC deadline timer detection message visible
  x86/reboot/quirks: Add MacBook6,1 reboot quirk
2020-06-11 15:54:31 -07:00
Linus Torvalds
623f6dc593 Merge branch 'akpm' (patches from Andrew)
Merge some more updates from Andrew Morton:

 - various hotfixes and minor things

 - hch's use_mm/unuse_mm clearnups

Subsystems affected by this patch series: mm/hugetlb, scripts, kcov,
lib, nilfs, checkpatch, lib, mm/debug, ocfs2, lib, misc.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kernel: set USER_DS in kthread_use_mm
  kernel: better document the use_mm/unuse_mm API contract
  kernel: move use_mm/unuse_mm to kthread.c
  kernel: move use_mm/unuse_mm to kthread.c
  stacktrace: cleanup inconsistent variable type
  lib: test get_count_order/long in test_bitops.c
  mm: add comments on pglist_data zones
  ocfs2: fix spelling mistake and grammar
  mm/debug_vm_pgtable: fix kernel crash by checking for THP support
  lib: fix bitmap_parse() on 64-bit big endian archs
  checkpatch: correct check for kernel parameters doc
  nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
  lib/lz4/lz4_decompress.c: document deliberate use of `&'
  kcov: check kcov_softirq in kcov_remote_stop()
  scripts/spelling: add a few more typos
  khugepaged: selftests: fix timeout condition in wait_for_scan()
2020-06-11 13:25:53 -07:00
Linus Torvalds
55d728b2b0 arm64 merge window fixes for -rc1
- Fix SCS debug check to report max stack usage in bytes as advertised
 - Fix typo: CONFIG_FTRACE_WITH_REGS => CONFIG_DYNAMIC_FTRACE_WITH_REGS
 - Fix incorrect mask in HiSilicon L3C perf PMU driver
 - Fix compat vDSO compilation under some toolchain configurations
 - Fix false UBSAN warning from ACPI IORT parsing code
 - Fix booting under bootloaders that ignore TEXT_OFFSET
 - Annotate debug initcall function with '__init'
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7iMe8QHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNIp5B/46kdFZ1M8VSsGxtZMzLVZBR4MWzjx1wBD3
 Zzvcg5x0aLAvg+VephmQ5cBiQE78/KKISUdTKndevJ9feVhzz8kxbOhLB88o14+L
 Pk63p4jol8v7cJHiqcsBgSLR6MDAiY+4epsgeFA7WkO9cf529UIMO1ea2TCx0KbT
 tKniZghX5I485Fu2RHtZGLGBxQXqFBcDJUok3/IoZnp2SDyUxrzHPViFL9fHHzCb
 FNSEJijcoHfrIKiG4bPssKICmvbtcNysembDlJeyZ+5qJXqotty2M3OK+We7vPrg
 Ne5O/tQoeCt4lLuW40yEmpQzodNLG8D+isC6cFvspmPXSyHflSCz
 =EtmQ
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "arm64 fixes that came in during the merge window.

  There will probably be more to come, but it doesn't seem like it's
  worth me sitting on these in the meantime.

   - Fix SCS debug check to report max stack usage in bytes as advertised

   - Fix typo: CONFIG_FTRACE_WITH_REGS => CONFIG_DYNAMIC_FTRACE_WITH_REGS

   - Fix incorrect mask in HiSilicon L3C perf PMU driver

   - Fix compat vDSO compilation under some toolchain configurations

   - Fix false UBSAN warning from ACPI IORT parsing code

   - Fix booting under bootloaders that ignore TEXT_OFFSET

   - Annotate debug initcall function with '__init'"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: warn on incorrect placement of the kernel by the bootloader
  arm64: acpi: fix UBSAN warning
  arm64: vdso32: add CONFIG_THUMB2_COMPAT_VDSO
  drivers/perf: hisi: Fix wrong value for all counters enable
  arm64: ftrace: Change CONFIG_FTRACE_WITH_REGS to CONFIG_DYNAMIC_FTRACE_WITH_REGS
  arm64: debug: mark a function as __init to save some memory
  scs: Report SCS usage in bytes rather than number of entries
2020-06-11 12:53:23 -07:00
Marco Elver
75d75b7a4d kcsan: Support distinguishing volatile accesses
In the kernel, the "volatile" keyword is used in various concurrent
contexts, whether in low-level synchronization primitives or for
legacy reasons. If supported by the compiler, it will be assumed
that aligned volatile accesses up to sizeof(long long) (matching
compiletime_assert_rwonce_type()) are atomic.

Recent versions of Clang [1] (GCC tentative [2]) can instrument
volatile accesses differently. Add the option (required) to enable the
instrumentation, and provide the necessary runtime functions. None of
the updated compilers are widely available yet (Clang 11 will be the
first release to support the feature).

[1] 5a2c31116f
[2] https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544452.html

This change allows removing of any explicit checks in primitives such as
READ_ONCE() and WRITE_ONCE().

 [ bp: Massage commit message a bit. ]

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200521142047.169334-4-elver@google.com
2020-06-11 20:04:01 +02:00
Thomas Gleixner
37d1a04b13 Rebase locking/kcsan to locking/urgent
Merge the state of the locking kcsan branch before the read/write_once()
and the atomics modifications got merged.

Squash the fallout of the rebase on top of the read/write once and atomic
fallback work into the merge. The history of the original branch is
preserved in tag locking-kcsan-2020-06-02.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-06-11 20:02:46 +02:00
Linus Torvalds
c742b63473 Highlights:
- Keep nfsd clients from unnecessarily breaking their own delegations:
   Note this requires a small kthreadd addition, discussed at:
   https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com
   The result is Tejun Heo's suggestion, and he was OK with this going
   through my tree.
 - Patch nfsd/clients/ to display filenames, and to fix byte-order when
   displaying stateid's.
 - fix a module loading/unloading bug, from Neil Brown.
 - A big series from Chuck Lever with RPC/RDMA and tracing improvements,
   and lay some groundwork for RPC-over-TLS.
 
 Note Stephen Rothwell spotted two conflicts in linux-next.  Both should
 be straightforward:
 	include/trace/events/sunrpc.h
 		https://lore.kernel.org/r/20200529105917.50dfc40f@canb.auug.org.au
 	net/sunrpc/svcsock.c
 		https://lore.kernel.org/r/20200529131955.26c421db@canb.auug.org.au
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl7iRYwVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+yx8QALIfyz/ziPgjGBnNJGCW8BjWHz7+
 rGI+1SP2EUpgJ0fGJc9MpGyYTa5T3pTgsENnIRtegyZDISg2OQ5GfifpkTz4U7vg
 QbWRihs/W9EhltVYhKvtLASAuSAJ8ETbDfLXVb2ncY7iO6JNvb22xwsgKZILmzm1
 uG4qSszmBZzpMUUy51kKJYJZ3ysP+v14qOnyOXEoeEMuJYNK9FkQ9bSPZ6wTJNOn
 hvZBMbU7LzRyVIvp358mFHY+vwq5qBNkJfVrZBkURGn4OxWPbWDXzqOi0Zs1oBjA
 L+QODIbTLGkopu/rD0r1b872PDtket7p5zsD8MreeI1vJOlt3xwqdCGlicIeNATI
 b0RG7sqh+pNv0mvwLxSNTf3rO0EKW6tUySqCnQZUAXFGRH0nYM2TWze4HUr2zfWT
 EgRMwxHY/AZUStZBuCIHPJ6inWnKuxSUELMf2a9JHO1BJc/yClRgmwJGdthVwb9u
 GP6F3/maFu+9YOO6iROMsqtxDA+q5vch5IBzevNOOBDEQDKqENmogR/knl9DmAhF
 sr+FOa3O0u6S4tgXw/TU97JS/h1L2Hu6QVEwU2iVzWtlUUOFVMZQODJTB6Lts4Ka
 gKzYXWvCHN+LyETsN6q7uHFg9mtO7xO5vrrIgo72SuVCscDw/8iHkoOOFLief+GE
 O0fR0IYjW8U1Rkn2
 =YEf0
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Keep nfsd clients from unnecessarily breaking their own
     delegations.

     Note this requires a small kthreadd addition. The result is Tejun
     Heo's suggestion (see link), and he was OK with this going through
     my tree.

   - Patch nfsd/clients/ to display filenames, and to fix byte-order
     when displaying stateid's.

   - fix a module loading/unloading bug, from Neil Brown.

   - A big series from Chuck Lever with RPC/RDMA and tracing
     improvements, and lay some groundwork for RPC-over-TLS"

Link: https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com

* tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux: (49 commits)
  sunrpc: use kmemdup_nul() in gssp_stringify()
  nfsd: safer handling of corrupted c_type
  nfsd4: make drc_slab global, not per-net
  SUNRPC: Remove unreachable error condition in rpcb_getport_async()
  nfsd: Fix svc_xprt refcnt leak when setup callback client failed
  sunrpc: clean up properly in gss_mech_unregister()
  sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
  sunrpc: check that domain table is empty at module unload.
  NFSD: Fix improperly-formatted Doxygen comments
  NFSD: Squash an annoying compiler warning
  SUNRPC: Clean up request deferral tracepoints
  NFSD: Add tracepoints for monitoring NFSD callbacks
  NFSD: Add tracepoints to the NFSD state management code
  NFSD: Add tracepoints to NFSD's duplicate reply cache
  SUNRPC: svc_show_status() macro should have enum definitions
  SUNRPC: Restructure svc_udp_recvfrom()
  SUNRPC: Refactor svc_recvfrom()
  SUNRPC: Clean up svc_release_skb() functions
  SUNRPC: Refactor recvfrom path dealing with incomplete TCP receives
  SUNRPC: Replace dprintk() call sites in TCP receive path
  ...
2020-06-11 10:33:13 -07:00
Peter Zijlstra
6eebad1ad3 lockdep: __always_inline more for noinstr
vmlinux.o: warning: objtool: debug_locks_off()+0xd: call to __debug_locks_off() leaves .noinstr.text section
vmlinux.o: warning: objtool: match_held_lock()+0x6a: call to look_up_lock_class.isra.0() leaves .noinstr.text section
vmlinux.o: warning: objtool: lock_is_held_type()+0x90: call to lockdep_recursion_finish() leaves .noinstr.text section

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200603114052.185201076@infradead.org
2020-06-11 15:15:28 +02:00
Peter Zijlstra
bf2b300844 x86/entry: Rename trace_hardirqs_off_prepare()
The typical pattern for trace_hardirqs_off_prepare() is:

  ENTRY
    lockdep_hardirqs_off(); // because hardware
    ... do entry magic
    instrumentation_begin();
    trace_hardirqs_off_prepare();
    ... do actual work
    trace_hardirqs_on_prepare();
    lockdep_hardirqs_on_prepare();
    instrumentation_end();
    ... do exit magic
    lockdep_hardirqs_on();

which shows that it's named wrong, rename it to
trace_hardirqs_off_finish(), as it concludes the hardirq_off transition.

Also, given that the above is the only correct order, make the traditional
all-in-one trace_hardirqs_off() follow suit.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.415774872@infradead.org
2020-06-11 15:15:24 +02:00
Peter Zijlstra
59bc300b71 x86/entry: Clarify irq_{enter,exit}_rcu()
Because:

  irq_enter_rcu() includes lockdep_hardirq_enter()
  irq_exit_rcu() does *NOT* include lockdep_hardirq_exit()

Which resulted in two 'stray' lockdep_hardirq_exit() calls in
idtentry.h, and me spending a long time trying to find the matching
enter calls.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.359433429@infradead.org
2020-06-11 15:15:24 +02:00
Thomas Gleixner
8a6bc4787f genirq: Provide irq_enter/exit_rcu()
irq_enter()/exit() currently include RCU handling. To properly separate the RCU
handling code, provide variants which contain only the non-RCU related
functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202117.567023613@linutronix.de
2020-06-11 15:15:06 +02:00
Thomas Gleixner
865d3a9afe x86/mce: Address objtools noinstr complaints
Mark the relevant functions noinstr, use the plain non-instrumented MSR
accessors. The only odd part is the instrumentation_begin()/end() pair around the
indirect machine_check_vector() call as objtool can't figure that out. The
possible invoked functions are annotated correctly.

Also use notrace variant of nmi_enter/exit(). If MCEs happen then hardware
latency tracing is the least of the worries.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135315.476734898@linutronix.de
2020-06-11 15:15:02 +02:00
Thomas Gleixner
5916d5f9b3 bug: Annotate WARN/BUG/stackfail as noinstr safe
Warnings, bugs and stack protection fails from noinstr sections, e.g. low
level and early entry code, are likely to be fatal.

Mark them as "safe" to be invoked from noinstr protected code to avoid
annotating all usage sites. Getting the information out is important.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134100.376598577@linutronix.de
2020-06-11 15:14:36 +02:00
Thomas Gleixner
0372007f5a context_tracking: Ensure that the critical path cannot be instrumented
context tracking lacks a few protection mechanisms against instrumentation:

 - While the core functions are marked NOKPROBE they lack protection
   against function tracing which is required as the function entry/exit
   points can be utilized by BPF.

 - static functions invoked from the protected functions need to be marked
   as well as they can be instrumented otherwise.

 - using plain inline allows the compiler to emit traceable and probable
   functions.

Fix this by marking the functions noinstr and converting the plain inlines
to __always_inline.

The NOKPROBE_SYMBOL() annotations are removed as the .noinstr.text section
is already excluded from being probed.

Cures the following objtool warnings:

 vmlinux.o: warning: objtool: enter_from_user_mode()+0x34: call to __context_tracking_exit() leaves .noinstr.text section
 vmlinux.o: warning: objtool: prepare_exit_to_usermode()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: syscall_return_slowpath()+0x29: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_syscall_64()+0x7f: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_int80_syscall_32()+0x3d: call to __context_tracking_enter() leaves .noinstr.text section
 vmlinux.o: warning: objtool: do_fast_syscall_32()+0x9c: call to __context_tracking_enter() leaves .noinstr.text section

and generates new ones...

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134340.811520478@linutronix.de
2020-06-11 15:14:36 +02:00
Petr Mladek
2a9e5ded95 printk/kdb: Redirect printk messages into kdb in any context
kdb has to get messages on consoles even when the system is stopped.
It uses kdb_printf() internally and calls console drivers on its own.

It uses a hack to reuse an existing code. It sets "kdb_trap_printk"
global variable to redirect even the normal printk() into the
kdb_printf() variant.

The variable "kdb_trap_printk" is checked in printk_default() and
it is ignored when printk is redirected to printk_safe in NMI context.
Solve this by moving the check into printk_func().

It is obvious that it is not fully safe. But it does not make things
worse. The console drivers are already called in this context by
db_printf() direct calls.

Reported-by: Sumit Garg <sumit.garg@linaro.org>
Tested-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200520102233.GC3464@linux-b0ei
2020-06-11 08:48:44 +02:00
Christoph Hellwig
37c54f9bd4 kernel: set USER_DS in kthread_use_mm
Some architectures like arm64 and s390 require USER_DS to be set for
kernel threads to access user address space, which is the whole purpose of
kthread_use_mm, but other like x86 don't.  That has lead to a huge mess
where some callers are fixed up once they are tested on said
architectures, while others linger around and yet other like io_uring try
to do "clever" optimizations for what usually is just a trivial asignment
to a member in the thread_struct for most architectures.

Make kthread_use_mm set USER_DS, and kthread_unuse_mm restore to the
previous value instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig
f5678e7f2a kernel: better document the use_mm/unuse_mm API contract
Switch the function documentation to kerneldoc comments, and add
WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
not have ->mm set (or has ->mm set in the case of unuse_mm).

Also give the functions a kthread_ prefix to better document the use case.

[hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
  Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
[sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
  Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [usb]
Acked-by: Haren Myneni <haren@linux.ibm.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Christoph Hellwig
9bf5b9eb23 kernel: move use_mm/unuse_mm to kthread.c
Patch series "improve use_mm / unuse_mm", v2.

This series improves the use_mm / unuse_mm interface by better documenting
the assumptions, and my taking the set_fs manipulations spread over the
callers into the core API.

This patch (of 3):

Use the proper API instead.

Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de

These helpers are only for use with kernel threads, and I will tie them
more into the kthread infrastructure going forward.  Also move the
prototypes to kthread.h - mmu_context.h was a little weird to start with
as it otherwise contains very low-level MM bits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Felipe Balbi <balbi@kernel.org>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20200404094101.672954-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200416053158.586887-1-hch@lst.de
Link: http://lkml.kernel.org/r/20200404094101.672954-5-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:18 -07:00
Andrey Konovalov
3021e69219 kcov: check kcov_softirq in kcov_remote_stop()
kcov_remote_stop() should check that the corresponding kcov_remote_start()
actually found the specified remote handle and started collecting
coverage.  This is done by checking the per thread kcov_softirq flag.

A particular failure scenario where this was observed involved a softirq
with a remote coverage collection section coming between check_kcov_mode()
and the access to t->kcov_area in __sanitizer_cov_trace_pc().  In that
softirq kcov_remote_start() bailed out after kcov_remote_find() check, but
the matching kcov_remote_stop() didn't check if kcov_remote_start()
succeeded, and overwrote per thread kcov parameters with invalid (zero)
values.

Fixes: 5ff3b30ab5 ("kcov: collect coverage from interrupts")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Link: http://lkml.kernel.org/r/fcd1cd16eac1d2c01a66befd8ea4afc6f8d09833.1591576806.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-10 19:14:17 -07:00
Linus Torvalds
1c38372662 Merge branch 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sysctl fixes from Al Viro:
 "Fixups to regressions in sysctl series"

* 'work.sysctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sysctl: reject gigantic reads/write to sysctl files
  cdrom: fix an incorrect __user annotation on cdrom_sysctl_info
  trace: fix an incorrect __user annotation on stack_trace_sysctl
  random: fix an incorrect __user annotation on proc_do_entropy
  net/sysctl: remove leftover __user annotations on neigh_proc_dointvec*
  net/sysctl: use cpumask_parse in flow_limit_cpu_sysctl
2020-06-10 16:05:54 -07:00
Linus Torvalds
4382a79b27 Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc uaccess updates from Al Viro:
 "Assorted uaccess patches for this cycle - the stuff that didn't fit
  into thematic series"

* 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user()
  x86: kvm_hv_set_msr(): use __put_user() instead of 32bit __clear_user()
  user_regset_copyout_zero(): use clear_user()
  TEST_ACCESS_OK _never_ had been checked anywhere
  x86: switch cp_stat64() to unsafe_put_user()
  binfmt_flat: don't use __put_user()
  binfmt_elf_fdpic: don't use __... uaccess primitives
  binfmt_elf: don't bother with __{put,copy_to}_user()
  pselect6() and friends: take handling the combined 6th/7th args into helper
2020-06-10 16:02:54 -07:00
Linus Torvalds
4152d146ee Merge branch 'rwonce/rework' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux
Pull READ/WRITE_ONCE rework from Will Deacon:
 "This the READ_ONCE rework I've been working on for a while, which
  bumps the minimum GCC version and improves code-gen on arm64 when
  stack protector is enabled"

[ Side note: I'm _really_ tempted to raise the minimum gcc version to
  4.9, so that we can just say that we require _Generic() support.

  That would allow us to more cleanly handle a lot of the cases where we
  depend on very complex macros with 'sizeof' or __builtin_choose_expr()
  with __builtin_types_compatible_p() etc.

  This branch has a workaround for sparse not handling _Generic(),
  either, but that was already fixed in the sparse development branch,
  so it's really just gcc-4.9 that we'd require.   - Linus ]

* 'rwonce/rework' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux:
  compiler_types.h: Use unoptimized __unqual_scalar_typeof for sparse
  compiler_types.h: Optimize __unqual_scalar_typeof compilation time
  compiler.h: Enforce that READ_ONCE_NOCHECK() access size is sizeof(long)
  compiler-types.h: Include naked type in __pick_integer_type() match
  READ_ONCE: Fix comment describing 2x32-bit atomicity
  gcov: Remove old GCC 3.4 support
  arm64: barrier: Use '__unqual_scalar_typeof' for acquire/release macros
  locking/barriers: Use '__unqual_scalar_typeof' for load-acquire macros
  READ_ONCE: Drop pointer qualifiers when reading from scalar types
  READ_ONCE: Enforce atomicity for {READ,WRITE}_ONCE() memory accesses
  READ_ONCE: Simplify implementations of {READ,WRITE}_ONCE()
  arm64: csum: Disable KASAN for do_csum()
  fault_inject: Don't rely on "return value" from WRITE_ONCE()
  net: tls: Avoid assigning 'const' pointer to non-const pointer
  netfilter: Avoid assigning 'const' pointer to non-const pointer
  compiler/gcc: Raise minimum GCC version for kernel builds to 4.8
2020-06-10 14:46:54 -07:00
Linus Torvalds
0c67f6b297 More power management updates for 5.8-rc1
- Add support for interconnect bandwidth to the OPP core (Georgi
    Djakov, Saravana Kannan, Sibi Sankar, Viresh Kumar).
 
  - Add support for regulator enable/disable to the OPP core (Kamil
    Konieczny).
 
  - Add boost support to the CPPC cpufreq driver (Xiongfeng Wang).
 
  - Make the tegra186 cpufreq driver set the
    CPUFREQ_NEED_INITIAL_FREQ_CHECK flag (Mian Yousaf Kaukab).
 
  - Prevent the ACPI power management from using power resources
    with devices where the list of power resources for power state
    D0 (full power) is missing (Rafael Wysocki).
 
  - Annotate a hibernation-related function with __init (Christophe
    JAILLET).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl7g/zESHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxZzgP+wTBW/WVLVkrlk2tQhcbbj+y9TNNOfU1
 FZ9C56bR+5VJhrrxTxVHeQP7PDCNxqVM57M8Bcnl3I0LFi+OHNAqkN/xW323N7ZA
 8OGkFgeqSxgG21691rTEwVnwwhdvQsNw47Fqjbu10PiNFYm6W8YNI5JMQRxfTVHb
 H8Nt7xcJ5i7wnMRnAyrotnTUYmS3nZ7IwpHFEoM2SwWCqlYr0h9rDqKz3MvbiE59
 m0G+4tFUv8egyshzwMD78PeFG+7iZP9s4uovsKujj4UGskmAVn9BGP0vI5AJiS4b
 9KdrDNdX5NAEBFn5eDVzZSMzYhRI3pebd306oWUS+A4/rDA+BtC4ECkaWKE6IlX6
 pmJuY5w8mZU0geH+W2xJQp6t/f60XJymEYKu88opm69ujgJL8X2PglWNq8tal6iX
 BfaPNMla+0mNt9L+GzIb6v5f/nbNBQ8qe6vXlQndqcxxerIBPktLTP+j18FzN9N4
 4XOyFJYoauvfMidK8+fjGlSGi54GnlseopNcVTD6IRjRkXhYYJE62oTfgoFe1DIs
 7pFyEtnTgA0Ma9h1CMs2/UwaL1Xule4HMZzdeAnkelAAivdJI181XU0k3Y6vuNwA
 4IXkpMph8utvWX/Dp+OH0a5YGvpEAuihwe8a2Ld/VtOGVINh4hVQucU/yrGN6EIJ
 Od7S+Zua3bR6
 =p5eu
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These are operating performance points (OPP) framework updates mostly,
  including support for interconnect bandwidth in the OPP core, plus a
  few cpufreq changes, including boost support in the CPPC cpufreq
  driver, an ACPI device power management fix and a hibernation code
  cleanup.

  Specifics:

   - Add support for interconnect bandwidth to the OPP core (Georgi
     Djakov, Saravana Kannan, Sibi Sankar, Viresh Kumar).

   - Add support for regulator enable/disable to the OPP core (Kamil
     Konieczny).

   - Add boost support to the CPPC cpufreq driver (Xiongfeng Wang).

   - Make the tegra186 cpufreq driver set the
     CPUFREQ_NEED_INITIAL_FREQ_CHECK flag (Mian Yousaf Kaukab).

   - Prevent the ACPI power management from using power resources with
     devices where the list of power resources for power state D0 (full
     power) is missing (Rafael Wysocki).

   - Annotate a hibernation-related function with __init (Christophe
     JAILLET)"

* tag 'pm-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: PM: Avoid using power resources if there are none for D0
  cpufreq: CPPC: add SW BOOST support
  cpufreq: change '.set_boost' to act on one policy
  PM: hibernate: Add __init annotation to swsusp_header_init()
  opp: Don't parse icc paths unnecessarily
  opp: Remove bandwidth votes when target_freq is zero
  opp: core: add regulators enable and disable
  opp: Reorder the code for !target_freq case
  opp: Expose bandwidth information via debugfs
  cpufreq: dt: Add support for interconnect bandwidth scaling
  opp: Update the bandwidth on OPP frequency changes
  opp: Add sanity checks in _read_opp_key()
  opp: Add support for parsing interconnect bandwidth
  cpufreq: tegra186: add CPUFREQ_NEED_INITIAL_FREQ_CHECK flag
  OPP: Add helpers for reading the binding properties
  dt-bindings: opp: Introduce opp-peak-kBps and opp-avg-kBps bindings
2020-06-10 14:04:39 -07:00
Jesper Dangaard Brouer
281920b7e0 bpf: Devmap adjust uapi for attach bpf program
V2:
- Defer changing BPF-syscall to start at file-descriptor 1
- Use {} to zero initialise struct.

The recent commit fbee97feed ("bpf: Add support to attach bpf program to a
devmap entry"), introduced ability to attach (and run) a separate XDP
bpf_prog for each devmap entry. A bpf_prog is added via a file-descriptor.
As zero were a valid FD, not using the feature requires using value minus-1.
The UAPI is extended via tail-extending struct bpf_devmap_val and using
map->value_size to determine the feature set.

This will break older userspace applications not using the bpf_prog feature.
Consider an old userspace app that is compiled against newer kernel
uapi/bpf.h, it will not know that it need to initialise the member
bpf_prog.fd to minus-1. Thus, users will be forced to update source code to
get program running on newer kernels.

This patch remove the minus-1 checks, and have zero mean feature isn't used.

Followup patches either for kernel or libbpf should handle and avoid
returning file-descriptor zero in the first place.

Fixes: fbee97feed ("bpf: Add support to attach bpf program to a devmap entry")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159170950687.2102545.7235914718298050113.stgit@firesoul
2020-06-09 11:36:18 -07:00
Lorenz Bauer
248e00ac47 bpf: cgroup: Allow multi-attach program to replace itself
When using BPF_PROG_ATTACH to attach a program to a cgroup in
BPF_F_ALLOW_MULTI mode, it is not possible to replace a program
with itself. This is because the check for duplicate programs
doesn't take the replacement program into account.

Replacing a program with itself might seem weird, but it has
some uses: first, it allows resetting the associated cgroup storage.
Second, it makes the API consistent with the non-ALLOW_MULTI usage,
where it is possible to replace a program with itself. Third, it
aligns BPF_PROG_ATTACH with bpf_link, where replacing itself is
also supported.

Sice this code has been refactored a few times this change will
only apply to v5.7 and later. Adjustments could be made to
commit 1020c1f24a ("bpf: Simplify __cgroup_bpf_attach") and
commit d7bf2c10af ("bpf: allocate cgroup storage entries on attaching bpf programs")
as well as commit 324bda9e6c ("bpf: multi program support for cgroup+bpf")

Fixes: af6eea5743 ("bpf: Implement bpf_link-based cgroup BPF program attachment")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200608162202.94002-1-lmb@cloudflare.com
2020-06-09 11:21:43 -07:00
David Ahern
26afa0a4eb bpf: Reset data_meta before running programs attached to devmap entry
This is a new context that does not handle metadata at the moment, so
mark data_meta invalid.

Fixes: fbee97feed ("bpf: Add support to attach bpf program to a devmap entry")
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200608151723.9539-1-dsahern@kernel.org
2020-06-09 11:19:35 -07:00
Jean-Philippe Brucker
22d5bd6867 tracing/probe: Fix bpf_task_fd_query() for kprobes and uprobes
Commit 60d53e2c3b ("tracing/probe: Split trace_event related data from
trace_probe") removed the trace_[ku]probe structure from the
trace_event_call->data pointer. As bpf_get_[ku]probe_info() were
forgotten in that change, fix them now. These functions are currently
only used by the bpf_task_fd_query() syscall handler to collect
information about a perf event.

Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/20200608124531.819838-1-jean-philippe@linaro.org
2020-06-09 11:10:12 -07:00
Linus Torvalds
d1e521adad Tracing updates for 5.8:
No new features this release. Mostly clean ups, restructuring and
 documentation.
 
  - Have ftrace_bug() show ftrace errors before the WARN, as the WARN will
    reboot the box before the error messages are printed if panic_on_warn
    is set.
 
  - Have traceoff_on_warn disable tracing sooner (before prints)
 
  - Write a message to the trace buffer that its being disabled when
    disable_trace_on_warning() is set.
 
  - Separate out synthetic events from histogram code to let it be used by
    other parts of the kernel.
 
  - More documentation on histogram design.
 
  - Other small fixes and clean ups.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXt+LEhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qj2zAP9sD/W4jafYayucj+MvRP7sy+Q0iAH7
 WMn8fkk958cgfQD8D1QFtkkx+3O3TRT6ApGf11w5+JgSWUE2gSbW9H4fPQk=
 =X5t4
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "No new features this release. Mostly clean ups, restructuring and
  documentation.

   - Have ftrace_bug() show ftrace errors before the WARN, as the WARN
     will reboot the box before the error messages are printed if
     panic_on_warn is set.

   - Have traceoff_on_warn disable tracing sooner (before prints)

   - Write a message to the trace buffer that its being disabled when
     disable_trace_on_warning() is set.

   - Separate out synthetic events from histogram code to let it be used
     by other parts of the kernel.

   - More documentation on histogram design.

   - Other small fixes and clean ups"

* tag 'trace-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Remove obsolete PREEMPTIRQ_EVENTS kconfig option
  tracing/doc: Fix ascii-art in histogram-design.rst
  tracing: Add a trace print when traceoff_on_warning is triggered
  ftrace,bug: Improve traceoff_on_warn
  selftests/ftrace: Distinguish between hist and synthetic event checks
  tracing: Move synthetic events to a separate file
  tracing: Fix events.rst section numbering
  tracing/doc: Fix typos in histogram-design.rst
  tracing: Add hist_debug trace event files for histogram debugging
  tracing: Add histogram-design document
  tracing: Check state.disabled in synth event trace functions
  tracing/probe: reverse arguments to list_add
  tools/bootconfig: Add a summary of test cases and return error
  ftrace: show debugging information when panic_on_warn set
2020-06-09 10:06:18 -07:00
Linus Torvalds
a5ad5742f6 Merge branch 'akpm' (patches from Andrew)
Merge even more updates from Andrew Morton:

 - a kernel-wide sweep of show_stack()

 - pagetable cleanups

 - abstract out accesses to mmap_sem - prep for mmap_sem scalability work

 - hch's user acess work

Subsystems affected by this patch series: debug, mm/pagemap, mm/maccess,
mm/documentation.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (93 commits)
  include/linux/cache.h: expand documentation over __read_mostly
  maccess: return -ERANGE when probe_kernel_read() fails
  x86: use non-set_fs based maccess routines
  maccess: allow architectures to provide kernel probing directly
  maccess: move user access routines together
  maccess: always use strict semantics for probe_kernel_read
  maccess: remove strncpy_from_unsafe
  tracing/kprobes: handle mixed kernel/userspace probes better
  bpf: rework the compat kernel probe handling
  bpf:bpf_seq_printf(): handle potentially unsafe format string better
  bpf: handle the compat string in bpf_trace_copy_string better
  bpf: factor out a bpf_trace_copy_string helper
  maccess: unify the probe kernel arch hooks
  maccess: remove probe_read_common and probe_write_common
  maccess: rename strnlen_unsafe_user to strnlen_user_nofault
  maccess: rename strncpy_from_unsafe_strict to strncpy_from_kernel_nofault
  maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault
  maccess: update the top of file comment
  maccess: clarify kerneldoc comments
  maccess: remove duplicate kerneldoc comments
  ...
2020-06-09 09:54:46 -07:00
Oleg Nesterov
013b2deba9 uprobes: ensure that uprobe->offset and ->ref_ctr_offset are properly aligned
uprobe_write_opcode() must not cross page boundary; prepare_uprobe()
relies on arch_uprobe_analyze_insn() which should validate "vaddr" but
some architectures (csky, s390, and sparc) don't do this.

We can remove the BUG_ON() check in prepare_uprobe() and validate the
offset early in __uprobe_register(). The new IS_ALIGNED() check matches
the alignment check in arch_prepare_kprobe() on supported architectures,
so I think that all insns must be aligned to UPROBE_SWBP_INSN_SIZE.

Another problem is __update_ref_ctr() which was wrong from the very
beginning, it can read/write outside of kmap'ed page unless "vaddr" is
aligned to sizeof(short), __uprobe_register() should check this too.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:49:24 -07:00
Christoph Hellwig
98a23609b1 maccess: always use strict semantics for probe_kernel_read
Except for historical confusion in the kprobes/uprobes and bpf tracers,
which has been fixed now, there is no good reason to ever allow user
memory accesses from probe_kernel_read.  Switch probe_kernel_read to only
read from kernel memory.

[akpm@linux-foundation.org: update it for "mm, dump_page(): do not crash with invalid mapping pointer"]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-17-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Christoph Hellwig
9de1fec50b tracing/kprobes: handle mixed kernel/userspace probes better
Instead of using the dangerous probe_kernel_read and strncpy_from_unsafe
helpers, rework probes to try a user probe based on the address if the
architecture has a common address space for kernel and userspace.

[svens@linux.ibm.com:use strncpy_from_kernel_nofault() in fetch_store_string()]
  Link: http://lkml.kernel.org/r/20200606181903.49384-1-svens@linux.ibm.com

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-15-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Christoph Hellwig
8d92db5c04 bpf: rework the compat kernel probe handling
Instead of using the dangerous probe_kernel_read and strncpy_from_unsafe
helpers, rework the compat probes to check if an address is a kernel or
userspace one, and then use the low-level kernel or user probe helper
shared by the proper kernel and user probe helpers.  This slightly
changes behavior as the compat probe on a user address doesn't check
the lockdown flags, just as the pure user probes do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-14-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Andrew Morton
19c8d8ac63 bpf:bpf_seq_printf(): handle potentially unsafe format string better
User the proper helper for kernel or userspace addresses based on
TASK_SIZE instead of the dangerous strncpy_from_unsafe function.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Christoph Hellwig
aec6ce5913 bpf: handle the compat string in bpf_trace_copy_string better
User the proper helper for kernel or userspace addresses based on
TASK_SIZE instead of the dangerous strncpy_from_unsafe function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-13-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Christoph Hellwig
d7b2977b81 bpf: factor out a bpf_trace_copy_string helper
Split out a helper to do the fault free access to the string pointer
to get it out of a crazy indentation level.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-12-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Christoph Hellwig
02dddb160e maccess: rename strnlen_unsafe_user to strnlen_user_nofault
This matches the naming of strnlen_user, and also makes it more clear
what the function is supposed to do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-9-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Christoph Hellwig
c4cb164426 maccess: rename strncpy_from_unsafe_strict to strncpy_from_kernel_nofault
This matches the naming of strncpy_from_user_nofault, and also makes it
more clear what the function is supposed to do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-8-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Christoph Hellwig
bd88bb5d40 maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault
This matches the naming of strncpy_from_user, and also makes it more
clear what the function is supposed to do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:15 -07:00
Michel Lespinasse
c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
0cc55a0213 mmap locking API: add mmap_read_trylock_non_owner()
Add a couple APIs used by kernel/bpf/stackmap.c only:
- mmap_read_trylock_non_owner()
- mmap_read_unlock_non_owner() (may be called from a work queue).

It's still not ideal that bpf/stackmap subverts the lock ownership in this
way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
way of addressing this in the short term.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-8-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
aaa2cc56c1 mmap locking API: convert nested write lock sites
Add API for nested write locks and convert the few call sites doing that.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-7-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse
d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Mike Rapoport
ca5999fde0 mm: introduce include/linux/pgtable.h
The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Mike Rapoport
e31cf2f4ca mm: don't include asm/pgtable.h if linux/mm.h is already included
Patch series "mm: consolidate definitions of page table accessors", v2.

The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once.  For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.

Most of these definitions are actually identical and typically it boils
down to, e.g.

static inline unsigned long pmd_index(unsigned long address)
{
        return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}

These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.

These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.

This patch (of 12):

The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g.  pte_alloc() and
pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.

The include statements in such cases are remove with a simple loop:

	for f in $(git grep -l "include <linux/mm.h>") ; do
		sed -i -e '/include <asm\/pgtable.h>/ d' $f
	done

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Dmitry Safonov
9cb8f069de kernel: rename show_stack_loglvl() => show_stack()
Now the last users of show_stack() got converted to use an explicit log
level, show_stack_loglvl() can drop it's redundant suffix and become once
again well known show_stack().

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Dmitry Safonov
fe1993a001 kernel: use show_stack_loglvl()
Align the last users of show_stack() by KERN_DEFAULT as the surrounding
headers/messages.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-50-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Dmitry Safonov
8ba09b1dc1 sched: print stack trace with KERN_INFO
Aligning with other messages printed in sched_show_task() - use KERN_INFO
to print the backtrace.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-49-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:12 -07:00
Dmitry Safonov
77819daf24 kdb: don't play with console_loglevel
Print the stack trace with KERN_EMERG - it should be always visible.

Playing with console_loglevel is a bad idea as there may be more messages
printed than wanted.  Also the stack trace might be not printed at all if
printk() was deferred and console_loglevel was raised back before the
trace got flushed.

Unfortunately, after rebasing on commit 2277b49258 ("kdb: Fix stack
crawling on 'running' CPUs that aren't the master"), kdb_show_stack() uses
now kdb_dump_stack_on_cpu(), which for now won't be converted as it uses
dump_stack() instead of show_stack().

Convert for now the branch that uses show_stack() and remove
console_loglevel exercise from that case.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Link: http://lkml.kernel.org/r/20200418201944.482088-48-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:12 -07:00
Dmitry Safonov
2062a4e8ae kallsyms/printk: add loglvl to print_ip_sym()
Patch series "Add log level to show_stack()", v3.

Add log level argument to show_stack().

Done in three stages:
1. Introducing show_stack_loglvl() for every architecture
2. Migrating old users with an explicit log level
3. Renaming show_stack_loglvl() into show_stack()

Justification:

- It's a design mistake to move a business-logic decision into platform
  realization detail.

- I have currently two patches sets that would benefit from this work:
  Removing console_loglevel jumps in sysrq driver [1] Hung task warning
  before panic [2] - suggested by Tetsuo (but he probably didn't realise
  what it would involve).

- While doing (1), (2) the backtraces were adjusted to headers and other
  messages for each situation - so there won't be a situation when the
  backtrace is printed, but the headers are missing because they have
  lesser log level (or the reverse).

- As the result in (2) plays with console_loglevel for kdb are removed.

The least important for upstream, but maybe still worth to note that every
company I've worked in so far had an off-list patch to print backtrace
with the needed log level (but only for the architecture they cared
about).  If you have other ideas how you will benefit from show_stack()
with a log level - please, reply to this cover letter.

See also discussion on v1:
https://lore.kernel.org/linux-riscv/20191106083538.z5nlpuf64cigxigh@pathway.suse.cz/

This patch (of 50):

print_ip_sym() needs to have a log level parameter to comply with other
parts being printed.  Otherwise, half of the expected backtrace would be
printed and other may be missing with some logging level.

The following callee(s) are using now the adjusted log level:
- microblaze/unwind: the same level as headers & userspace unwind.
  Note that pr_debug()'s there are for debugging the unwinder itself.
- nds32/traps: symbol addresses are printed with the same log level
  as backtrace headers.
- lockdep: ip for locking issues is printed with the same log level
  as other part of the warning.
- sched: ip where preemption was disabled is printed as error like
  the rest part of the message.
- ftrace: bug reports are now consistent in the log level being used.

Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Will Deacon <will@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Link: http://lkml.kernel.org/r/20200418201944.482088-2-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:10 -07:00
Thomas Gleixner
c7f3d43b62 clocksource: Remove obsolete ifdef
CONFIG_GENERIC_VDSO_CLOCK_MODE was a transitional config switch which got
removed after all architectures got converted to the new storage model.

But the removal forgot to remove the #ifdef which guards the
vdso_clock_mode sanity check, which effectively disables the sanity check.

Remove it now.

Fixes: f86fd32db7 ("lib/vdso: Cleanup clock mode storage leftovers")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200606221531.845475036@linutronix.de
2020-06-09 16:36:47 +02:00
Geert Uytterhoeven
3ee06a6d53 dma-pool: fix too large DMA pools on medium memory size systems
On systems with at least 32 MiB, but less than 32 GiB of RAM, the DMA
memory pools are much larger than intended (e.g. 2 MiB instead of 128
KiB on a 256 MiB system).

Fix this by correcting the calculation of the number of GiBs of RAM in
the system.  Invert the order of the min/max operations, to keep on
calculating in pages until the last step, which aids readability.

Fixes: 1d659236fb ("dma-pool: scale the default DMA coherent pool size with memory capacity")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-09 15:25:52 +02:00
Christoph Hellwig
490741ab1b module: move the set_fs hack for flush_icache_range to m68k
flush_icache_range generally operates on kernel addresses, but for some
reason m68k needed a set_fs override.  Move that into the m68k code
insted of keeping it in the module loader.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Link: http://lkml.kernel.org/r/20200515143646.3857579-30-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:58 -07:00
Christoph Hellwig
885f7f8e30 mm: rename flush_icache_user_range to flush_icache_user_page
The function currently known as flush_icache_user_range only operates on
a single page.  Rename it to flush_icache_user_page as we'll need the
name flush_icache_user_range for something else soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20200515143646.3857579-20-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:58 -07:00
Souptick Joarder
dadbb612f6 mm/gup.c: convert to use get_user_{page|pages}_fast_only()
API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
align with pin_user_pages_fast_only().

As part of this we will get rid of write parameter.  Instead caller will
pass FOLL_WRITE to get_user_pages_fast_only().  This will not change any
existing functionality of the API.

All the callers are changed to pass FOLL_WRITE.

Also introduce get_user_page_fast_only(), and use it in a few places
that hard-code nr_pages to 1.

Updated the documentation of the API.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>		[arch/powerpc/kvm]
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Suchanek <msuchanek@suse.de>
Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Rafael Aquini
e77132e758 kernel/sysctl.c: ignore out-of-range taint bits introduced via kernel.tainted
Users with SYS_ADMIN capability can add arbitrary taint flags to the
running kernel by writing to /proc/sys/kernel/tainted or issuing the
command 'sysctl -w kernel.tainted=...'.  This interface, however, is
open for any integer value and this might cause an invalid set of flags
being committed to the tainted_mask bitset.

This patch introduces a simple way for proc_taint() to ignore any
eventual invalid bit coming from the user input before committing those
bits to the kernel tainted_mask.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Link: http://lkml.kernel.org/r/20200512223946.888020-1-aquini@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Guilherme G. Piccoli
60c958d8df panic: add sysctl to dump all CPUs backtraces on oops event
Usually when the kernel reaches an oops condition, it's a point of no
return; in case not enough debug information is available in the kernel
splat, one of the last resorts would be to collect a kernel crash dump
and analyze it.  The problem with this approach is that in order to
collect the dump, a panic is required (to kexec-load the crash kernel).
When in an environment of multiple virtual machines, users may prefer to
try living with the oops, at least until being able to properly shutdown
their VMs / finish their important tasks.

This patch implements a way to collect a bit more debug details when an
oops event is reached, by printing all the CPUs backtraces through the
usage of NMIs (on architectures that support that).  The sysctl added
(and documented) here was called "oops_all_cpu_backtrace", and when set
will (as the name suggests) dump all CPUs backtraces.

Far from ideal, this may be the last option though for users that for
some reason cannot panic on oops.  Most of times oopses are clear enough
to indicate the kernel portion that must be investigated, but in virtual
environments it's possible to observe hypervisor/KVM issues that could
lead to oopses shown in other guests CPUs (like virtual APIC crashes).
This patch hence aims to help debug such complex issues without
resorting to kdump.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Guilherme G. Piccoli
0ec9dc9bcb kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected
Commit 401c636a0e ("kernel/hung_task.c: show all hung tasks before
panic") introduced a change in that we started to show all CPUs
backtraces when a hung task is detected _and_ the sysctl/kernel
parameter "hung_task_panic" is set.  The idea is good, because usually
when observing deadlocks (that may lead to hung tasks), the culprit is
another task holding a lock and not necessarily the task detected as
hung.

The problem with this approach is that dumping backtraces is a slightly
expensive task, specially printing that on console (and specially in
many CPU machines, as servers commonly found nowadays).  So, users that
plan to collect a kdump to investigate the hung tasks and narrow down
the deadlock definitely don't need the CPUs backtrace on dmesg/console,
which will delay the panic and pollute the log (crash tool would easily
grab all CPUs traces with 'bt -a' command).

Also, there's the reciprocal scenario: some users may be interested in
seeing the CPUs backtraces but not have the system panic when a hung
task is detected.  The current approach hence is almost as embedding a
policy in the kernel, by forcing the CPUs backtraces' dump (only) on
hung_task_panic.

This patch decouples the panic event on hung task from the CPUs
backtraces dump, by creating (and documenting) a new sysctl called
"hung_task_all_cpu_backtrace", analog to the approach taken on soft/hard
lockups, that have both a panic and an "all_cpu_backtrace" sysctl to
allow individual control.  The new mechanism for dumping the CPUs
backtraces on hung task detection respects "hung_task_warnings" by not
dumping the traces in case there's no warnings left.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Guilherme G. Piccoli
f117955a22 kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases
After a recent change introduced by Vlastimil's series [0], kernel is
able now to handle sysctl parameters on kernel command line; also, the
series introduced a simple infrastructure to convert legacy boot
parameters (that duplicate sysctls) into sysctl aliases.

This patch converts the watchdog parameters softlockup_panic and
{hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure.
It fixes the documentation too, since the alias only accepts values 0 or
1, not the full range of integers.

We also took the opportunity here to improve the documentation of the
previously converted hung_task_panic (see the patch series [0]) and put
the alias table in alphabetical order.

[0] http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Link: http://lkml.kernel.org/r/20200507214624.21911-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Vlastimil Babka
b467f3ef3c kernel/hung_task convert hung_task_panic boot parameter to sysctl
We can now handle sysctl parameters on kernel command line and have
infrastructure to convert legacy command line options that duplicate
sysctl to become a sysctl alias.

This patch converts the hung_task_panic parameter.  Note that the sysctl
handler is more strict and allows only 0 and 1, while the legacy
parameter allowed any non-zero value.  But there is little reason anyone
would not be using 1.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Rafael Aquini
db38d5c106 kernel: add panic_on_taint
Analogously to the introduction of panic_on_warn, this patch introduces
a kernel option named panic_on_taint in order to provide a simple and
generic way to stop execution and catch a coredump when the kernel gets
tainted by any given flag.

This is useful for debugging sessions as it avoids having to rebuild the
kernel to explicitly add calls to panic() into the code sites that
introduce the taint flags of interest.

For instance, if one is interested in proceeding with a post-mortem
analysis at the point a given code path is hitting a bad page (i.e.
unaccount_page_cache_page(), or slab_bug()), a coredump can be collected
by rebooting the kernel with 'panic_on_taint=0x20' amended to the
command line.

Another, perhaps less frequent, use for this option would be as a means
for assuring a security policy case where only a subset of taints, or no
single taint (in paranoid mode), is allowed for the running system.  The
optional switch 'nousertaint' is handy in this particular scenario, as
it will avoid userspace induced crashes by writes to sysctl interface
/proc/sys/kernel/tainted causing false positive hits for such policies.

[akpm@linux-foundation.org: tweak kernel-parameters.txt wording]

Suggested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Bunk <bunk@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 11:05:56 -07:00
Christoph Hellwig
7ff0d4490e trace: fix an incorrect __user annotation on stack_trace_sysctl
No user pointers for sysctls anymore.

Fixes: 32927393dc ("sysctl: pass kernel pointers to ->proc_handler")
Reported-by: build test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-08 10:13:56 -04:00
Linus Torvalds
9aa900c809 Char/Misc driver patches for 5.8-rc1
Here is the large set of char/misc driver patches for 5.8-rc1
 
 Included in here are:
 	- habanalabs driver updates, loads
 	- mhi bus driver updates
 	- extcon driver updates
 	- clk driver updates (approved by the clock maintainer)
 	- firmware driver updates
 	- fpga driver updates
 	- gnss driver updates
 	- coresight driver updates
 	- interconnect driver updates
 	- parport driver updates (it's still alive!)
 	- nvmem driver updates
 	- soundwire driver updates
 	- visorbus driver updates
 	- w1 driver updates
 	- various misc driver updates
 
 In short, loads of different driver subsystem updates along with the
 drivers as well.
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXtzkHw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yldOwCgus/DgpnI1UL4z+NdBxJrAXtkPmgAn2sgTUea
 i5RblCmcVMqvHaGtYkY+
 =tScN
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the large set of char/misc driver patches for 5.8-rc1

  Included in here are:

   - habanalabs driver updates, loads

   - mhi bus driver updates

   - extcon driver updates

   - clk driver updates (approved by the clock maintainer)

   - firmware driver updates

   - fpga driver updates

   - gnss driver updates

   - coresight driver updates

   - interconnect driver updates

   - parport driver updates (it's still alive!)

   - nvmem driver updates

   - soundwire driver updates

   - visorbus driver updates

   - w1 driver updates

   - various misc driver updates

  In short, loads of different driver subsystem updates along with the
  drivers as well.

  All have been in linux-next for a while with no reported issues"

* tag 'char-misc-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (233 commits)
  habanalabs: correctly cast u64 to void*
  habanalabs: initialize variable to default value
  extcon: arizona: Fix runtime PM imbalance on error
  extcon: max14577: Add proper dt-compatible strings
  extcon: adc-jack: Fix an error handling path in 'adc_jack_probe()'
  extcon: remove redundant assignment to variable idx
  w1: omap-hdq: print dev_err if irq flags are not cleared
  w1: omap-hdq: fix interrupt handling which did show spurious timeouts
  w1: omap-hdq: fix return value to be -1 if there is a timeout
  w1: omap-hdq: cleanup to add missing newline for some dev_dbg
  /dev/mem: Revoke mappings when a driver claims the region
  misc: xilinx-sdfec: convert get_user_pages() --> pin_user_pages()
  misc: xilinx-sdfec: cleanup return value in xsdfec_table_write()
  misc: xilinx-sdfec: improve get_user_pages_fast() error handling
  nvmem: qfprom: remove incorrect write support
  habanalabs: handle MMU cache invalidation timeout
  habanalabs: don't allow hard reset with open processes
  habanalabs: GAUDI does not support soft-reset
  habanalabs: add print for soft reset due to event
  habanalabs: improve MMU cache invalidation code
  ...
2020-06-07 10:59:32 -07:00
Linus Torvalds
081096d98b TTY/Serial driver updates for 5.8-rc1
Here is the tty and serial driver updates for 5.8-rc1
 
 Nothing huge at all, just a lot of little serial driver fixes, updates
 for new devices and features, and other small things.  Full details are
 in the shortlog.
 
 Note, you will get a conflict merging with your tree in the
 Documentation/devicetree/bindings/serial/rs485.yaml file, but it should
 be pretty obvious what to do.  If not, I'm sure Rob will clean it all up
 afterwards :)
 
 All of these have been in linux-next with no issues for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXtzpCg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylRxACgjGtOKPjahONL4lWd0F8ZYEcyw7sAn34woBCO
 BDUV3kolrRQ4OYNJWsHP
 =TvqG
 -----END PGP SIGNATURE-----

Merge tag 'tty-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver updates from Greg KH:
 "Here is the tty and serial driver updates for 5.8-rc1

  Nothing huge at all, just a lot of little serial driver fixes, updates
  for new devices and features, and other small things. Full details are
  in the shortlog.

  All of these have been in linux-next with no issues for a while"

* tag 'tty-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (67 commits)
  tty: serial: qcom_geni_serial: Add 51.2MHz frequency support
  tty: serial: imx: clear Ageing Timer Interrupt in handler
  serial: 8250_fintek: Add F81966 Support
  sc16is7xx: Add flag to activate IrDA mode
  dt-bindings: sc16is7xx: Add flag to activate IrDA mode
  serial: 8250: Support rs485 bus termination GPIO
  serial: 8520_port: Fix function param documentation
  dt-bindings: serial: Add binding for rs485 bus termination GPIO
  vt: keyboard: avoid signed integer overflow in k_ascii
  serial: 8250: Enable 16550A variants by default on non-x86
  tty: hvc_console, fix crashes on parallel open/close
  serial: imx: Initialize lock for non-registered console
  sc16is7xx: Read the LSR register for basic device presence check
  sc16is7xx: Allow sharing the IRQ line
  sc16is7xx: Use threaded IRQ
  sc16is7xx: Always use falling edge IRQ
  tty: n_gsm: Fix bogus i++ in gsm_data_kick
  tty: n_gsm: Remove unnecessary test in gsm_print_packet()
  serial: stm32: add no_console_suspend support
  tty: serial: fsl_lpuart: Use __maybe_unused instead of #if CONFIG_PM_SLEEP
  ...
2020-06-07 09:52:36 -07:00
Linus Torvalds
cff11abeca Kbuild updates for v5.8
- fix warnings in 'make clean' for ARCH=um, hexagon, h8300, unicore32
 
  - ensure to rebuild all objects when the compiler is upgraded
 
  - exclude system headers from dependency tracking and fixdep processing
 
  - fix potential bit-size mismatch between the kernel and BPF user-mode
    helper
 
  - add the new syntax 'userprogs' to build user-space programs for the
    target architecture (the same arch as the kernel)
 
  - compile user-space sample code under samples/ for the target arch
    instead of the host arch
 
  - make headers_install fail if a CONFIG option is leaked to user-space
 
  - sanitize the output format of scripts/checkstack.pl
 
  - handle ARM 'push' instruction in scripts/checkstack.pl
 
  - error out before modpost if a module name conflict is found
 
  - error out when multiple directories are passed to M= because this
    feature is broken for a long time
 
  - add CONFIG_DEBUG_INFO_COMPRESSED to support compressed debug info
 
  - a lot of cleanups of modpost
 
  - dump vmlinux symbols out into vmlinux.symvers, and reuse it in the
    second pass of modpost
 
  - do not run the second pass of modpost if nothing in modules is updated
 
  - install modules.builtin(.modinfo) by 'make install' as well as by
    'make modules_install' because it is useful even when CONFIG_MODULES=n
 
  - add new command line variables, GZIP, BZIP2, LZOP, LZMA, LZ4, and XZ
    to allow users to use alternatives such as pigz, pbzip2, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7brm0VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGjeEP/Rrf8H9cp/Tq+ALQCBycI3W5ZEHg
 n2EqprZkVP2MlOV0d+8b9t4PdZf6E5Wmfv26sMaBAhl6X1KQI/0NgPMnTINvy5jJ
 Q2SMhj9y8Gwr3XKFu9Hd/0U+Sax5rz+LmY84tdF95dXzPIUWjAEVnbmN+ofY6T++
 sNf2YGNFSR6iiqr3uCYA0hHZmpKlfhVgDPAdncWa5aadSsuQb79nZQWefGeVEsuD
 HrISpwnkhBc0qY1xyWry6agE92xWmkNkdjKq6A7peguZL02XySWLRWjyHoiiaPOB
 6U4urKs/NSXqPgxGxwZthhwERHryC3+g4s8wRBDKE6ISRWKBBA2ruHpgdF5h/utu
 re1ZP2qRcAt8NBFynr4MEb2AU0mYkv7iEgfLJ7NUCRlMOtqrn5RFwnS4r8ReyQp5
 1UM11RbPhYgYjM5g9hBHJ7nK944/kfvy1/4jF4I1+M5O7QL6f00pu3r2bBIa/65g
 DWrNOpIliKG27GgnRlxi7HgLfxs9etFcXTpHO0ymgnMmlz+7FQsdceR9qqybGU9o
 yBWw6zculMQjb3E+k0DTnE5kLWsycbua921wxM9ABSxRmJi7WciNF73RdLUIBoAY
 VUbwrP2aIpdL+2uyX6RqdTaWzEBpW8omszr46aQ96pX+RiqMrPvJRLaA/tr3ZH8g
 tdHenJPWdHSaOcO4
 =GKe5
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - fix warnings in 'make clean' for ARCH=um, hexagon, h8300, unicore32

 - ensure to rebuild all objects when the compiler is upgraded

 - exclude system headers from dependency tracking and fixdep processing

 - fix potential bit-size mismatch between the kernel and BPF user-mode
   helper

 - add the new syntax 'userprogs' to build user-space programs for the
   target architecture (the same arch as the kernel)

 - compile user-space sample code under samples/ for the target arch
   instead of the host arch

 - make headers_install fail if a CONFIG option is leaked to user-space

 - sanitize the output format of scripts/checkstack.pl

 - handle ARM 'push' instruction in scripts/checkstack.pl

 - error out before modpost if a module name conflict is found

 - error out when multiple directories are passed to M= because this
   feature is broken for a long time

 - add CONFIG_DEBUG_INFO_COMPRESSED to support compressed debug info

 - a lot of cleanups of modpost

 - dump vmlinux symbols out into vmlinux.symvers, and reuse it in the
   second pass of modpost

 - do not run the second pass of modpost if nothing in modules is
   updated

 - install modules.builtin(.modinfo) by 'make install' as well as by
   'make modules_install' because it is useful even when
   CONFIG_MODULES=n

 - add new command line variables, GZIP, BZIP2, LZOP, LZMA, LZ4, and XZ
   to allow users to use alternatives such as pigz, pbzip2, etc.

* tag 'kbuild-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (96 commits)
  kbuild: add variables for compression tools
  Makefile: install modules.builtin even if CONFIG_MODULES=n
  mksysmap: Fix the mismatch of '.L' symbols in System.map
  kbuild: doc: rename LDFLAGS to KBUILD_LDFLAGS
  modpost: change elf_info->size to size_t
  modpost: remove is_vmlinux() helper
  modpost: strip .o from modname before calling new_module()
  modpost: set have_vmlinux in new_module()
  modpost: remove mod->skip struct member
  modpost: add mod->is_vmlinux struct member
  modpost: remove is_vmlinux() call in check_for_{gpl_usage,unused}()
  modpost: remove mod->is_dot_o struct member
  modpost: move -d option in scripts/Makefile.modpost
  modpost: remove -s option
  modpost: remove get_next_text() and make {grab,release_}file static
  modpost: use read_text_file() and get_line() for reading text files
  modpost: avoid false-positive file open error
  modpost: fix potential mmap'ed file overrun in get_src_version()
  modpost: add read_text_file() and get_line() helpers
  modpost: do not call get_modinfo() for vmlinux(.o)
  ...
2020-06-06 12:00:25 -07:00
Linus Torvalds
1ee18de929 dma-mapping updates for 5.8, part 1
- enhance the dma pool to allow atomic allocation on x86 with AMD SEV
    (David Rientjes)
  - two small cleanups (Jason Yan and Peter Collingbourne)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl7bvTULHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYMJVhAAgTiWNzxPJhM6RTeRooM6W0NvcZGTJT6ExyJghaau
 aJvHUjXPrRmeBM8Zjwbbu5dioncd8c7npfRjBvATaEL74pa1u9gH3jnUTxh6L4WQ
 /FTNYryZVbprXJsdFuDZvCsO/CChqfZL8PWz+NFgIpICOyyXdorQELMhCaeOhnfU
 /goq6SvKmPlmXdb4eM2fXRD7udt1qlp+Oq2EZUdT3Xb4CBFsWUYbOMde22VY390Z
 2E9mEztOaKjNgAM/TfCoXo7iRUSwxcpO5aSliDhJJ/7uWaxyWTzFlaoIlwIkkNKb
 TcguNJbIZtjIXwBMv9gS6CqVEgFymmWqX5Tr23+vbb7S/235HqKtN1dPmV2h4R0H
 QOpvYXfm6kc4tpH4J32NMp+IqfQmwgMbNtUsiXWk5Lxl27cb8K2Q5eqEwxRWMbG+
 HObO7Kzb8oCygWwozZ+3QcWSr+9QAgzsb4Jl4jg6adjd8LDcbmKo4B9TKptGpVnL
 xjDleKdb/P4Vq55q9KHFLjqFUesuQIv2mKl2s+zr2BqROxjZ562kM9QHwsoCqc4Q
 tFuVed+XOoT7yhdKdtwEK7lwcQBtZgP5l/HgsoosmuJ975holsQ4pbKSf4A2Y4yo
 XwHYonSwOAEbi4nPxnvKIm4aUNq+PC44TH0VJcXud3tmQ/DGipdlLW8/nyw9ecfa
 qaQ=
 =GT3J
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.8' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - enhance the dma pool to allow atomic allocation on x86 with AMD SEV
   (David Rientjes)

 - two small cleanups (Jason Yan and Peter Collingbourne)

* tag 'dma-mapping-5.8' of git://git.infradead.org/users/hch/dma-mapping:
  dma-contiguous: fix comment for dma_release_from_contiguous
  dma-pool: scale the default DMA coherent pool size with memory capacity
  x86/mm: unencrypted non-blocking DMA allocations use coherent pools
  dma-pool: add pool sizes to debugfs
  dma-direct: atomic allocations must come from atomic coherent pools
  dma-pool: dynamically expanding atomic pools
  dma-pool: add additional coherent pools to map to gfp mask
  dma-remap: separate DMA atomic pools from direct remap code
  dma-debug: make __dma_entry_alloc_check_leak() static
2020-06-06 11:43:23 -07:00
Linus Torvalds
fe3bc8a988 Merge branch 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Mostly cleanups and other trivial changes.

  The only interesting change is Sebastian's rcuwait conversion for RT"

* 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: use BUILD_BUG_ON() for compile time test instead of WARN_ON()
  workqueue: fix a piece of comment about reserved bits for work flags
  workqueue: remove useless unlock() and lock() in series
  workqueue: void unneeded requeuing the pwq in rescuer thread
  workqueue: Convert the pool::lock and wq_mayday_lock to raw_spinlock_t
  workqueue: Use rcuwait for wq_manager_wait
  workqueue: Remove unnecessary kfree() call in rcu_free_wq()
  workqueue: Fix an use after free in init_rescuer()
  workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.
2020-06-06 10:01:48 -07:00
Linus Torvalds
4a7e89c5ec Merge branch 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Just two patches: one to add system-level cpu.stat to the root cgroup
  for convenience and a trivial comment update"

* 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: add cpu.stat file to root cgroup
  cgroup: Remove stale comments
2020-06-06 09:59:34 -07:00
Denis Efremov
8dfb61dcba kbuild: add variables for compression tools
Allow user to use alternative implementations of compression tools,
such as pigz, pbzip2, pxz. For example, multi-threaded tools to
speed up the build:
$ make GZIP=pigz BZIP2=pbzip2

Variables _GZIP, _BZIP2, _LZOP are used internally because original env
vars are reserved by the tools. The use of GZIP in gzip tool is obsolete
since 2015. However, alternative implementations (e.g., pigz) still rely
on it. BZIP2, BZIP, LZOP vars are not obsolescent.

The credit goes to @grsecurity.

As a sidenote, for multi-threaded lzma, xz compression one can use:
$ export XZ_OPT="--threads=0"

Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-06 23:42:01 +09:00
Mel Gorman
388d8bdb87 tracing: Remove obsolete PREEMPTIRQ_EVENTS kconfig option
The PREEMPTIRQ_EVENTS option is unused after commit c3bc8fd637 ("tracing:
Centralize preemptirq tracepoints and unify their usage"). Remove it.

Note that this option is hazardous as it stands. It enables TRACE_IRQFLAGS
event on non-preempt configurations without the irqsoff tracer enabled.
TRACE_IRQFLAGS as it stands incurs significant overhead on each IRQ
entry/exit. This is because trace_hardirqs_[on|off] does all the per-cpu
manipulations and NMI checks even if tracing is completely disabled for
some insane reason.  For example, netperf running UDP_STREAM on localhost
incurs a 4-6% performance penalty without any tracing if IRQFLAGS is
set. It can be put behind a static brach but even the function entry/exit
costs a little bit.

Link: https://lkml.kernel.org/r/20200409104034.GJ3818@techsingularity.net

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-05 20:08:26 -04:00
Linus Torvalds
7ae77150d9 powerpc updates for 5.8
- Support for userspace to send requests directly to the on-chip GZIP
    accelerator on Power9.
 
  - Rework of our lockless page table walking (__find_linux_pte()) to make it
    safe against parallel page table manipulations without relying on an IPI for
    serialisation.
 
  - A series of fixes & enhancements to make our machine check handling more
    robust.
 
  - Lots of plumbing to add support for "prefixed" (64-bit) instructions on
    Power10.
 
  - Support for using huge pages for the linear mapping on 8xx (32-bit).
 
  - Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound driver.
 
  - Removal of some obsolete 40x platforms and associated cruft.
 
  - Initial support for booting on Power10.
 
  - Lots of other small features, cleanups & fixes.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Andrey Abramov,
   Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent Abali, Cédric Le
   Goater, Chen Zhou, Christian Zigotzky, Christophe JAILLET, Christophe Leroy,
   Dmitry Torokhov, Emmanuel Nicolet, Erhard F., Gautham R. Shenoy, Geoff Levand,
   George Spelvin, Greg Kurz, Gustavo A. R. Silva, Gustavo Walbon, Haren Myneni,
   Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Kees Cook, Leonardo
   Bras, Madhavan Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael
   Neuling, Michal Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao,
   Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram
   Pai, Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
   Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler, Wolfram
   Sang, Xiongfeng Wang.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl7aYZ8THG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPiKD/9zNCuZLFMAFrIdbm0HlYA2RGYZFT75
 GUHsqYyei1pxA7PgM3KwJiXELVODsBv0eQbgNh1tbecKrxPRegN/cywd1KLjPZ7I
 v5/qweQP8MvR0RhzjbhvUcO0jq/f8u2LbJr5mUfVzjU6tAvrvcWo3oZqDElsekCS
 kgyOH3r1vZ2PLTMiGFhb0gWi2iqc+6BHU1AFCGPCMjB1Vu5d5+54VvZ/6lllGsOF
 yg9CBXmmVvQ+Bn6tH4zdEB78FYxnAIwBqlbmL79i5ca+HQJ0Sw6HuPRy9XYq35p6
 2EiXS4Wrgp7i7+1TN3HO362u5Onb8TSyQU7NS6yCFPoJ6JQxcJMBIw6mHhnXOPuZ
 CrjgcdwUMjx8uDoKmX1Epbfuex2w+AysW+4yBHPFiSgl3klKC3D0wi95mR485w2F
 rN8uzJtrDeFKcYZJG7IoB/cgFCCPKGf9HaXr8q0S/jBKMffx91ul3cfzlfdIXOCw
 FDNw/+ZX7UD6ddFEG12ZTO+vdL8yf1uCRT/DIZwUiDMIA0+M6F4nc7j3lfyZfoO1
 65f9UlhoLxScq7VH2fKH4UtZatO9cPID2z1CmiY4UbUIPtFDepSuYClgLF+Duf4b
 rkfxhKU0+Ja1zNH5XNc+L+Bc5/W4lFiJXz02dYIjtHoUpWkc1aToOETVwzggYFNM
 G3PXIBOI0jRgRw==
 =o0WU
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Support for userspace to send requests directly to the on-chip GZIP
   accelerator on Power9.

 - Rework of our lockless page table walking (__find_linux_pte()) to
   make it safe against parallel page table manipulations without
   relying on an IPI for serialisation.

 - A series of fixes & enhancements to make our machine check handling
   more robust.

 - Lots of plumbing to add support for "prefixed" (64-bit) instructions
   on Power10.

 - Support for using huge pages for the linear mapping on 8xx (32-bit).

 - Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound
   driver.

 - Removal of some obsolete 40x platforms and associated cruft.

 - Initial support for booting on Power10.

 - Lots of other small features, cleanups & fixes.

Thanks to: Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Andrey Abramov, Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent
Abali, Cédric Le Goater, Chen Zhou, Christian Zigotzky, Christophe
JAILLET, Christophe Leroy, Dmitry Torokhov, Emmanuel Nicolet, Erhard F.,
Gautham R. Shenoy, Geoff Levand, George Spelvin, Greg Kurz, Gustavo A.
R. Silva, Gustavo Walbon, Haren Myneni, Hari Bathini, Joel Stanley,
Jordan Niethe, Kajol Jain, Kees Cook, Leonardo Bras, Madhavan
Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Michal
Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram Pai,
Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler,
Wolfram Sang, Xiongfeng Wang.

* tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (299 commits)
  powerpc/pseries: Make vio and ibmebus initcalls pseries specific
  cxl: Remove dead Kconfig options
  powerpc: Add POWER10 architected mode
  powerpc/dt_cpu_ftrs: Add MMA feature
  powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
  powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
  powerpc: Add support for ISA v3.1
  powerpc: Add new HWCAP bits
  powerpc/64s: Don't set FSCR bits in INIT_THREAD
  powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
  powerpc/64s: Don't let DT CPU features set FSCR_DSCR
  powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
  powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG
  powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel
  powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocations
  powerpc/module_64: Consolidate ftrace code
  powerpc/32: Disable KASAN with pages bigger than 16k
  powerpc/uaccess: Don't set KUEP by default on book3s/32
  powerpc/uaccess: Don't set KUAP by default on book3s/32
  powerpc/8xx: Reduce time spent in allow_user_access() and friends
  ...
2020-06-05 12:39:30 -07:00
Linus Torvalds
084623e468 Modules updates for v5.8
Summary of modules changes for the 5.8 merge window:
 
 - Harden CONFIG_STRICT_MODULE_RWX by rejecting any module that has
   SHF_WRITE|SHF_EXECINSTR sections
 - Remove and clean up nested #ifdefs, as it makes code hard to read
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEVrp26glSWYuDNrCUwEV+OM47wXIFAl7Z+qkQHGpleXVAa2Vy
 bmVsLm9yZwAKCRDARX44zjvBcm+vEACo/JelhaEDjjLzVuTBN4FQfXgBRbh6CGeF
 br7J1VtI9B0Coun0wfrDewwDieB2uHXQqivdC+KzNGXrBVhvFNLWuTTFcG7yce0K
 KZlbVREpiTRlenEa6A5EdIdvH0Ttg/+PFfkvOvvQOrbx5woZKptG49VdII9mhPuE
 LKnZk1xrK6jQLbOCPtUjyfB+eLqi0swhwstcfdIPUXsi2HtuLKmu7JPRpbW3Yz1v
 0Y9xix7ByTSg+wsphiKgvDnoJ9TYC3bFlAwpw+A+tsOKE0pmyetWKvkJ/hH42W9D
 7w6odSlG85d4ZExO2K8fHHsKmW2HgWx0cgLFf91sXCALqRYNjwTRkXH2OBXggLsz
 n3k8PmTRkDwS1oElrNG9E4LjKhjFbZZpP4kz+VlVe7YAsnlYpOBOcroTuaiuSV/A
 xmAP6mHgVAeUugwF5vHAPV4mwGJBLUVcvHENzge4jnyl/rSHsWE7hhwLnik9qkua
 cWINAVhBJpQvk3nrDr0dmUAXRJ7CXXtyPBbNh8ivvA4XeDNBLcYr8mcHPLpAoN4C
 xVcTRBxDXTRkw246zlQRcTySVpovtSv2C+IiaaHxA3j7lo4C1xhj6ihUIKz3Z1/A
 70zvBxmg3zpMwZWiRGeasAIu69a0qvBh0AOfVTbHBrIp5RokibtdoyuOu7ZGOU9B
 Un1WJNrSYQ==
 =OlVv
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:

 - Harden CONFIG_STRICT_MODULE_RWX by rejecting any module that has
   SHF_WRITE|SHF_EXECINSTR sections

 - Remove and clean up nested #ifdefs, as it makes code hard to read

* tag 'modules-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: Harden STRICT_MODULE_RWX
  module: break nested ARCH_HAS_STRICT_MODULE_RWX and STRICT_MODULE_RWX #ifdefs
2020-06-05 12:31:16 -07:00
Christophe JAILLET
afd8d7c7f9 PM: hibernate: Add __init annotation to swsusp_header_init()
'swsusp_header_init()' is only called via 'core_initcall'.
It can be marked as __init to save a few bytes of memory.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-05 13:52:38 +02:00
Chaitanya Kulkarni
5aec598c45 blktrace: fix endianness for blk_log_remap()
The function blk_log_remap() can be simplified by removing the
call to get_pdu_remap() that copies the values into extra variable to
print the data, which also fixes the endiannness warning reported by
sparse.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 21:23:38 -06:00
Chaitanya Kulkarni
71df3fd82e blktrace: fix endianness in get_pdu_int()
In function get_pdu_len() replace variable type from __u64 to
__be64. This fixes sparse warning.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 21:23:38 -06:00
Chaitanya Kulkarni
48bc3cd3e0 blktrace: use errno instead of bi_status
In blk_add_trace_spliti() blk_add_trace_bio_remap() use
blk_status_to_errno() to pass the error instead of pasing the bi_status.
This fixes the sparse warning.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 21:23:38 -06:00
Christoph Hellwig
d24de76af8 block: remove the error argument to the block_bio_complete tracepoint
The status can be trivially derived from the bio itself.  That also avoid
callers like NVMe to incorrectly pass a blk_status_t instead of the errno,
and the overhead of translating the blk_status_t to the errno in the I/O
completion fast path when no tracing is enabled.

Fixes: 35fe0d12c8 ("nvme: trace bio completion")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 21:16:11 -06:00
Linus Torvalds
435faf5c21 RISC-V Patches for the 5.8 Merge Window, Part 1
* The remainder of the code necessary to support the Kendryte K210.
     * Support for building device trees into the kernel, as the K210 doesn't
       have a bootloader that provides one.
     * A K210 device tree and the associated defconfig update.
     * Support for skipping PMP initialization on systems that trap on PMP
       accesses rather than treating them as WARL.
 * Support for KGDB.
 * Improvements to text patching.
 * Some cleanups to the SiFive L2 cache driver.
 
 I may have a second part, but I wanted to get this out earlier rather than
 later as they've been ready to go for a while now.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAl7YRtMTHHBhbG1lckBk
 YWJiZWx0LmNvbQAKCRAuExnzX7sYif+oD/4kaoQqmiw0VvyFjBlPT/TC7OPbaoXO
 5WBegSEAl+oOq+fUtjHomzn6HaTW2cPNexsqC/KuGP1/QL0OQ23T5O18LaCmrGMd
 CvELeAcvu6G4kbRuHwF0U16hBI2A1xIYVUJ8qmbuHPpEwVel3AsP2RIjQ+bOvm7g
 fVTIrlzl1s1eTsOXdOJb3sFZtISHb7o3lHZmcplDva3N13x8E0FrjuoJhHv0f2mj
 kcmZcy3sr8luAkwAJmbqmdPuwYTlteMnucXdCUv/kG2SaPkBwU5TS+MN7No2ylXH
 v4vxjkuSyJ1db72pla6uUB/cYD+mh2YI4tjjvH93s3k4I/GZMCrqivXZW2QtxBdt
 na++GxK8e4FrBG+VCCCQvx2mqugpMFMDnZDt36N2BqwjOtEzO3W8GqjrKCovXtQf
 farlgAiuFNVcoo7VFdpNZE+N5elQa0GWf57csCqqMNN4J8JLaOgyS7ByozJ9b9/A
 1x4g7kQANnMvYiYAeV+C2YfBcK6x8gHEPlWio1QZqirW5la34VUEWa18+rQeSRMz
 DArKPauCR1GuYZZVUkkDI8cXj0Gs5fTXsYmn+WsIpjFHnDYeYVe6Ocmczxyb3o3O
 lQwm7+drHlSmMvgrOlWJMrIUbdiomgelDU/iD4PiZzKozC52yKDS6A5gzoG6uZSM
 UleeAIeVG/TDCw==
 =keTq
 -----END PGP SIGNATURE-----

Merge tag 'riscv-for-linus-5.8-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:

 - The remainder of the code necessary to support the Kendryte K210:

     * Support for building device trees into the kernel, as the K210
       doesn't have a bootloader that provides one

     * A K210 device tree and the associated defconfig update

     * Support for skipping PMP initialization on systems that trap on
       PMP accesses rather than treating them as WARL

 - Support for KGDB

 - Improvements to text patching

 - Some cleanups to the SiFive L2 cache driver

* tag 'riscv-for-linus-5.8-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  soc: sifive: l2 cache: Mark l2_get_priv_group as static
  soc: sifive: l2 cache: Eliminate an unsigned zero compare warning
  riscv: Add support to determine no. of L2 cache way enabled
  riscv: cacheinfo: Implement cache_get_priv_group with a generic ops structure
  riscv: Use text_mutex instead of patch_lock
  riscv: Use NOKPROBE_SYMBOL() instead of __krpobes annotation
  riscv: Remove the 'riscv_' prefix of function name
  riscv: Add SW single-step support for KDB
  riscv: Use the XML target descriptions to report 3 system registers
  riscv: Add KGDB support
  kgdb: Add kgdb_has_hit_break function
  RISC-V: Skip setting up PMPs on traps
  riscv: K210: Update defconfig
  riscv: K210: Add a built-in device tree
  riscv: Allow device trees to be built into the kernel
2020-06-04 20:14:18 -07:00
Linus Torvalds
828f3e18e1 ARM/SoC: drivers for v5.7
These are updates to SoC specific drivers that did not have
 another subsystem maintainer tree to go through for some
 reason:
 
 - Some bus and memory drivers for the MIPS P5600 based
   Baikal-T1 SoC that is getting added through the MIPS tree.
 
 - There are new soc_device identification drivers for TI K3,
   Qualcomm MSM8939
 
 - New reset controller drivers for NXP i.MX8MP, Renesas
   RZ/G1H, and Hisilicon hi6220
 
 - The SCMI firmware interface can now work across ARM SMC/HVC
   as a transport.
 
 - Mediatek platforms now use a new driver for their "MMSYS"
   hardware block that controls clocks and some other aspects
   in behalf of the media and gpu drivers.
 
 - Some Tegra processors have improved power management
   support, including getting woken up by the PMIC and cluster
   power down during idle.
 
 - A new v4l staging driver for Tegra is added.
 
 - Cleanups and minor bugfixes for TI, NXP, Hisilicon,
   Mediatek, and Tegra.
 
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAl7XvgAACgkQmmx57+YA
 GNmj/hAAnAJ/hYehLfgCe711HUntgeRkaoTVpCt8BJNMdxsa23sn3V6k5+WYn1uG
 PtlgpefZEMHLUEEVDegR4nZXLG0Pzu1SR12KW34YPcQKkNo/+vlQ9zYUajnJ/KX6
 10zdLSIzHfk1VtXKvvQQ8xFyE+S/trGmjC57E6gfoCUT3rl1maD+ccVXUBaz9oob
 wuMxGXQAl57mio5yT1OfSk6Fev39xRE2dN1hzP7KUYhsemZajBwBBW5wVJZCsCB8
 LCGmxVkavM7BV4r2NokbBDs5rlfedBl/P/IPd9Is5a5tuGUkSsVRG9zqShxYLGM3
 S06az6POQFwXKFJoUKW0dK/Koy0D7BK+vhUBPzFv4HZ8iDCVf6Jju2MJ02GMqHPj
 OOrXaCbLYrvN/edVUWeeFywqwMbYTRwC4DxyTq5m7HxEB004xTOhs0rX5aR0u4n1
 bbsR97LguolwH9iEMzd3F3jCiKBcMecH3lAh5WcrtwlFIRrNhbWoGDoA/4TuORFS
 b11rgsTRIJ5Vc++D1HnSnx0ZZvUzyluMvygdALnSgVah6xYe6KVw9Kg/wioAJ04G
 uSTidqP3qRhsyET2HQo7CxdVfZbKfP25iKCKrdhQziztKvhF8qrUmZKloXOodRw+
 ewYSRmv8c324OYYit1X43oAdW8dntq1XbSIauaqxEb4JC7x/xRY=
 =44Jc
 -----END PGP SIGNATURE-----

Merge tag 'arm-drivers-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull ARM/SoC driver updates from Arnd Bergmann:
 "These are updates to SoC specific drivers that did not have another
  subsystem maintainer tree to go through for some reason:

   - Some bus and memory drivers for the MIPS P5600 based Baikal-T1 SoC
     that is getting added through the MIPS tree.

   - There are new soc_device identification drivers for TI K3, Qualcomm
     MSM8939

   - New reset controller drivers for NXP i.MX8MP, Renesas RZ/G1H, and
     Hisilicon hi6220

   - The SCMI firmware interface can now work across ARM SMC/HVC as a
     transport.

   - Mediatek platforms now use a new driver for their "MMSYS" hardware
     block that controls clocks and some other aspects in behalf of the
     media and gpu drivers.

   - Some Tegra processors have improved power management support,
     including getting woken up by the PMIC and cluster power down
     during idle.

   - A new v4l staging driver for Tegra is added.

   - Cleanups and minor bugfixes for TI, NXP, Hisilicon, Mediatek, and
     Tegra"

* tag 'arm-drivers-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (155 commits)
  clk: sprd: fix compile-testing
  bus: bt1-axi: Build the driver into the kernel
  bus: bt1-apb: Build the driver into the kernel
  bus: bt1-axi: Use sysfs_streq instead of strncmp
  bus: bt1-axi: Optimize the return points in the driver
  bus: bt1-apb: Use sysfs_streq instead of strncmp
  bus: bt1-apb: Use PTR_ERR_OR_ZERO to return from request-regs method
  bus: bt1-apb: Fix show/store callback identations
  bus: bt1-apb: Include linux/io.h
  dt-bindings: memory: Add Baikal-T1 L2-cache Control Block binding
  memory: Add Baikal-T1 L2-cache Control Block driver
  bus: Add Baikal-T1 APB-bus driver
  bus: Add Baikal-T1 AXI-bus driver
  dt-bindings: bus: Add Baikal-T1 APB-bus binding
  dt-bindings: bus: Add Baikal-T1 AXI-bus binding
  staging: tegra-video: fix V4L2 dependency
  tee: fix crypto select
  drivers: soc: ti: knav_qmss_queue: Make knav_gp_range_ops static
  soc: ti: add k3 platforms chipid module driver
  dt-bindings: soc: ti: add binding for k3 platforms chipid module
  ...
2020-06-04 19:56:20 -07:00
Linus Torvalds
886d7de631 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - More MM work. 100ish more to go. Mike Rapoport's "mm: remove
   __ARCH_HAS_5LEVEL_HACK" series should fix the current ppc issue

 - Various other little subsystems

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  lib/ubsan.c: fix gcc-10 warnings
  tools/testing/selftests/vm: remove duplicate headers
  selftests: vm: pkeys: fix multilib builds for x86
  selftests: vm: pkeys: use the correct page size on powerpc
  selftests/vm/pkeys: override access right definitions on powerpc
  selftests/vm/pkeys: test correct behaviour of pkey-0
  selftests/vm/pkeys: introduce a sub-page allocator
  selftests/vm/pkeys: detect write violation on a mapped access-denied-key page
  selftests/vm/pkeys: associate key on a mapped page and detect write violation
  selftests/vm/pkeys: associate key on a mapped page and detect access violation
  selftests/vm/pkeys: improve checks to determine pkey support
  selftests/vm/pkeys: fix assertion in test_pkey_alloc_exhaust()
  selftests/vm/pkeys: fix number of reserved powerpc pkeys
  selftests/vm/pkeys: introduce powerpc support
  selftests/vm/pkeys: introduce generic pkey abstractions
  selftests: vm: pkeys: use the correct huge page size
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
  selftests/vm/pkeys: fix assertion in pkey_disable_set/clear()
  selftests/vm/pkeys: fix pkey_disable_clear()
  selftests: vm: pkeys: add helpers for pkey bits
  ...
2020-06-04 19:18:29 -07:00
Pengcheng Yang
341a7213e5 kernel/relay.c: fix read_pos error when multiple readers
When reading, read_pos should start with bytes_consumed, not file->f_pos.
Because when there is more than one reader, the read_pos corresponding to
file->f_pos may have been consumed, which will cause the data that has
been consumed to be read and the bytes_consumed update error.

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>e
Link: http://lkml.kernel.org/r/1579691175-28949-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:26 -07:00
Daniel Axtens
54e200ab40 kernel/relay.c: handle alloc_percpu returning NULL in relay_open
alloc_percpu() may return NULL, which means chan->buf may be set to NULL.
In that case, when we do *per_cpu_ptr(chan->buf, ...), we dereference an
invalid pointer:

  BUG: Unable to handle kernel data access at 0x7dae0000
  Faulting instruction address: 0xc0000000003f3fec
  ...
  NIP relay_open+0x29c/0x600
  LR relay_open+0x270/0x600
  Call Trace:
     relay_open+0x264/0x600 (unreliable)
     __blk_trace_setup+0x254/0x600
     blk_trace_setup+0x68/0xa0
     sg_ioctl+0x7bc/0x2e80
     do_vfs_ioctl+0x13c/0x1300
     ksys_ioctl+0x94/0x130
     sys_ioctl+0x48/0xb0
     system_call+0x5c/0x68

Check if alloc_percpu returns NULL.

This was found by syzkaller both on x86 and powerpc, and the reproducer
it found on powerpc is capable of hitting the issue as an unprivileged
user.

Fixes: 017c59c042 ("relay: Use per CPU constructs for the relay channel buffer pointers")
Reported-by: syzbot+1e925b4b836afe85a1c6@syzkaller-ppc64.appspotmail.com
Reported-by: syzbot+587b2421926808309d21@syzkaller-ppc64.appspotmail.com
Reported-by: syzbot+58320b7171734bf79d26@syzkaller.appspotmail.com
Reported-by: syzbot+d6074fb08bdb2e010520@syzkaller.appspotmail.com
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Akash Goel <akash.goel@intel.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Salvatore Bonaccorso <carnil@debian.org>
Cc: <stable@vger.kernel.org>	[4.10+]
Link: http://lkml.kernel.org/r/20191219121256.26480-1-dja@axtens.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:26 -07:00
Kefeng Wang
eac2cece45 kernel/kprobes.c: convert to use DEFINE_SEQ_ATTRIBUTE macro
Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200509064031.181091-4-wangkefeng.wang@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:26 -07:00
Jason Yan
de83dbd97f user.c: make uidhash_table static
Fix the following sparse warning:

  kernel/user.c:85:19: warning: symbol 'uidhash_table' was not declared.
  Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200413082146.22737-1-yanaijie@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:24 -07:00
David Hildenbrand
3fe4f4991a kexec_file: don't place kexec images on IORESOURCE_MEM_DRIVER_MANAGED
Memory flagged with IORESOURCE_MEM_DRIVER_MANAGED is special - it won't be
part of the initial memmap of the kexec kernel and not all memory might be
accessible.  Don't place any kexec images onto it.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/20200508084217.9160-4-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:23 -07:00
Andrey Konovalov
5ff3b30ab5 kcov: collect coverage from interrupts
This change extends kcov remote coverage support to allow collecting
coverage from soft interrupts in addition to kernel background threads.

To collect coverage from code that is executed in softirq context, a part
of that code has to be annotated with kcov_remote_start/stop() in a
similar way as how it is done for global kernel background threads.  Then
the handle used for the annotations has to be passed to the
KCOV_REMOTE_ENABLE ioctl.

Internally this patch adjusts the __sanitizer_cov_trace_pc() compiler
inserted callback to not bail out when called from softirq context.
kcov_remote_start/stop() are updated to save/restore the current per task
kcov state in a per-cpu area (in case the softirq came when the kernel was
already collecting coverage in task context).  Coverage from softirqs is
collected into pre-allocated per-cpu areas, whose size is controlled by
the new CONFIG_KCOV_IRQ_AREA_SIZE.

[andreyknvl@google.com: turn current->kcov_softirq into unsigned int to fix objtool warning]
  Link: http://lkml.kernel.org/r/841c778aa3849c5cb8c3761f56b87ce653a88671.1585233617.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Link: http://lkml.kernel.org/r/469bd385c431d050bc38a593296eff4baae50666.1584655448.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:20 -07:00
Andrey Konovalov
5fe7042dc0 kcov: use t->kcov_mode as enabled indicator
Currently kcov_remote_start() and kcov_remote_stop() check t->kcov to find
out whether the coverage is already being collected by the current task.
Use t->kcov_mode for that instead.  This doesn't change the overall
behavior in any way, but serves as a preparation for the following softirq
coverage collection support patch.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Link: http://lkml.kernel.org/r/f70377945d1d8e6e4916cbce871a12303d6186b4.1585233617.git.andreyknvl@google.com
Link: http://lkml.kernel.org/r/ee1a1dec43059da5d7664c85c1addc89c4cd58de.1584655448.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:20 -07:00
Andrey Konovalov
eeb91f9a2e kcov: move t->kcov_sequence assignment
Move t->kcov_sequence assignment before assigning t->kcov_mode for
consistency.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Link: http://lkml.kernel.org/r/5889efe35e0b300e69dba97216b1288d9c2428a8.1585233617.git.andreyknvl@google.com
Link: http://lkml.kernel.org/r/f0283c676bab3335cb48bfe12d375a3da4719f59.1584655448.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:20 -07:00
Andrey Konovalov
76484b1c77 kcov: move t->kcov assignments into kcov_start/stop
Every time kcov_start/stop() is called, t->kcov is also assigned, so move
the assignment into the functions.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Link: http://lkml.kernel.org/r/6644839d3567df61ade3c4b246a46cacbe4f9e11.1585233617.git.andreyknvl@google.com
Link: http://lkml.kernel.org/r/82625ef3ff878f0b585763cc31d09d9b08ca37d6.1584655448.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:20 -07:00
Andrey Konovalov
67b3d3cca3 kcov: fix potential use-after-free in kcov_remote_start
If vmalloc() fails in kcov_remote_start() we'll access remote->kcov
without holding kcov_remote_lock, so remote might potentially be freed at
that point.  Cache kcov pointer in a local variable.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Link: http://lkml.kernel.org/r/9d9134359725a965627b7e8f2652069f86f1d1fa.1585233617.git.andreyknvl@google.com
Link: http://lkml.kernel.org/r/de0d3d30ff90776a2a509cc34c7c1c7521bda125.1584655448.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:20 -07:00
Andrey Konovalov
3c61df3885 kcov: cleanup debug messages
Patch series "kcov: collect coverage from usb soft interrupts", v4.

This patchset extends kcov to allow collecting coverage from soft
interrupts and then uses the new functionality to collect coverage from
USB code.

This has allowed to find at least one new HID bug [1], which was recently
fixed by Alan [2].

[1] https://syzkaller.appspot.com/bug?extid=09ef48aa58261464b621
[2] https://patchwork.kernel.org/patch/11283319/

Any subsystem that uses softirqs (e.g. timers) can make use of this in
the future. Looking at the recent syzbot reports, an obvious candidate
is the networking subsystem [3, 4, 5 and many more].

[3] https://syzkaller.appspot.com/bug?extid=522ab502c69badc66ab7
[4] https://syzkaller.appspot.com/bug?extid=57f89d05946c53dbbb31
[5] https://syzkaller.appspot.com/bug?extid=df358e65d9c1b9d3f5f4

This pach (of 7):

Previous commit left a lot of excessive debug messages, clean them up.

Link; http://lkml.kernel.org/r/cover.1585233617.git.andreyknvl@google.com
Link; http://lkml.kernel.org/r/ab5e2885ce674ba6e04368551e51eeb6a2c11baf.1585233617.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Link: http://lkml.kernel.org/r/4a497134b2cf7a9d306d28e3dd2746f5446d1605.1584655448.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:20 -07:00
Dan Carpenter
e7ed83d6fa bpf: Fix an error code in check_btf_func()
This code returns success if the "info_aux" allocation fails but it
should return -ENOMEM.

Fixes: 8c1b6e69dc ("bpf: Compare BTF types of functions arguments with actual types")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200604085436.GA943001@mwanda
2020-06-04 23:38:54 +02:00
Linus Torvalds
15a2bc4dbb Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "Last cycle for the Nth time I ran into bugs and quality of
  implementation issues related to exec that could not be easily be
  fixed because of the way exec is implemented. So I have been digging
  into exec and cleanup up what I can.

  I don't think I have exec sorted out enough to fix the issues I
  started with but I have made some headway this cycle with 4 sets of
  changes.

   - promised cleanups after introducing exec_update_mutex

   - trivial cleanups for exec

   - control flow simplifications

   - remove the recomputation of bprm->cred

  The net result is code that is a bit easier to understand and work
  with and a decrease in the number of lines of code (if you don't count
  the added tests)"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (24 commits)
  exec: Compute file based creds only once
  exec: Add a per bprm->file version of per_clear
  binfmt_elf_fdpic: fix execfd build regression
  selftests/exec: Add binfmt_script regression test
  exec: Remove recursion from search_binary_handler
  exec: Generic execfd support
  exec/binfmt_script: Don't modify bprm->buf and then return -ENOEXEC
  exec: Move the call of prepare_binprm into search_binary_handler
  exec: Allow load_misc_binary to call prepare_binprm unconditionally
  exec: Convert security_bprm_set_creds into security_bprm_repopulate_creds
  exec: Factor security_bprm_creds_for_exec out of security_bprm_set_creds
  exec: Teach prepare_exec_creds how exec treats uids & gids
  exec: Set the point of no return sooner
  exec: Move handling of the point of no return to the top level
  exec: Run sync_mm_rss before taking exec_update_mutex
  exec: Fix spelling of search_binary_handler in a comment
  exec: Move the comment from above de_thread to above unshare_sighand
  exec: Rename flush_old_exec begin_new_exec
  exec: Move most of setup_new_exec into flush_old_exec
  exec: In setup_new_exec cache current in the local variable me
  ...
2020-06-04 14:07:08 -07:00
Linus Torvalds
9ff7258575 Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc updates from Eric Biederman:
 "This has four sets of changes:

   - modernize proc to support multiple private instances

   - ensure we see the exit of each process tid exactly

   - remove has_group_leader_pid

   - use pids not tasks in posix-cpu-timers lookup

  Alexey updated proc so each mount of proc uses a new superblock. This
  allows people to actually use mount options with proc with no fear of
  messing up another mount of proc. Given the kernel's internal mounts
  of proc for things like uml this was a real problem, and resulted in
  Android's hidepid mount options being ignored and introducing security
  issues.

  The rest of the changes are small cleanups and fixes that came out of
  my work to allow this change to proc. In essence it is swapping the
  pids in de_thread during exec which removes a special case the code
  had to handle. Then updating the code to stop handling that special
  case"

* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: proc_pid_ns takes super_block as an argument
  remove the no longer needed pid_alive() check in __task_pid_nr_ns()
  posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
  posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
  posix-cpu-timers: Extend rcu_read_lock removing task_struct references
  signal: Remove has_group_leader_pid
  exec: Remove BUG_ON(has_group_leader_pid)
  posix-cpu-timer:  Unify the now redundant code in lookup_task
  posix-cpu-timer: Tidy up group_leader logic in lookup_task
  proc: Ensure we see the exit of each process tid exactly once
  rculist: Add hlists_swap_heads_rcu
  proc: Use PIDTYPE_TGID in next_tgid
  Use proc_pid_ns() to get pid_namespace from the proc superblock
  proc: use named enums for better readability
  proc: use human-readable values for hidepid
  docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
  proc: add option to mount only a pids subset
  proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
  proc: allow to mount many instances of proc in one pid namespace
  proc: rename struct proc_fs_info to proc_fs_opts
2020-06-04 13:54:34 -07:00
Linus Torvalds
9fb4c5250f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Jiri Kosina:

 - simplifications and improvements for issues Peter Ziljstra found
   during his previous work on W^X cleanups.

   This allows us to remove livepatch arch-specific .klp.arch sections
   and add proper support for jump labels in patched code.

   Also, this patchset removes the last module_disable_ro() usage in the
   tree.

   Patches from Josh Poimboeuf and Peter Zijlstra

 - a few other minor cleanups

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  MAINTAINERS: add lib/livepatch to LIVE PATCHING
  livepatch: add arch-specific headers to MAINTAINERS
  livepatch: Make klp_apply_object_relocs static
  MAINTAINERS: adjust to livepatch .klp.arch removal
  module: Make module_enable_ro() static again
  x86/module: Use text_mutex in apply_relocate_add()
  module: Remove module_disable_ro()
  livepatch: Remove module_disable_ro() usage
  x86/module: Use text_poke() for late relocations
  s390/module: Use s390_kernel_write() for late relocations
  s390: Change s390_kernel_write() return type to match memcpy()
  livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols
  livepatch: Remove .klp.arch
  livepatch: Apply vmlinux-specific KLP relocations early
  livepatch: Disallow vmlinux.ko
2020-06-04 11:13:03 -07:00
Will Deacon
333ed74689 scs: Report SCS usage in bytes rather than number of entries
Fix the SCS debug usage check so that we report the number of bytes
used, rather than the number of entries.

Fixes: 5bbaf9d1fc ("scs: Add support for stack usage debugging")
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Will Deacon <will@kernel.org>
2020-06-04 16:14:56 +01:00
Linus Torvalds
ee01c4d72a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "More mm/ work, plenty more to come

  Subsystems affected by this patch series: slub, memcg, gup, kasan,
  pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
  thp, mmap, kconfig"

* akpm: (131 commits)
  arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
  riscv: support DEBUG_WX
  mm: add DEBUG_WX support
  drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
  mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
  powerpc/mm: drop platform defined pmd_mknotpresent()
  mm: thp: don't need to drain lru cache when splitting and mlocking THP
  hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
  sparc32: register memory occupied by kernel as memblock.memory
  include/linux/memblock.h: fix minor typo and unclear comment
  mm, mempolicy: fix up gup usage in lookup_node
  tools/vm/page_owner_sort.c: filter out unneeded line
  mm: swap: memcg: fix memcg stats for huge pages
  mm: swap: fix vmstats for huge pages
  mm: vmscan: limit the range of LRU type balancing
  mm: vmscan: reclaim writepage is IO cost
  mm: vmscan: determine anon/file pressure balance at the reclaim root
  mm: balance LRU lists based on relative thrashing
  mm: only count actual rotations as LRU reclaim cost
  ...
2020-06-03 20:24:15 -07:00
Johannes Weiner
c843966c55 mm: allow swappiness that prefers reclaiming anon over the file workingset
With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
devices such as zswap, it's possible for swap to be much faster than
filesystems, and for swapping to be preferable over thrashing filesystem
caches.

Allow setting swappiness - which defines the rough relative IO cost of
cache misses between page cache and swap-backed pages - to reflect such
situations by making the swap-preferred range configurable.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
d9eb1ea2bf mm: memcontrol: delete unused lrucare handling
Swapin faults were the last event to charge pages after they had already
been put on the LRU list.  Now that we charge directly on swapin, the
lrucare portion of the charge code is unused.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
9d82c69438 mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API
With the page->mapping requirement gone from memcg, we can charge anon and
file-thp pages in one single step, right after they're allocated.

This removes two out of three API calls - especially the tricky commit
step that needed to happen at just the right time between when the page is
"set up" and when it's "published" - somewhat vague and fluid concepts
that varied by page type.  All we need is a freshly allocated page and a
memcg context to charge.

v2: prevent double charges on pre-allocated hugepages in khugepaged

[hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
  Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
be5d0a74c6 mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter.  This divergence from the
generic VM accounting means unnecessary code overhead, and creates a
dependency for memcg that page->mapping is set up at the time of charging,
so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED.  We use
lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
same way we do for NR_FILE_MAPPED.

With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
counter, this patch finally eliminates the need to have page->mapping set
up at charge time.  However, we need to have page->mem_cgroup set up by
the time rmap runs and does the accounting, so switch the commit and the
rmap callbacks around.

v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Johannes Weiner
3fba69a56e mm: memcontrol: drop @compound parameter from memcg charging API
The memcg charging API carries a boolean @compound parameter that tells
whether the page we're dealing with is a hugepage.
mem_cgroup_commit_charge() has another boolean @lrucare that indicates
whether the page needs LRU locking or not while charging.  The majority of
callsites know those parameters at compile time, which results in a lot of
naked "false, false" argument lists.  This makes for cryptic code and is a
breeding ground for subtle mistakes.

Thankfully, the huge page state can be inferred from the page itself and
doesn't need to be passed along.  This is safe because charging completes
before the page is published and somebody may split it.

Simplify the callsites by removing @compound, and let memcg infer the
state by using hpage_nr_pages() unconditionally.  That function does
PageTransHuge() to identify huge pages, which also helpfully asserts that
nobody passes in tail pages by accident.

The following patches will introduce a new charging API, best not to carry
over unnecessary weight.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Daniel Jordan
004ed42638 padata: add basic support for multithreaded jobs
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting up
the work evenly, specifying a minimum amount of work that's appropriate
for one helper thread to do, load balancing between helpers, and
coordinating them.

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-5-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:45 -07:00
Daniel Jordan
4611ce2246 padata: allocate work structures for parallel jobs from a pool
padata allocates per-CPU, per-instance work structs for parallel jobs.  A
do_parallel call assigns a job to a sequence number and hashes the number
to a CPU, where the job will eventually run using the corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, makes less sense after commit bfde23ce20 ("padata: unbind
parallel jobs from specific CPUs") because a work isn't bound to a
particular CPU anymore, and isn't needed at all for multithreaded jobs
because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current task
instead to throttle a system trying to do too much in parallel.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-4-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:45 -07:00
Daniel Jordan
f1b192b117 padata: initialize earlier
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-3-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:45 -07:00
Daniel Jordan
305dacf779 padata: remove exit routine
Patch series "padata: parallelize deferred page init", v3.

Deferred struct page init is a bottleneck in kernel boot--the biggest for
us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such as
VMM Fast Restart[1], where guest state is preserved across kexec reboot,
it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded jobs,
to handle multithreaded jobs as well by adding support for splitting up
the work evenly, specifying a minimum amount of work that's appropriate
for one helper thread to do, load balancing between helpers, and
coordinating them.  More documentation in patches 4 and 8.

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init, vfio
page pinning, hugetlb fallocate, and munmap.  Deferred page init doesn't
require concurrency limits, resource control, or priority adjustments like
these other users will because it happens during boot when the system is
otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot by
4% to 49%, saving up to 1.6 out of 4 seconds.  Patch 6 has more numbers.

This patch (of 8):

padata_driver_exit() is unnecessary because padata isn't built as a module
and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Josh Triplett <josh@joshtriplett.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Robert Elliott <elliott@hpe.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Link: http://lkml.kernel.org/r/20200527173608.2885243-1-daniel.m.jordan@oracle.com
Link: http://lkml.kernel.org/r/20200527173608.2885243-2-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:45 -07:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Linus Torvalds
ae03c53d00 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "Christoph's assorted splice cleanups"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: rename pipe_buf ->steal to ->try_steal
  fs: make the pipe_buf_operations ->confirm operation optional
  fs: make the pipe_buf_operations ->steal operation optional
  trace: remove tracing_pipe_buf_ops
  pipe: merge anon_pipe_buf*_ops
  fs: simplify do_splice_from
  fs: simplify do_splice_to
2020-06-03 15:52:19 -07:00
Linus Torvalds
039aeb9deb ARM:
- Move the arch-specific code into arch/arm64/kvm
 - Start the post-32bit cleanup
 - Cherry-pick a few non-invasive pre-NV patches
 
 x86:
 - Rework of TLB flushing
 - Rework of event injection, especially with respect to nested virtualization
 - Nested AMD event injection facelift, building on the rework of generic code
 and fixing a lot of corner cases
 - Nested AMD live migration support
 - Optimization for TSC deadline MSR writes and IPIs
 - Various cleanups
 - Asynchronous page fault cleanups (from tglx, common topic branch with tip tree)
 - Interrupt-based delivery of asynchronous "page ready" events (host side)
 - Hyper-V MSRs and hypercalls for guest debugging
 - VMX preemption timer fixes
 
 s390:
 - Cleanups
 
 Generic:
 - switch vCPU thread wakeup from swait to rcuwait
 
 The other architectures, and the guest side of the asynchronous page fault
 work, will come next week.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl7VJcYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPf6QgAq4wU5wdd1lTGz/i3DIhNVJNJgJlp
 ozLzRdMaJbdbn5RpAK6PEBd9+pt3+UlojpFB3gpJh2Nazv2OzV4yLQgXXXyyMEx1
 5Hg7b4UCJYDrbkCiegNRv7f/4FWDkQ9dx++RZITIbxeskBBCEI+I7GnmZhGWzuC4
 7kj4ytuKAySF2OEJu0VQF6u0CvrNYfYbQIRKBXjtOwuRK4Q6L63FGMJpYo159MBQ
 asg3B1jB5TcuGZ9zrjL5LkuzaP4qZZHIRs+4kZsH9I6MODHGUxKonrkablfKxyKy
 CFK+iaHCuEXXty5K0VmWM3nrTfvpEjVjbMc7e1QGBQ5oXsDM0pqn84syRg==
 =v7Wn
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:
   - Move the arch-specific code into arch/arm64/kvm

   - Start the post-32bit cleanup

   - Cherry-pick a few non-invasive pre-NV patches

  x86:
   - Rework of TLB flushing

   - Rework of event injection, especially with respect to nested
     virtualization

   - Nested AMD event injection facelift, building on the rework of
     generic code and fixing a lot of corner cases

   - Nested AMD live migration support

   - Optimization for TSC deadline MSR writes and IPIs

   - Various cleanups

   - Asynchronous page fault cleanups (from tglx, common topic branch
     with tip tree)

   - Interrupt-based delivery of asynchronous "page ready" events (host
     side)

   - Hyper-V MSRs and hypercalls for guest debugging

   - VMX preemption timer fixes

  s390:
   - Cleanups

  Generic:
   - switch vCPU thread wakeup from swait to rcuwait

  The other architectures, and the guest side of the asynchronous page
  fault work, will come next week"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
  KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
  KVM: check userspace_addr for all memslots
  KVM: selftests: update hyperv_cpuid with SynDBG tests
  x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
  x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
  x86/kvm/hyper-v: Add support for synthetic debugger interface
  x86/hyper-v: Add synthetic debugger definitions
  KVM: selftests: VMX preemption timer migration test
  KVM: nVMX: Fix VMX preemption timer migration
  x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
  KVM: x86/pmu: Support full width counting
  KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
  KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
  KVM: x86: acknowledgment mechanism for async pf page ready notifications
  KVM: x86: interrupt based APF 'page ready' event delivery
  KVM: introduce kvm_read_guest_offset_cached()
  KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
  KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
  Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
  KVM: VMX: Replace zero-length array with flexible-array
  ...
2020-06-03 15:13:47 -07:00
Linus Torvalds
f1e455352b kgdb patches for 5.8-rc1
By far the biggest change in this cycle are the changes that allow much
 earlier debug of systems that are hooked up via UART by taking advantage
 of the earlycon framework to implement the kgdb I/O hooks before handing
 over to the regular polling I/O drivers once they are available. When
 discussing Doug's work we also found and fixed an broken
 raw_smp_processor_id() sequence in in_dbg_master().
 
 Also included are a collection of much smaller fixes and tweaks: a
 couple of tweaks to ged rid of doc gen or coccicheck warnings, future
 proof some internal calculations that made implicit power-of-2
 assumptions and eliminate some rather weird handling of magic
 environment variables in kdb.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl7WfPsACgkQfOMlXTn3
 iKGhvBAAmalPhPvJ74djkSfSuz+fNVgjer5wKGQNhz4lSd+0W3lCkY8T2fkUIpL5
 jR3Q0gzJSA2WMSA7RrIwegDt0kCiQI0rtRKDkQxo33HBVSLlh2p5oXg7P5lQ4uOi
 QZyPI176V1KncFZjPKK2HzhTjoPNlx8GqVys6PBQETvTvxKR3f9qoq5qOKl/f9kQ
 Q4Dzb/npl6/XGJnQfdnkRcrXXtlK08yRxfXQyBEv0X6U9PUe1xmEZb1i9WBrrOYv
 u6N94fy2z6vqRgnbv4F6FTiQEHR1VFW2nPGpJ6GFv3KGFpT4QSWuyqTjm1Biee2y
 Gjn5ACAhW6tdPL+tCK3MRNGih7MaKoR01SnXz5D4T9V1zFTOhW7vyw+t3zoLfR7R
 fJoymQWKyfWbtj0Do8POiF31V+hvGVuqhzG/lTpnynSRJL38x4il6sFmtuRxMW+8
 vyxaetrPX+omf+fq1ueYTJS5Y5bl1Zp3avajD3VPXq2Vq2m4zl++AOlzTOJDF5A+
 P9RbwfWJh5Tm3VdCCWv849IDCK3R15DjoNLsuJkNRzqAYrJMVjA/QWyIAT14KR3z
 Nx3ix/QVKFkNnP5g1N38i2AvWRWZ/QuAmAFRgsmgnYPapeeX4EPtgdmqnloV9AAx
 CgO7KgUJF4LSIKTfoeWNJ4mpgSVR8zxkOR9w6DX0EQHDbfwlx8o=
 =uLAB
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "By far the biggest change in this cycle are the changes that allow
  much earlier debug of systems that are hooked up via UART by taking
  advantage of the earlycon framework to implement the kgdb I/O hooks
  before handing over to the regular polling I/O drivers once they are
  available. When discussing Doug's work we also found and fixed an
  broken raw_smp_processor_id() sequence in in_dbg_master().

  Also included are a collection of much smaller fixes and tweaks: a
  couple of tweaks to ged rid of doc gen or coccicheck warnings, future
  proof some internal calculations that made implicit power-of-2
  assumptions and eliminate some rather weird handling of magic
  environment variables in kdb"

* tag 'kgdb-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Remove the misfeature 'KDBFLAGS'
  kdb: Cleanup math with KDB_CMD_HISTORY_COUNT
  serial: amba-pl011: Support kgdboc_earlycon
  serial: 8250_early: Support kgdboc_earlycon
  serial: qcom_geni_serial: Support kgdboc_earlycon
  serial: kgdboc: Allow earlycon initialization to be deferred
  Documentation: kgdboc: Document new kgdboc_earlycon parameter
  kgdb: Don't call the deinit under spinlock
  kgdboc: Disable all the early code when kgdboc is a module
  kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles
  kgdboc: Remove useless #ifdef CONFIG_KGDB_SERIAL_CONSOLE in kgdboc
  kgdb: Prevent infinite recursive entries to the debugger
  kgdb: Delay "kgdbwait" to dbg_late_init() by default
  kgdboc: Use a platform device to handle tty drivers showing up late
  Revert "kgdboc: disable the console lock when in kgdb"
  kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
  kgdb: Return true in kgdb_nmi_poll_knock()
  kgdb: Drop malformed kernel doc comment
  kgdb: Fix spurious true from in_dbg_master()
2020-06-03 14:57:03 -07:00
Al Viro
b7e4b65f3f bpf: make bpf_check_uarg_tail_zero() use check_zeroed_user()
... rather than open-coding it, and badly, at that.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-06-03 16:59:45 -04:00
Linus Torvalds
44e40e96b5 Merge branch 'parisc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parsic updates from Helge Deller:
 "Enable the sysctl file interface for panic_on_stackoverflow for
  parisc, a warning fix and a bunch of documentation updates since the
  parisc website is now at https://parisc.wiki.kernel.org"

* 'parisc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: MAINTAINERS: Update references to parisc website
  parisc: module: Update references to parisc website
  parisc: hardware: Update references to parisc website
  parisc: firmware: Update references to parisc website
  parisc: Kconfig: Update references to parisc website
  parisc: add sysctl file interface panic_on_stackoverflow
  parisc: use -fno-strict-aliasing for decompressor
  parisc: suppress error messages for 'make clean'
2020-06-03 13:45:21 -07:00
Linus Torvalds
e7c93cbfe9 threads-v5.8
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXtYhfgAKCRCRxhvAZXjc
 oghSAP9uVX3vxYtEtNvu9WtEn1uYZcSKZoF1YrcgY7UfSmna0gEAruzyZcai4CJL
 WKv+4aRq2oYk+hsqZDycAxIsEgWvNg8=
 =ZWj3
 -----END PGP SIGNATURE-----

Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread updates from Christian Brauner:
 "We have been discussing using pidfds to attach to namespaces for quite
  a while and the patches have in one form or another already existed
  for about a year. But I wanted to wait to see how the general api
  would be received and adopted.

  This contains the changes to make it possible to use pidfds to attach
  to the namespaces of a process, i.e. they can be passed as the first
  argument to the setns() syscall.

  When only a single namespace type is specified the semantics are
  equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
  equals setns(pidfd, CLONE_NEWNET).

  However, when a pidfd is passed, multiple namespace flags can be
  specified in the second setns() argument and setns() will attach the
  caller to all the specified namespaces all at once or to none of them.

  Specifying 0 is not valid together with a pidfd. Here are just two
  obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

  Allowing to also attach subsets of namespaces supports various
  use-cases where callers setns to a subset of namespaces to retain
  privilege, perform an action and then re-attach another subset of
  namespaces.

  Apart from significantly reducing the number of syscalls needed to
  attach to all currently supported namespaces (eight "open+setns"
  sequences vs just a single "setns()"), this also allows atomic setns
  to a set of namespaces, i.e. either attaching to all namespaces
  succeeds or we fail without having changed anything.

  This is centered around a new internal struct nsset which holds all
  information necessary for a task to switch to a new set of namespaces
  atomically. Fwiw, with this change a pidfd becomes the only token
  needed to interact with a container. I'm expecting this to be
  picked-up by util-linux for nsenter rather soon.

  Associated with this change is a shiny new test-suite dedicated to
  setns() (for pidfds and nsfds alike)"

* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests/pidfd: add pidfd setns tests
  nsproxy: attach to namespaces via pidfds
  nsproxy: add struct nsset
2020-06-03 13:12:57 -07:00
Linus Torvalds
d479c5a191 The changes in this cycle are:
- Optimize the task wakeup CPU selection logic, to improve scalability and
    reduce wakeup latency spikes
 
  - PELT enhancements
 
  - CFS bandwidth handling fixes
 
  - Optimize the wakeup path by remove rq->wake_list and replacing it with ->ttwu_pending
 
  - Optimize IPI cross-calls by making flush_smp_call_function_queue()
    process sync callbacks first.
 
  - Misc fixes and enhancements.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7WPL0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i0ThAAs0fbvMzNJ5SWFdwOQ4KZIlA+Im4dEBMK
 sx/XAZqa/hGxvkm1jS0RDVQl1V1JdOlru5UF4C42ctnAFGtBBHDriO5rn9oCpkSw
 DAoLc4eZqzldIXN6sDZ0xMtC14Eu15UAP40OyM4qxBc4GqGlOnnale6Vhn+n+pLQ
 jAuZlMJIkmmzeA6cuvtultevrVh+QUqJ/5oNUANlTER4OM48umjr5rNTOb8cIW53
 9K3vbS3nmqSvJuIyqfRFoMy5GFM6+Jj2+nYuq8aTuYLEtF4qqWzttS3wBzC9699g
 XYRKILkCK8ZP4RB5Ps/DIKj6maZGZoICBxTJEkIgXujJlxlKKTD3mddk+0LBXChW
 Ijznanxn67akoAFpqi/Dnkhieg7cUrE9v1OPRS2J0xy550synSPFcSgOK3viizga
 iqbjptY4scUWkCwHQNjABerxc7MWzrwbIrRt+uNvCaqJLweUh0GnEcV5va8R+4I8
 K20XwOdrzuPLo5KdDWA/BKOEv49guHZDvoykzlwMlR3gFfwHS/UsjzmSQIWK3gZG
 9OMn8ibO2f1OzhRcEpDLFzp7IIj6NJmPFVSW+7xHyL9/vTveUx3ZXPLteb2qxJVP
 BYPsduVx8YeGRBlLya0PJriB23ajQr0lnHWo15g0uR9o/0Ds1ephcymiF3QJmCaA
 To3CyIuQN8M=
 =C2OP
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "The changes in this cycle are:

   - Optimize the task wakeup CPU selection logic, to improve
     scalability and reduce wakeup latency spikes

   - PELT enhancements

   - CFS bandwidth handling fixes

   - Optimize the wakeup path by remove rq->wake_list and replacing it
     with ->ttwu_pending

   - Optimize IPI cross-calls by making flush_smp_call_function_queue()
     process sync callbacks first.

   - Misc fixes and enhancements"

* tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
  sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
  sched: Replace rq::wake_list
  sched: Add rq::ttwu_pending
  irq_work, smp: Allow irq_work on call_single_queue
  smp: Optimize send_call_function_single_ipi()
  smp: Move irq_work_run() out of flush_smp_call_function_queue()
  smp: Optimize flush_smp_call_function_queue()
  sched: Fix smp_call_function_single_async() usage for ILB
  sched/core: Offload wakee task activation if it the wakee is descheduling
  sched/core: Optimize ttwu() spinning on p->on_cpu
  sched: Defend cfs and rt bandwidth quota against overflow
  sched/cpuacct: Fix charge cpuacct.usage_sys
  sched/fair: Replace zero-length array with flexible-array
  sched/pelt: Sync util/runnable_sum with PELT window when propagating
  sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
  sched/fair: Optimize enqueue_task_fair()
  sched: Make scheduler_ipi inline
  sched: Clean up scheduler_ipi()
  sched/core: Simplify sched_init()
  ...
2020-06-03 13:06:42 -07:00
Linus Torvalds
f6606d0c00 The generic interrupt departement provides:
- Cleanup of the irq_domain API
   - Overhaul of the interrupt chip simulator
   - The usual pile of new interrupt chip drivers
   - Cleanups, improvements and fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7WDy0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoU8gD/9kxzl/q7oZ0lnsbtORSZ68hZ+n9rwH
 XkLXZVCqOpXgU0Wbe8TLmSUQfxJeMBhTqI6gxkyLWFduRKBP30ChKHbAcvRV9Xj4
 s3sdkypiOgilu3skGRm0cUXDPDrACJW0QzZK0TIxVEou5j51opeM+tWb+pGbr7ao
 UaKARUsVzcNhYs0K3LoM2NrfXL2NtjPkGM6CE4sx7CaEx73dJyJZHgakleLEAGAo
 4ok7A71eB4DHRkeiwAPRLUzaBwGFJT86034+FvFCqwouW0FV40TpKi94Mf+a9IQv
 H8HFIUwfTCvfsouzmHYUfqalmShUMGqNOU4qbHeKzf2Nf/OASTOPiQ3ejWEWMkr0
 IHSjVzIeE+66RyMe2gkEmQnehrWeR9Djc7yZheo03IuO2oP3YQlpXvyA95hN7T1J
 kZ2MVgjNaB7bxQ6eJyA6/xQxv4EJI0a/jQ0ojd1jnomCyRB7k40s9GU2etDmgr08
 lA5z8+n9w1lbPWhaZl0BITE2mNoz/b+W+MTiRgN+CqTEXWK7Ee/otoYnZf5nkJpk
 0ixxwMvREFoO9JhnxlfG1knQXmL2mLx5I3TGrz7rpx9Ycgoer1hmDWCdMFQaAADL
 aVVnw3w8G8cwfqBkBYFHW6hq6228uicutcr22/klXsHnN6OYTHxVRvMb8HP3R7AJ
 ExGk4kquk9DhCQ==
 =Pi5h
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "The generic interrupt departement provides:

   - Cleanup of the irq_domain API

   - Overhaul of the interrupt chip simulator

   - The usual pile of new interrupt chip drivers

   - Cleanups, improvements and fixes all over the place"

* tag 'irq-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  irqchip: Fix "Loongson HyperTransport Vector support" driver build on all non-MIPS platforms
  dt-bindings: interrupt-controller: Add Loongson PCH MSI
  irqchip: Add Loongson PCH MSI controller
  dt-bindings: interrupt-controller: Add Loongson PCH PIC
  irqchip: Add Loongson PCH PIC controller
  dt-bindings: interrupt-controller: Add Loongson HTVEC
  irqchip: Add Loongson HyperTransport Vector support
  genirq: Check irq_data_get_irq_chip() return value before use
  irqchip/sifive-plic: Improve boot prints for multiple PLIC instances
  irqchip/sifive-plic: Setup cpuhp once after boot CPU handler is present
  irqchip/sifive-plic: Set default irq affinity in plic_irqdomain_map()
  irqchip/gic-v2, v3: Drop extra IRQ_NOAUTOEN setting for (E)PPIs
  irqdomain: Allow software nodes for IRQ domain creation
  irqdomain: Get rid of special treatment for ACPI in __irq_domain_add()
  irqdomain: Make __irq_domain_add() less OF-dependent
  iio: dummy_evgen: Fix use after free on error in iio_dummy_evgen_create()
  irqchip/gic-v3-its: Balance initial LPI affinity across CPUs
  irqchip/gic-v3-its: Track LPI distribution on a per CPU basis
  genirq/irq_sim: Simplify the API
  irqdomain: Make irq_domain_reset_irq_data() available to  non-hierarchical users
  ...
2020-06-03 10:05:11 -07:00
Ingo Molnar
5fdeefa053 Merge branch 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull RCU fix from Paul E. McKenney.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-06-03 11:14:37 +02:00
Linus Torvalds
9d99b1647f audit/stable-5.8 PR 20200601
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl7VnKEUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMbHA/+PQmrPdzPvkLAjjf1y3LXvyEIAXIQ
 h2r8SxHa7iGyF6vVPz+ya7ux0KAm8wCVdfkokWG5jxjwK7pysS6gx9JzBVK7dbhD
 FsKBSoq9+to9fYlaCyX7vn85C7kK5oGrwS/ECos0BHBpij8ukLgvPQu+PDs7d4xW
 1X2Nrgqnc7M4L8ayzXTQX0fDWcOkapzaN86+R+Lavb4hO/FownaYbuCFn+1mdzux
 ZNBpt3/y1pM6vi5YBkI1rkauBCmkl/YSX/mf/EwDNlQ0XmcadGQ6z7iwjyiE826g
 etCHWD3cgQH7Zzz6CxBNX8Xbq0nIQueHHiFYpVyy9lf4xleFvnfFDebrs8Q9TB6G
 jTWU8okioUKPZyRDaRuIAmCf/LBQRsMkIYTU3w6J0ZqsBycTw3NXPiQArmlxZESM
 HquxWpKoZytRiw581hiSGKNqY+R3FvA+Jroc/7bWfNOE3IdFxegvCsC3giKJf1rY
 AlQitehql9a5jp7A57+477WRYOygYRnd+ntMD5KqR90QSIcQXeg0/lFKhco+zc2p
 bXbWLE+aaOTGCeC+3Eow3T7FMWmrIn6ccKgM84+WT7YQYtRqUYu3RIZbnlYXN7uH
 8xGXT6ccPcEwIjgyF87J0KyGhrbT1N91Jd2jMJkEry9OLAn/yr+pUBQtAa456MMi
 JYevS4atZaUqgvw=
 =iLfC
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Summary of the significant patches:

   - Record information about binds/unbinds to the audit multicast
     socket. This helps identify which processes have/had access to the
     information in the audit stream.

   - Cleanup and add some additional information to the netfilter
     configuration events collected by audit.

   - Fix some of the audit error handling code so we don't leak network
     namespace references"

* tag 'audit-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: add subj creds to NETFILTER_CFG record to
  audit: Replace zero-length array with flexible-array
  audit: make symbol 'audit_nfcfgs' static
  netfilter: add audit table unregister actions
  audit: tidy and extend netfilter_cfg x_tables
  audit: log audit netlink multicast bind and unbind
  audit: fix a net reference leak in audit_list_rules_send()
  audit: fix a net reference leak in audit_send_reply()
2020-06-02 17:13:37 -07:00
Linus Torvalds
750a02ab8d for-5.8/block-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VOwMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoR7EADAlz3TCkb4wwuHytTBDrm6gVDdsJ9zUfQW
 Cl2ASLtufA8PWZUCEI3vhFyOe6P5e+ZZ0O2HjljSevmHyogCaRYXFYVfbWKcQKuk
 AcxiTgnYNevh8KbGLfJY1WL4eXsY+C3QUGivg35cCgrx+kr9oDaHMeqA9Tm1plyM
 FSprDBoSmHPqRxiV/1gnr8uXLX6K7i/fHzwmKgySMhavum7Ma8W3wdAGebzvQwrO
 SbFSuJVgz06e4B1Fzr/wSvVNUE/qW/KqfGuQKIp7VQFIywbgG7TgRMHjE1FSnpnh
 gn+BfL+O5gc0sTvcOTGOE0SRWWwLx961WNg8Azq08l3fzsxLA6h8/AnoDf3i+QMA
 rHmLpWZIic2xPSvjaFHX3/V9ITyGYeAMpAR77EL+4ivWrKv5JrBhnSLDt1fKILdg
 5elxm7RDI+C4nCP4xuTlVCy5gCd6gwjgytKj+NUWhNq1WiGAD0B54SSiV+SbCSH6
 Om2f5trcxz8E4pqWcf0k3LjFapVKRNV8v/+TmVkCdRPBl3y9P0h0wFTkkcEquqnJ
 y7Yq6efdWviRCnX5w/r/yj0qBuk4xo5hMVsPmlthCWtnBm+xZQ6LwMRcq4HQgZgR
 2SYNscZ3OFMekHssH7DvY4DAy1J+n83ims+KzbScbLg2zCZjh/scQuv38R5Eh9WZ
 rCS8c+T7Ig==
 =HYf4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Core block changes that have been queued up for this release:

   - Remove dead blk-throttle and blk-wbt code (Guoqing)

   - Include pid in blktrace note traces (Jan)

   - Don't spew I/O errors on wouldblock termination (me)

   - Zone append addition (Johannes, Keith, Damien)

   - IO accounting improvements (Konstantin, Christoph)

   - blk-mq hardware map update improvements (Ming)

   - Scheduler dispatch improvement (Salman)

   - Inline block encryption support (Satya)

   - Request map fixes and improvements (Weiping)

   - blk-iocost tweaks (Tejun)

   - Fix for timeout failing with error injection (Keith)

   - Queue re-run fixes (Douglas)

   - CPU hotplug improvements (Christoph)

   - Queue entry/exit improvements (Christoph)

   - Move DMA drain handling to the few drivers that use it (Christoph)

   - Partition handling cleanups (Christoph)"

* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
  block: mark bio_wouldblock_error() bio with BIO_QUIET
  blk-wbt: rename __wbt_update_limits to wbt_update_limits
  blk-wbt: remove wbt_update_limits
  blk-throttle: remove tg_drain_bios
  blk-throttle: remove blk_throtl_drain
  null_blk: force complete for timeout request
  blk-mq: drain I/O when all CPUs in a hctx are offline
  blk-mq: add blk_mq_all_tag_iter
  blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  blk-mq: use BLK_MQ_NO_TAG in more places
  blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  blk-mq: move more request initialization to blk_mq_rq_ctx_init
  blk-mq: simplify the blk_mq_get_request calling convention
  blk-mq: remove the bio argument to ->prepare_request
  nvme: force complete cancelled requests
  blk-mq: blk-mq: provide forced completion method
  block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
  block: blk-crypto-fallback: remove redundant initialization of variable err
  block: reduce part_stat_lock() scope
  block: use __this_cpu_add() instead of access by smp_processor_id()
  ...
2020-06-02 15:29:19 -07:00
Linus Torvalds
355ba37d75 Power management updates for 5.8-rc1
- Rework the system-wide PM driver flags to make them easier to
    understand and use and update their documentation (Rafael Wysocki,
    Alan Stern).
 
  - Allow cpuidle governors to be switched at run time regardless of
    the kernel configuration and update the related documentation
    accordingly (Hanjun Guo).
 
  - Improve the resume device handling in the user space hibernarion
    interface code (Domenico Andreoli).
 
  - Document the intel-speed-select sysfs interface (Srinivas
    Pandruvada).
 
  - Make the ACPI code handing suspend to idle print more debug
    messages to help diagnose issues with it (Rafael Wysocki).
 
  - Fix a helper routine in the cpufreq core and correct a typo in
    the struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
    Wenhu).
 
  - Update cpufreq drivers:
 
    * Make the intel_pstate driver start in the passive mode by
      default on systems without HWP (Rafael Wysocki).
 
    * Add i.MX7ULP support to the imx-cpufreq-dt driver and add
      i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).
 
    * Convert the qoriq cpufreq driver to a platform one, make the
      platform code create a suitable device object for it and add
      platform dependencies to it (Mian Yousaf Kaukab, Geert
      Uytterhoeven).
 
    * Fix wrong compatible binding in the qcom driver (Ansuel Smith).
 
    * Build the omap driver by default for ARCH_OMAP2PLUS (Anders
      Roxell).
 
    * Add r8a7742 SoC support to the dt cpufreq driver (Lad Prabhakar).
 
  - Update cpuidle core and drivers:
 
    * Fix three reference count leaks in error code paths in the
      cpuidle core (Qiushi Wu).
 
    * Convert Qualcomm SPM to a generic cpuidle driver (Stephan
      Gerhold).
 
    * Fix up the execution order when entering a domain idle state in
      the PSCI driver (Ulf Hansson).
 
  - Fix a reference counting issue related to clock management and
    clean up two oddities in the PM-runtime framework (Rafael Wysocki,
    Andy Shevchenko).
 
  - Add ElkhartLake support to the Intel RAPL power capping driver
    and remove an unused local MSR definition from it (Jacob Pan,
    Sumeet Pawnikar).
 
  - Update devfreq core and drivers:
 
    * Replace strncpy() with strscpy() in the devfreq core and use
      lockdep asserts instead of manual checks for a locked mutex in
      it (Dmitry Osipenko, Krzysztof Kozlowski).
 
    * Add a generic imx bus scaling driver and make it register an
      interconnect device (Leonard Crestez, Gustavo A. R. Silva).
 
    * Make the cpufreq notifier in the tegra30 driver take boosting
      into account and delete an unuseful error message from that
      driver (Dmitry Osipenko, Markus Elfring).
 
  - Remove unneeded semicolon from the cpupower code (Zou Wei).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl7VGjwSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx46gP/jGAXlddFEQswi6qUT3Cff0A9mb8CdcX
 dyKrjX4xxo/wtBIAwSN4achxrgse//ayo2dYTzWRDd31W9Azbv+5F+46XsDRz4hL
 pH29u/E66NMtFWnHCmt78NEJn0FzSa0YBC43ZzwFwKktCK9skYIpGN2z6iuXUBSX
 Q5GHqop3zvDsdKQFBGL62xvUw/AmOTPG7ohIZvqWBN2mbOqEqMcoFHT+aUF/NbLj
 +i14dvTH767eDZGRVASmXWQyljjaRWm+SIw4+m8zT1D1Y3d5IFObuMN+9RQl1Tif
 BYjkgJ2oDDMhCJLW7TBuJB+g7exiyaSQds3nMr2ZR+eZbJipICjU4eehNEKIUopU
 DM17tHQfnwZfS/7YbCx3vYQwLkNq37AJyXS9uqCAIFM+0n4xN4/mIVmgWYISLDTs
 1v9olFxtwMRNpjGGQWPJAO7ebB8Zz9qhQv7pIkSQEfwp93/SzvlVf4vvruTeFN9J
 qqG60cDumXWAm+s43eQHJNn5nOd5ocWv0FBpo/cxqKbzxFVWwdB42Cm0SY+rK2ID
 uHdnc2DJcK2c78UVbz3Cmk4272foJt2zxchqjFXXAZPLrOsFfzmti4B28VxGxjmP
 LG3MhH5sdbF4yl/1aSC1Bnrt+PV9Lus6ut/VKhjwIpw8cqiXgpwSbMoDoaBd9UMQ
 ubGz2rplGAtB
 =APdj
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These rework the system-wide PM driver flags, make runtime switching
  of cpuidle governors easier, improve the user space hibernation
  interface code, add intel-speed-select interface documentation, add
  more debug messages to the ACPI code handling suspend to idle, update
  the cpufreq core and drivers, fix a minor issue in the cpuidle core
  and update two cpuidle drivers, improve the PM-runtime framework,
  update the Intel RAPL power capping driver, update devfreq core and
  drivers, and clean up the cpupower utility.

  Specifics:

   - Rework the system-wide PM driver flags to make them easier to
     understand and use and update their documentation (Rafael Wysocki,
     Alan Stern).

   - Allow cpuidle governors to be switched at run time regardless of
     the kernel configuration and update the related documentation
     accordingly (Hanjun Guo).

   - Improve the resume device handling in the user space hibernarion
     interface code (Domenico Andreoli).

   - Document the intel-speed-select sysfs interface (Srinivas
     Pandruvada).

   - Make the ACPI code handing suspend to idle print more debug
     messages to help diagnose issues with it (Rafael Wysocki).

   - Fix a helper routine in the cpufreq core and correct a typo in the
     struct cpufreq_driver kerneldoc comment (Rafael Wysocki, Wang
     Wenhu).

   - Update cpufreq drivers:

      - Make the intel_pstate driver start in the passive mode by
        default on systems without HWP (Rafael Wysocki).

      - Add i.MX7ULP support to the imx-cpufreq-dt driver and add
        i.MX7ULP to the cpufreq-dt-platdev blacklist (Peng Fan).

      - Convert the qoriq cpufreq driver to a platform one, make the
        platform code create a suitable device object for it and add
        platform dependencies to it (Mian Yousaf Kaukab, Geert
        Uytterhoeven).

      - Fix wrong compatible binding in the qcom driver (Ansuel Smith).

      - Build the omap driver by default for ARCH_OMAP2PLUS (Anders
        Roxell).

      - Add r8a7742 SoC support to the dt cpufreq driver (Lad
        Prabhakar).

   - Update cpuidle core and drivers:

      - Fix three reference count leaks in error code paths in the
        cpuidle core (Qiushi Wu).

      - Convert Qualcomm SPM to a generic cpuidle driver (Stephan
        Gerhold).

      - Fix up the execution order when entering a domain idle state in
        the PSCI driver (Ulf Hansson).

   - Fix a reference counting issue related to clock management and
     clean up two oddities in the PM-runtime framework (Rafael Wysocki,
     Andy Shevchenko).

   - Add ElkhartLake support to the Intel RAPL power capping driver and
     remove an unused local MSR definition from it (Jacob Pan, Sumeet
     Pawnikar).

   - Update devfreq core and drivers:

      - Replace strncpy() with strscpy() in the devfreq core and use
        lockdep asserts instead of manual checks for a locked mutex in
        it (Dmitry Osipenko, Krzysztof Kozlowski).

      - Add a generic imx bus scaling driver and make it register an
        interconnect device (Leonard Crestez, Gustavo A. R. Silva).

      - Make the cpufreq notifier in the tegra30 driver take boosting
        into account and delete an unuseful error message from that
        driver (Dmitry Osipenko, Markus Elfring).

   - Remove unneeded semicolon from the cpupower code (Zou Wei)"

* tag 'pm-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (51 commits)
  cpuidle: Fix three reference count leaks
  PM: runtime: Replace pm_runtime_callbacks_present()
  PM / devfreq: Use lockdep asserts instead of manual checks for locked mutex
  PM / devfreq: imx-bus: Fix inconsistent IS_ERR and PTR_ERR
  PM / devfreq: Replace strncpy with strscpy
  PM / devfreq: imx: Register interconnect device
  PM / devfreq: Add generic imx bus scaling driver
  PM / devfreq: tegra30: Delete an error message in tegra_devfreq_probe()
  PM / devfreq: tegra30: Make CPUFreq notifier to take into account boosting
  PM: hibernate: Restrict writes to the resume device
  PM: runtime: clk: Fix clk_pm_runtime_get() error path
  cpuidle: Convert Qualcomm SPM driver to a generic CPUidle driver
  ACPI: EC: PM: s2idle: Extend GPE dispatching debug message
  ACPI: PM: s2idle: Print type of wakeup debug messages
  powercap: RAPL: remove unused local MSR define
  PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend()
  Documentation: admin-guide: pm: Document intel-speed-select
  PM: hibernate: Split off snapshot dev option
  PM: hibernate: Incorporate concurrency handling
  Documentation: ABI: make current_governer_ro as a candidate for removal
  ...
2020-06-02 13:17:23 -07:00
Linus Torvalds
94709049fb Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A few little subsystems and a start of a lot of MM patches.

  Subsystems affected by this patch series: squashfs, ocfs2, parisc,
  vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
  swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
  kasan: move kasan_report() into report.c
  mm/mm_init.c: report kasan-tag information stored in page->flags
  ubsan: entirely disable alignment checks under UBSAN_TRAP
  kasan: fix clang compilation warning due to stack protector
  x86/mm: remove vmalloc faulting
  mm: remove vmalloc_sync_(un)mappings()
  x86/mm/32: implement arch_sync_kernel_mappings()
  x86/mm/64: implement arch_sync_kernel_mappings()
  mm/ioremap: track which page-table levels were modified
  mm/vmalloc: track which page-table levels were modified
  mm: add functions to track page directory modifications
  s390: use __vmalloc_node in stack_alloc
  powerpc: use __vmalloc_node in alloc_vm_stack
  arm64: use __vmalloc_node in arch_alloc_vmap_stack
  mm: remove vmalloc_user_node_flags
  mm: switch the test_vmalloc module to use __vmalloc_node
  mm: remove __vmalloc_node_flags_caller
  mm: remove both instances of __vmalloc_node_flags
  mm: remove the prot argument to __vmalloc_node
  mm: remove the pgprot argument to __vmalloc
  ...
2020-06-02 12:21:36 -07:00
Joerg Roedel
73f693c3a7 mm: remove vmalloc_sync_(un)mappings()
These functions are not needed anymore because the vmalloc and ioremap
mappings are now synchronized when they are created or torn down.

Remove all callers and function definitions.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200515140023.25469-7-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:12 -07:00
Christoph Hellwig
041de93ff8 mm: remove vmalloc_user_node_flags
Open code it in __bpf_map_area_alloc, which is the only caller.  Also
clean up __bpf_map_area_alloc to have a single vmalloc call with slightly
different flags instead of the current two different calls.

For this to compile for the nommu case add a __vmalloc_node_range stub to
nommu.c.

[akpm@linux-foundation.org: fix nommu.c build]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200414131348.444715-27-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Christoph Hellwig
2b9059489c mm: remove __vmalloc_node_flags_caller
Just use __vmalloc_node instead which gets and extra argument.  To be able
to to use __vmalloc_node in all caller make it available outside of
vmalloc and implement it in nommu.c.

[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200414131348.444715-25-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Christoph Hellwig
88dca4ca5a mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Christoph Hellwig
515e5b6d90 dma-mapping: use vmap insted of reimplementing it
Replace the open coded instance of vmap with the actual function.  In
the non-contiguous (IOMMU) case this requires an extra find_vm_area,
but given that this isn't a fast path function that is a small price
to pay.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-6-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:10 -07:00
NeilBrown
a37b0715dd mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.  The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks.  I don't know what is wanted for realtime.

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Kefeng Wang
b3e2d20973 rcuperf: Fix printk format warning
Using "%zu" to fix following warning,
kernel/rcu/rcuperf.c: In function ‘kfree_perf_init’:
include/linux/kern_levels.h:5:18: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘unsigned int’ [-Wformat=]

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-02 08:41:37 -07:00
Wei Li
c893de12e1 kdb: Remove the misfeature 'KDBFLAGS'
Currently, 'KDBFLAGS' is an internal variable of kdb, it is combined
by 'KDBDEBUG' and state flags. It will be shown only when 'KDBDEBUG'
is set, and the user can define an environment variable named 'KDBFLAGS'
too. These are puzzling indeed.

After communication with Daniel, it seems that 'KDBFLAGS' is a misfeature.
So let's replace 'KDBFLAGS' with 'KDBDEBUG' to just show the value we
wrote into. After this modification, we can use `md4c1 kdb_flags` instead,
to observe the state flags.

Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Wei Li <liwei391@huawei.com>
Link: https://lore.kernel.org/r/20200521072125.21103-1-liwei391@huawei.com
[daniel.thompson@linaro.org: Make kdb_flags unsigned to avoid arithmetic
right shift]
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-02 15:15:46 +01:00
Douglas Anderson
1b310030bb kdb: Cleanup math with KDB_CMD_HISTORY_COUNT
From code inspection the math in handle_ctrl_cmd() looks super sketchy
because it subjects -1 from cmdptr and then does a "%
KDB_CMD_HISTORY_COUNT".  It turns out that this code works because
"cmdptr" is unsigned and KDB_CMD_HISTORY_COUNT is a nice power of 2.
Let's make this a little less sketchy.

This patch should be a no-op.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200507161125.1.I2cce9ac66e141230c3644b8174b6c15d4e769232@changeid
Reviewed-by: Sumit Garg <sumit.garg@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-02 15:15:46 +01:00
Douglas Anderson
b1350132fe kgdb: Don't call the deinit under spinlock
When I combined kgdboc_earlycon with an inflight patch titled ("soc:
qcom-geni-se: Add interconnect support to fix earlycon crash") [1]
things went boom.  Specifically I got a crash during the transition
between kgdboc_earlycon and the main kgdboc that looked like this:

Call trace:
 __schedule_bug+0x68/0x6c
 __schedule+0x75c/0x924
 schedule+0x8c/0xbc
 schedule_timeout+0x9c/0xfc
 do_wait_for_common+0xd0/0x160
 wait_for_completion_timeout+0x54/0x74
 rpmh_write_batch+0x1fc/0x23c
 qcom_icc_bcm_voter_commit+0x1b4/0x388
 qcom_icc_set+0x2c/0x3c
 apply_constraints+0x5c/0x98
 icc_set_bw+0x204/0x3bc
 icc_put+0x30/0xf8
 geni_remove_earlycon_icc_vote+0x6c/0x9c
 qcom_geni_serial_earlycon_exit+0x10/0x1c
 kgdboc_earlycon_deinit+0x38/0x58
 kgdb_register_io_module+0x11c/0x194
 configure_kgdboc+0x108/0x174
 kgdboc_probe+0x38/0x60
 platform_drv_probe+0x90/0xb0
 really_probe+0x130/0x2fc
 ...

The problem was that we were holding the "kgdb_registration_lock"
while calling into code that didn't expect to be called in spinlock
context.

Let's slightly defer when we call the deinit code so that it's not
done under spinlock.

NOTE: this does mean that the "deinit" call of the old kgdb IO module
is now made _after_ the init of the new IO module, but presumably
that's OK.

[1] https://lkml.kernel.org/r/1588919619-21355-3-git-send-email-akashast@codeaurora.org

Fixes: 220995622d ("kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200526142001.1.I523dc33f96589cb9956f5679976d402c8cda36fa@changeid
[daniel.thompson@linaro.org: Resolved merge issues by hand]
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-02 15:15:45 +01:00
Ingo Molnar
25de110d14 irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
Some SMP platforms don't have CONFIG_IRQ_WORK defined, resulting in a link
error at build time.

Define a stub and clean up the prototype definitions.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-06-02 12:34:45 +02:00
Linus Torvalds
8b39a57e96 Merge branch 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/coredump updates from Al Viro:
 "set_fs() removal in coredump-related area - mostly Christoph's
  stuff..."

* 'work.set_fs-exec' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  binfmt_elf_fdpic: remove the set_fs(KERNEL_DS) in elf_fdpic_core_dump
  binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump
  binfmt_elf: remove the set_fs in fill_siginfo_note
  signal: refactor copy_siginfo_to_user32
  powerpc/spufs: simplify spufs core dumping
  powerpc/spufs: stop using access_ok
  powerpc/spufs: fix copy_to_user while atomic
2020-06-01 16:21:46 -07:00
Linus Torvalds
4fdea5848b Merge branch 'uaccess.__put_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/__put-user updates from Al Viro:
 "Removal of __put_user() calls - misc patches that don't fit into any
  other series"

* 'uaccess.__put_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pcm_native: result of put_user() needs to be checked
  scsi_ioctl.c: switch SCSI_IOCTL_GET_IDLUN to copy_to_user()
  compat sysinfo(2): don't bother with field-by-field copyout
2020-06-01 16:17:04 -07:00
Linus Torvalds
e148a8f948 Merge branch 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess/readdir updates from Al Viro:
 "Finishing the conversion of readdir.c to unsafe_... API.

  This includes the uaccess_{read,write}_begin series by Christophe
  Leroy"

* 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  readdir.c: get rid of the last __put_user(), drop now-useless access_ok()
  readdir.c: get compat_filldir() more or less in sync with filldir()
  switch readdir(2) to unsafe_copy_dirent_name()
  drm/i915/gem: Replace user_access_begin by user_write_access_begin
  uaccess: Selectively open read or write user access
  uaccess: Add user_read_access_begin/end and user_write_access_begin/end
2020-06-01 16:11:38 -07:00
Linus Torvalds
b23c4771ff A fair amount of stuff this time around, dominated by yet another massive
set from Mauro toward the completion of the RST conversion.  I *really*
 hope we are getting close to the end of this.  Meanwhile, those patches
 reach pretty far afield to update document references around the tree;
 there should be no actual code changes there.  There will be, alas, more of
 the usual trivial merge conflicts.
 
 Beyond that we have more translations, improvements to the sphinx
 scripting, a number of additions to the sysctl documentation, and lots of
 fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl7VId8PHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yq/gH/iaDgirQZV6UZ2v9sfwQNYolNpf2sKAuOZjd
 bPFB7WJoMQbKwQEvYrAUL2+5zPOcLYuIfzyOfo1BV1py+EyKbACcKjI4AedxfJF7
 +NchmOBhlEqmEhzx2U08HRc4/8J223WG17fJRVsV3p+opJySexSFeQucfOciX5NR
 RUCxweWWyg/FgyqjkyMMTtsePqZPmcT5dWTlVXISlbWzcv5NFhuJXnSrw8Sfzcmm
 SJMzqItv3O+CabnKQ8kMLV2PozXTMfjeWH47ZUK0Y8/8PP9+cvqwFzZ0UDQJ1Xaz
 oyW/TqmunaXhfMsMFeFGSwtfgwRHvXdxkQdtwNHvo1dV4dzTvDw=
 =fDC/
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.8' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "A fair amount of stuff this time around, dominated by yet another
  massive set from Mauro toward the completion of the RST conversion. I
  *really* hope we are getting close to the end of this. Meanwhile,
  those patches reach pretty far afield to update document references
  around the tree; there should be no actual code changes there. There
  will be, alas, more of the usual trivial merge conflicts.

  Beyond that we have more translations, improvements to the sphinx
  scripting, a number of additions to the sysctl documentation, and lots
  of fixes"

* tag 'docs-5.8' of git://git.lwn.net/linux: (130 commits)
  Documentation: fixes to the maintainer-entry-profile template
  zswap: docs/vm: Fix typo accept_threshold_percent in zswap.rst
  tracing: Fix events.rst section numbering
  docs: acpi: fix old http link and improve document format
  docs: filesystems: add info about efivars content
  Documentation: LSM: Correct the basic LSM description
  mailmap: change email for Ricardo Ribalda
  docs: sysctl/kernel: document unaligned controls
  Documentation: admin-guide: update bug-hunting.rst
  docs: sysctl/kernel: document ngroups_max
  nvdimm: fixes to maintainter-entry-profile
  Documentation/features: Correct RISC-V kprobes support entry
  Documentation/features: Refresh the arch support status files
  Revert "docs: sysctl/kernel: document ngroups_max"
  docs: move locking-specific documents to locking/
  docs: move digsig docs to the security book
  docs: move the kref doc into the core-api book
  docs: add IRQ documentation at the core-api book
  docs: debugging-via-ohci1394.txt: add it to the core-api book
  docs: fix references for ipmi.rst file
  ...
2020-06-01 15:45:27 -07:00
Linus Torvalds
c2b0fc847f ARM updates for 5.8-rc1:
- remove a now unnecessary usage of the KERNEL_DS for
   sys_oabi_epoll_ctl()
 - update my email address in a number of drivers
 - decompressor EFI updates from Ard Biesheuvel
 - module unwind section handling updates
 - sparsemem Kconfig cleanups
 - make act_mm macro respect THREAD_SIZE
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEuNNh8scc2k/wOAE+9OeQG+StrGQFAl7VacAACgkQ9OeQG+St
 rGRHaA//Z8m+8LnSd+sqwrlEZYVj6IdPoihOhVZRsEhp4Fb09ZENOxL06W4lynHy
 tPcs4YAqLyp3Xmn+pk5NuH3SBoFGPOuUxseMpQKyT2ZnA126LB/3sW+xFPSbCayY
 gBHT7QhX6MxJEwxCxgvp2McOs6F55rYENcjozQ+DQMiNY5MTm0fKgGgbn1kpzglz
 7N2U7MR9ulXTCof3hZolQWBMOKa6LRldG7C3ajPITeOtk+vjyAPobqrkbzRDsIPV
 09j6BruFQoUbuyxtycNC0x+BDotrS/NN5OyhR07eJR5R0QNDW+qn8iqrkkVQUQsr
 mZpTR8CelzLL2+/1CDY2KrweY13eFbDoxiTVJl9aqCdlOsJKxwk1yv4HrEcpbBoK
 vtKwPDxPIKxFeJSCJX3xFjg9g6mRrBJ5CItPOThVgEqNt/dsbogqXlX4UhIjXzPs
 DBbeQ+EEZgNg7Ws/EwXIwtM8ZPc+bZZY8fskJd0gRCjbiCtstXXNjsHRd1vZ16KM
 yytpDxEIB7A+6lxcnV80VSCjD++A//kVThZ5kBl+ec1HOxRSyYOGIMGUMZhuyfE8
 f4xE3KVVsbqHGyh94C6tDLx73XgkmjfNx8YAgGRss+fQBoJbmwkJ0fDy4MhKlznD
 UnVcOXSjs7Iqih7R+icAtbIkbo1EUF5Mwu2I3SEZ/FOJmzEbbCY=
 =vMHE
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM updates from Russell King:

 - remove a now unnecessary usage of the KERNEL_DS for
   sys_oabi_epoll_ctl()

 - update my email address in a number of drivers

 - decompressor EFI updates from Ard Biesheuvel

 - module unwind section handling updates

 - sparsemem Kconfig cleanups

 - make act_mm macro respect THREAD_SIZE

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8980/1: Allow either FLATMEM or SPARSEMEM on the multiplatform build
  ARM: 8979/1: Remove redundant ARCH_SPARSEMEM_DEFAULT setting
  ARM: 8978/1: mm: make act_mm() respect THREAD_SIZE
  ARM: decompressor: run decompressor in place if loaded via UEFI
  ARM: decompressor: move GOT into .data for EFI enabled builds
  ARM: decompressor: defer loading of the contents of the LC0 structure
  ARM: decompressor: split off _edata and stack base into separate object
  ARM: decompressor: move headroom variable out of LC0
  ARM: 8976/1: module: allow arch overrides for .init section names
  ARM: 8975/1: module: fix handling of unwind init sections
  ARM: 8974/1: use SPARSMEM_STATIC when SPARSEMEM is enabled
  ARM: 8971/1: replace the sole use of a symbol with its definition
  ARM: 8969/1: decompressor: simplify libfdt builds
  Update rmk's email address in various drivers
  ARM: compat: remove KERNEL_DS usage in sys_oabi_epoll_ctl()
2020-06-01 15:36:32 -07:00
Jakub Sitnicki
0c047ecbb7 bpf, cgroup: Return ENOLINK for auto-detached links on update
Failure to update a bpf_link because it has been auto-detached by a dying
cgroup currently results in EINVAL error, even though the arguments passed
to bpf() syscall are not wrong.

bpf_links attaching to netns in this case will return ENOLINK, which
carries the message that the link is no longer attached to anything.

Change cgroup bpf_links to do the same to keep the uAPI errors consistent.

Fixes: 0c991ebc8c ("bpf: Implement bpf_prog replacement for an active bpf_cgroup_link")
Suggested-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-6-jakub@cloudflare.com
2020-06-01 15:21:03 -07:00
Jakub Sitnicki
7f045a49fe bpf: Add link-based BPF program attachment to network namespace
Extend bpf() syscall subcommands that operate on bpf_link, that is
LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO, to accept attach types tied to
network namespaces (only flow dissector at the moment).

Link-based and prog-based attachment can be used interchangeably, but only
one can exist at a time. Attempts to attach a link when a prog is already
attached directly, and the other way around, will be met with -EEXIST.
Attempts to detach a program when link exists result in -EINVAL.

Attachment of multiple links of same attach type to one netns is not
supported with the intention to lift the restriction when a use-case
presents itself. Because of that link create returns -E2BIG when trying to
create another netns link, when one already exists.

Link-based attachments to netns don't keep a netns alive by holding a ref
to it. Instead links get auto-detached from netns when the latter is being
destroyed, using a pernet pre_exit callback.

When auto-detached, link lives in defunct state as long there are open FDs
for it. -ENOLINK is returned if a user tries to update a defunct link.

Because bpf_link to netns doesn't hold a ref to struct net, special care is
taken when releasing, updating, or filling link info. The netns might be
getting torn down when any of these link operations are in progress. That
is why auto-detach and update/release/fill_info are synchronized by the
same mutex. Also, link ops have to always check if auto-detach has not
happened yet and if netns is still alive (refcnt > 0).

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-5-jakub@cloudflare.com
2020-06-01 15:21:03 -07:00
Jakub Sitnicki
b27f7bb590 flow_dissector: Move out netns_bpf prog callbacks
Move functions to manage BPF programs attached to netns that are not
specific to flow dissector to a dedicated module named
bpf/net_namespace.c.

The set of functions will grow with the addition of bpf_link support for
netns attached programs. This patch prepares ground by creating a place
for it.

This is a code move with no functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Jakub Sitnicki
a3fd7ceee0 net: Introduce netns_bpf for BPF programs attached to netns
In order to:

 (1) attach more than one BPF program type to netns, or
 (2) support attaching BPF programs to netns with bpf_link, or
 (3) support multi-prog attach points for netns

we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.

Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.

Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.

This is similar to how it is organized for cgroup with cgroup_bpf.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Linus Torvalds
533b220f7b arm64 updates for 5.8
- Branch Target Identification (BTI)
 	* Support for ARMv8.5-BTI in both user- and kernel-space. This
 	  allows branch targets to limit the types of branch from which
 	  they can be called and additionally prevents branching to
 	  arbitrary code, although kernel support requires a very recent
 	  toolchain.
 
 	* Function annotation via SYM_FUNC_START() so that assembly
 	  functions are wrapped with the relevant "landing pad"
 	  instructions.
 
 	* BPF and vDSO updates to use the new instructions.
 
 	* Addition of a new HWCAP and exposure of BTI capability to
 	  userspace via ID register emulation, along with ELF loader
 	  support for the BTI feature in .note.gnu.property.
 
 	* Non-critical fixes to CFI unwind annotations in the sigreturn
 	  trampoline.
 
 - Shadow Call Stack (SCS)
 	* Support for Clang's Shadow Call Stack feature, which reserves
 	  platform register x18 to point at a separate stack for each
 	  task that holds only return addresses. This protects function
 	  return control flow from buffer overruns on the main stack.
 
 	* Save/restore of x18 across problematic boundaries (user-mode,
 	  hypervisor, EFI, suspend, etc).
 
 	* Core support for SCS, should other architectures want to use it
 	  too.
 
 	* SCS overflow checking on context-switch as part of the existing
 	  stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.
 
 - CPU feature detection
 	* Removed numerous "SANITY CHECK" errors when running on a system
 	  with mismatched AArch32 support at EL1. This is primarily a
 	  concern for KVM, which disabled support for 32-bit guests on
 	  such a system.
 
 	* Addition of new ID registers and fields as the architecture has
 	  been extended.
 
 - Perf and PMU drivers
 	* Minor fixes and cleanups to system PMU drivers.
 
 - Hardware errata
 	* Unify KVM workarounds for VHE and nVHE configurations.
 
 	* Sort vendor errata entries in Kconfig.
 
 - Secure Monitor Call Calling Convention (SMCCC)
 	* Update to the latest specification from Arm (v1.2).
 
 	* Allow PSCI code to query the SMCCC version.
 
 - Software Delegated Exception Interface (SDEI)
 	* Unexport a bunch of unused symbols.
 
 	* Minor fixes to handling of firmware data.
 
 - Pointer authentication
 	* Add support for dumping the kernel PAC mask in vmcoreinfo so
 	  that the stack can be unwound by tools such as kdump.
 
 	* Simplification of key initialisation during CPU bringup.
 
 - BPF backend
 	* Improve immediate generation for logical and add/sub
 	  instructions.
 
 - vDSO
 	- Minor fixes to the linker flags for consistency with other
 	  architectures and support for LLVM's unwinder.
 
 	- Clean up logic to initialise and map the vDSO into userspace.
 
 - ACPI
 	- Work around for an ambiguity in the IORT specification relating
 	  to the "num_ids" field.
 
 	- Support _DMA method for all named components rather than only
 	  PCIe root complexes.
 
 	- Minor other IORT-related fixes.
 
 - Miscellaneous
 	* Initialise debug traps early for KGDB and fix KDB cacheflushing
 	  deadlock.
 
 	* Minor tweaks to early boot state (documentation update, set
 	  TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).
 
 	* Refactoring and cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl7U9csQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLBHCACs/YU4SM7Om5f+7QnxIKao5DBr2CnGGvdC
 yTfDghFDTLQVv3MufLlfno3yBe5G8sQpcZfcc+hewfcGoMzVZXu8s7LzH6VSn9T9
 jmT3KjDMrg0RjSHzyumJp2McyelTk0a4FiKArSIIKsJSXUyb1uPSgm7SvKVDwEwU
 JGDzL9IGilmq59GiXfDzGhTZgmC37QdwRoRxDuqtqWQe5CHoRXYexg87HwBKOQxx
 HgU9L7ehri4MRZfpyjaDrr6quJo3TVnAAKXNBh3mZAskVS9ZrfKpEH0kYWYuqybv
 znKyHRecl/rrGePV8RTMtrwnSdU26zMXE/omsVVauDfG9hqzqm+Q
 =w3qi
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "A sizeable pile of arm64 updates for 5.8.

  Summary below, but the big two features are support for Branch Target
  Identification and Clang's Shadow Call stack. The latter is currently
  arm64-only, but the high-level parts are all in core code so it could
  easily be adopted by other architectures pending toolchain support

  Branch Target Identification (BTI):

   - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
     branch targets to limit the types of branch from which they can be
     called and additionally prevents branching to arbitrary code,
     although kernel support requires a very recent toolchain.

   - Function annotation via SYM_FUNC_START() so that assembly functions
     are wrapped with the relevant "landing pad" instructions.

   - BPF and vDSO updates to use the new instructions.

   - Addition of a new HWCAP and exposure of BTI capability to userspace
     via ID register emulation, along with ELF loader support for the
     BTI feature in .note.gnu.property.

   - Non-critical fixes to CFI unwind annotations in the sigreturn
     trampoline.

  Shadow Call Stack (SCS):

   - Support for Clang's Shadow Call Stack feature, which reserves
     platform register x18 to point at a separate stack for each task
     that holds only return addresses. This protects function return
     control flow from buffer overruns on the main stack.

   - Save/restore of x18 across problematic boundaries (user-mode,
     hypervisor, EFI, suspend, etc).

   - Core support for SCS, should other architectures want to use it
     too.

   - SCS overflow checking on context-switch as part of the existing
     stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

  CPU feature detection:

   - Removed numerous "SANITY CHECK" errors when running on a system
     with mismatched AArch32 support at EL1. This is primarily a concern
     for KVM, which disabled support for 32-bit guests on such a system.

   - Addition of new ID registers and fields as the architecture has
     been extended.

  Perf and PMU drivers:

   - Minor fixes and cleanups to system PMU drivers.

  Hardware errata:

   - Unify KVM workarounds for VHE and nVHE configurations.

   - Sort vendor errata entries in Kconfig.

  Secure Monitor Call Calling Convention (SMCCC):

   - Update to the latest specification from Arm (v1.2).

   - Allow PSCI code to query the SMCCC version.

  Software Delegated Exception Interface (SDEI):

   - Unexport a bunch of unused symbols.

   - Minor fixes to handling of firmware data.

  Pointer authentication:

   - Add support for dumping the kernel PAC mask in vmcoreinfo so that
     the stack can be unwound by tools such as kdump.

   - Simplification of key initialisation during CPU bringup.

  BPF backend:

   - Improve immediate generation for logical and add/sub instructions.

  vDSO:

   - Minor fixes to the linker flags for consistency with other
     architectures and support for LLVM's unwinder.

   - Clean up logic to initialise and map the vDSO into userspace.

  ACPI:

   - Work around for an ambiguity in the IORT specification relating to
     the "num_ids" field.

   - Support _DMA method for all named components rather than only PCIe
     root complexes.

   - Minor other IORT-related fixes.

  Miscellaneous:

   - Initialise debug traps early for KGDB and fix KDB cacheflushing
     deadlock.

   - Minor tweaks to early boot state (documentation update, set
     TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

   - Refactoring and cleanup"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
  KVM: arm64: Check advertised Stage-2 page size capability
  arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
  ACPI/IORT: Remove the unused __get_pci_rid()
  arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
  arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
  arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
  arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
  arm64/cpufeature: Introduce ID_MMFR5 CPU register
  arm64/cpufeature: Introduce ID_DFR1 CPU register
  arm64/cpufeature: Introduce ID_PFR2 CPU register
  arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
  arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
  arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
  arm64: mm: Add asid_gen_match() helper
  firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
  arm64: vdso: Fix CFI directives in sigreturn trampoline
  arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
  ...
2020-06-01 15:18:27 -07:00
Jiri Olsa
958a3f2d2a bpf: Use tracing helpers for lsm programs
Currenty lsm uses bpf_tracing_func_proto helpers which do
not include stack trace or perf event output. It's useful
to have those for bpftrace lsm support [1].

Using tracing_prog_func_proto helpers for lsm programs.

[1] https://github.com/iovisor/bpftrace/pull/1347

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200531154255.896551-1-jolsa@kernel.org
2020-06-01 15:08:04 -07:00
Lorenzo Bianconi
1b698fa5d8 xdp: Rename convert_to_xdp_frame in xdp_convert_buff_to_frame
In order to use standard 'xdp' prefix, rename convert_to_xdp_frame
utility routine in xdp_convert_buff_to_frame and replace all the
occurrences

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/6344f739be0d1a08ab2b9607584c4d5478c8c083.1590698295.git.lorenzo@kernel.org
2020-06-01 15:02:53 -07:00
Denis Efremov
bb2359f4db bpf: Change kvfree to kfree in generic_map_lookup_batch()
buf_prevkey in generic_map_lookup_batch() is allocated with
kmalloc(). It's safe to free it with kfree().

Fixes: cb4d03ab49 ("bpf: Add generic support for lookup batch op")
Signed-off-by: Denis Efremov <efremov@linux.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200601162814.17426-1-efremov@linux.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:33 -07:00
David Ahern
64b59025c1 xdp: Add xdp_txq_info to xdp_buff
Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
moment only the device is added. Other fields (queue_index)
can be added as use cases arise.

>From a UAPI perspective, add egress_ifindex to xdp context for
bpf programs to see the Tx device.

Update the verifier to only allow accesses to egress_ifindex by
XDP programs with BPF_XDP_DEVMAP expected attach type.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
David Ahern
fbee97feed bpf: Add support to attach bpf program to a devmap entry
Add BPF_XDP_DEVMAP attach type for use with programs associated with a
DEVMAP entry.

Allow DEVMAPs to associate a program with a device entry by adding
a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
id, so the fd and id are a union. bpf programs can get access to the
struct via vmlinux.h.

The program associated with the fd must have type XDP with expected
attach type BPF_XDP_DEVMAP. When a program is associated with a device
index, the program is run on an XDP_REDIRECT and before the buffer is
added to the per-cpu queue. At this point rxq data is still valid; the
next patch adds tx device information allowing the prorgam to see both
ingress and egress device indices.

XDP generic is skb based and XDP programs do not work with skb's. Block
the use case by walking maps used by a program that is to be attached
via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with

Block attach of BPF_XDP_DEVMAP programs to devices.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
David Ahern
7f1c04269f devmap: Formalize map value as a named struct
Add 'struct bpf_devmap_val' to formalize the expected values that can
be passed in for a DEVMAP. Update devmap code to use the struct.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-2-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:31 -07:00
Yonghong Song
b36e62eb85 bpf: Use strncpy_from_unsafe_strict() in bpf_seq_printf() helper
In bpf_seq_printf() helper, when user specified a "%s" in the
format string, strncpy_from_unsafe() is used to read the actual string
to a buffer. The string could be a format string or a string in
the kernel data structure. It is really unlikely that the string
will reside in the user memory.

This is different from Commit b2a5212fb6 ("bpf: Restrict bpf_trace_printk()'s %s
usage and add %pks, %pus specifier") which still used
strncpy_from_unsafe() for "%s" to preserve the old behavior.

If in the future, bpf_seq_printf() indeed needs to read user
memory, we can implement "%pus" format string.

Based on discussion in [1], if the intent is to read kernel memory,
strncpy_from_unsafe_strict() should be used. So this patch
changed to use strncpy_from_unsafe_strict().

[1]: https://lore.kernel.org/bpf/20200521152301.2587579-1-hch@lst.de/T/

Fixes: 492e639f0c ("bpf: Add bpf_seq_printf and bpf_seq_write helpers")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200529004810.3352219-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:31 -07:00
Andrii Nakryiko
457f44363a bpf: Implement BPF ring buffer and verifier support for it
This commit adds a new MPSC ring buffer implementation into BPF ecosystem,
which allows multiple CPUs to submit data to a single shared ring buffer. On
the consumption side, only single consumer is assumed.

Motivation
----------
There are two distinctive motivators for this work, which are not satisfied by
existing perf buffer, which prompted creation of a new ring buffer
implementation.
  - more efficient memory utilization by sharing ring buffer across CPUs;
  - preserving ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer fails to satisfy both.
Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
also solved by having an MPSC implementation of ring buffer. The ordering
problem could technically be solved for perf buffer with some in-kernel
counting, but given the first one requires an MPSC buffer, the same solution
would solve the second problem automatically.

Semantics and APIs
------------------
Single ring buffer is presented to BPF programs as an instance of BPF map of
type BPF_MAP_TYPE_RINGBUF. Two other alternatives considered, but ultimately
rejected.

One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, make
BPF_MAP_TYPE_RINGBUF could represent an array of ring buffers, but not enforce
"same CPU only" rule. This would be more familiar interface compatible with
existing perf buffer use in BPF, but would fail if application needed more
advanced logic to lookup ring buffer by arbitrary key. HASH_OF_MAPS addresses
this with current approach. Additionally, given the performance of BPF
ringbuf, many use cases would just opt into a simple single ring buffer shared
among all CPUs, for which current approach would be an overkill.

Another approach could introduce a new concept, alongside BPF map, to
represent generic "container" object, which doesn't necessarily have key/value
interface with lookup/update/delete operations. This approach would add a lot
of extra infrastructure that has to be built for observability and verifier
support. It would also add another concept that BPF developers would have to
familiarize themselves with, new syntax in libbpf, etc. But then would really
provide no additional benefits over the approach of using a map.
BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but so
doesn't few other map types (e.g., queue and stack; array doesn't support
delete, etc).

The approach chosen has an advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc), being
familiar concept (no need to teach users a new type of object in BPF program),
and utilizing existing tooling (bpftool). For common scenario of using
a single ring buffer for all CPUs, it's as simple and straightforward, as
would be with a dedicated "container" object. On the other hand, by being
a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
implement a wide variety of topologies, from one ring buffer for each CPU
(e.g., as a replacement for perf buffer use cases), to a complicated
application hashing/sharding of ring buffers (e.g., having a small pool of
ring buffers with hashed task's tgid being a look up key to preserve order,
but reduce contention).

Key and value sizes are enforced to be zero. max_entries is used to specify
the size of ring buffer and has to be a power of 2 value.

There are a bunch of similarities between perf buffer
(BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
  - variable-length records;
  - if there is no more space left in ring buffer, reservation fails, no
    blocking;
  - memory-mappable data area for user-space applications for ease of
    consumption and high performance;
  - epoll notifications for new incoming data;
  - but still the ability to do busy polling for new data to achieve the
    lowest latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:
  - bpf_ringbuf_output() allows to *copy* data from one place to a ring
    buffer, similarly to bpf_perf_event_output();
  - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
    split the whole process into two steps. First, a fixed amount of space is
    reserved. If successful, a pointer to a data inside ring buffer data area
    is returned, which BPF programs can use similarly to a data inside
    array/hash maps. Once ready, this piece of memory is either committed or
    discarded. Discard is similar to commit, but makes consumer ignore the
    record.

bpf_ringbuf_output() has disadvantage of incurring extra memory copy, because
record has to be prepared in some other place first. But it allows to submit
records of the length that's not known to verifier beforehand. It also closely
matches bpf_perf_event_output(), so will simplify migration significantly.

bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory
pointer directly to ring buffer memory. In a lot of cases records are larger
than BPF stack space allows, so many programs have use extra per-CPU array as
a temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs
completely. But in exchange, it only allows a known constant size of memory to
be reserved, such that verifier can verify that BPF program can't access
memory outside its reserved record space. bpf_ringbuf_output(), while slightly
slower due to extra memory copy, covers some use cases that are not suitable
for bpf_ringbuf_reserve().

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary malloc()/free()
within single BPF program invocation.

Each reserved record is tracked by verifier through existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record, but forget to submit (or discard) it.

bpf_ringbuf_query() helper allows to query various properties of ring buffer.
Currently 4 are supported:
  - BPF_RB_AVAIL_DATA returns amount of unconsumed data in ring buffer;
  - BPF_RB_RING_SIZE returns the size of ring buffer;
  - BPF_RB_CONS_POS/BPF_RB_PROD_POS returns current logical possition of
    consumer/producer, respectively.
Returned values are momentarily snapshots of ring buffer state and could be
off by the time helper returns, so this should be used only for
debugging/reporting reasons or for implementing various heuristics, that take
into account highly-changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in ring buffer. Together with
BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for output/commit/discard helpers,
it allows BPF program a high degree of control and, e.g., more efficient
batched notifications. Default self-balancing strategy, though, should be
adequate for most applications and will work reliable and efficiently already.

Design and implementation
-------------------------
This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if BPF program was interruped by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during
reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock,
in which case reservation will fail even if ring buffer is not full.

The ring buffer itself internally is implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures, that's not a problem):
  - consumer counter shows up to which logical position consumer consumed the
    data;
  - producer counter denotes amount of data reserved by all producers.

Each time a record is reserved, producer that "owns" the record will
successfully advance producer counter. At that point, data is still not yet
ready to be consumed, though. Each record has 8 byte header, which contains
the length of reserved record, as well as two extra bits: busy bit to denote
that record is still being worked on, and discard bit, which might be set at
commit time if record is discarded. In the latter case, consumer is supposed
to skip the record and move on to the next one. Record header also encodes
record's relative offset from the beginning of ring buffer data area (in
pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only
the pointer to the record itself, without requiring also the pointer to ring
buffer itself. Ring buffer memory location will be restored from record
metadata header. This significantly simplifies verifier, as well as improving
API usability.

Producer counter increments are serialized under spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to consumer
in the order of reservations, but only after all previous records where
already committed. It is thus possible for slow producers to temporarily hold
off submitted records, that were reserved later.

Reservation/commit/consumer protocol is verified by litmus tests in
Documentation/litmus-test/bpf-rb.

One interesting implementation bit, that significantly simplifies (and thus
speeds up as well) implementation of both producers and consumers is how data
area is mapped twice contiguously back-to-back in the virtual memory. This
allows to not take any special measures for samples that have to wrap around
at the end of the circular buffer data area, because the next page after the
last data page would be first data page again, and thus the sample will still
appear completely contiguous in virtual memory. See comment and a simple ASCII
diagram showing this visually in bpf_ringbuf_area_alloc().

Another feature that distinguishes BPF ringbuf from perf ring buffer is
a self-pacing notifications of new data being availability.
bpf_ringbuf_commit() implementation will send a notification of new record
being available after commit only if consumer has already caught up right up
to the record being committed. If not, consumer still has to catch up and thus
will see new data anyways without needing an extra poll notification.
Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that
this allows to achieve a very high throughput without having to resort to
tricks like "notify only every Nth sample", which are necessary with perf
buffer. For extreme cases, when BPF program wants more manual control of
notifications, commit/discard/output helpers accept BPF_RB_NO_WAKEUP and
BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data
availability, but require extra caution and diligence in using this API.

Comparison to alternatives
--------------------------
Before considering implementing BPF ring buffer from scratch existing
alternatives in kernel were evaluated, but didn't seem to meet the needs. They
largely fell into few categores:
  - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
    outlined above (ordering and memory consumption);
  - linked list-based implementations; while some were multi-producer designs,
    consuming these from user-space would be very complicated and most
    probably not performant; memory-mapping contiguous piece of memory is
    simpler and more performant for user-space consumers;
  - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
    SPSC queue into MPSC w/ lock would have subpar performance compared to
    locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
    elements would be too limiting for BPF programs, given existing BPF
    programs heavily rely on variable-sized perf buffer already;
  - specialized implementations (like a new printk ring buffer, [0]) with lots
    of printk-specific limitations and implications, that didn't seem to fit
    well for intended use with BPF programs.

  [0] https://lwn.net/Articles/779550/

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:22 -07:00
Anton Protopopov
1ea0f9120c bpf: Fix map permissions check
The map_lookup_and_delete_elem() function should check for both FMODE_CAN_WRITE
and FMODE_CAN_READ permissions because it returns a map element to user space.

Fixes: bd513cd08f ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200527185700.14658-5-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:21 -07:00
John Fastabend
f470378c75 bpf: Extend bpf_base_func_proto helpers with probe_* and *current_task*
Often it is useful when applying policy to know something about the
task. If the administrator has CAP_SYS_ADMIN rights then they can
use kprobe + networking hook and link the two programs together to
accomplish this. However, this is a bit clunky and also means we have
to call both the network program and kprobe program when we could just
use a single program and avoid passing metadata through sk_msg/skb->cb,
socket, maps, etc.

To accomplish this add probe_* helpers to bpf_base_func_proto programs
guarded by a perfmon_capable() check. New supported helpers are the
following,

 BPF_FUNC_get_current_task
 BPF_FUNC_probe_read_user
 BPF_FUNC_probe_read_kernel
 BPF_FUNC_probe_read_user_str
 BPF_FUNC_probe_read_kernel_str

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159033905529.12355.4368381069655254932.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:20 -07:00
Chris Packham
0142dddcbe bpf: Fix spelling in comment explaining ARG1 in ___bpf_prog_run
Change 'handeled' to 'handled'.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200525230025.14470-1-chris.packham@alliedtelesis.co.nz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:20 -07:00
Jakub Sitnicki
fe537393b5 bpf: Fix returned error sign when link doesn't support updates
System calls encode returned errors as negative values. Fix a typo that
breaks this convention for bpf(LINK_UPDATE) when bpf_link doesn't support
update operation.

Fixes: f9d041271c ("bpf: Refactor bpf_link update handling")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200525122928.1164495-1-jakub@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:19 -07:00
Linus Torvalds
17e0a7cb6a Misc cleanups, with an emphasis on removing obsolete/dead code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VLcQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iFnhAArGBqco3C2RPQugv7UDDbKEaMvxOGrc5B
 kwnyOS/k/yeIkfhT9u11oBuLcaj/Zgw8YCjFyRfaNsorRqnytLyZzZ6PvdCCE3YU
 X3DVYgulcdAQnM4bS2e3Kt9ciJvFxB27XNm0AfuyLMUxMqCD+iIO4gJ6TuQNBYy3
 dfUMfB1R9OUDW13GCrASe+p1Dw76uaqVngdFWJhnC8Rm49E6gFXq7CLQp5Cka81I
 KZeJ8I6ug9p3gqhOIXdi+S6g5CM5jf86Wkk7dOHwHFH7CceFb3FIz7z0n1je4Wgd
 L5rYX7+PwfNeZ73GIuvEBN+agJH2K0H/KmnlWNWeZHzc+J12MeruSdSMBIkBOEpn
 iSbYAOmDpQLzBjTdZjC8bDqTZf472WrTh4VwN9NxHLucjdC+IqGoTAvnyyEOmZ5o
 R7sv7Q++316CVwRhYVXbzwZcqtiinCDE1EkP5nKTo9z3z0kMF5+ce/k7wn5sgZIk
 zJq3LXtaToiDoDRAPGxcvFPts9MdC0EI1aKTIjaK/n6i2h/SpJfrTKgANWaldYTe
 XJIqlSB43saqf5YAQ3/sY+wnpCRBmmCU+sfKja4C8bH7RuggI3mZS19uhFs0Qctq
 Yx5bIXVSBAIqjJtgzQ0WAAZ5LrCpNNyAzb35ZYefQlGyJlx1URKXVBmxa6S99biU
 KiYX7Dk5uhQ=
 =0ZQd
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, with an emphasis on removing obsolete/dead code"

* tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/spinlock: Remove obsolete ticket spinlock macros and types
  x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit
  x86/apb_timer: Drop unused declaration and macro
  x86/apb_timer: Drop unused TSC calibration
  x86/io_apic: Remove unused function mp_init_irq_at_boot()
  x86/mm: Stop printing BRK addresses
  x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()
  x86/nmi: Remove edac.h include leftover
  mm: Remove MPX leftovers
  x86/mm/mmap: Fix -Wmissing-prototypes warnings
  x86/early_printk: Remove unused includes
  crash_dump: Remove no longer used saved_max_pfn
  x86/smpboot: Remove the last ICPU() macro
2020-06-01 13:47:10 -07:00
Linus Torvalds
d861f6e682 Misc cleanups in the SMP hotplug and cross-call code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VJfsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ihcA/+Ko18kdGRPAlShM9qkDWO5N80p1LEp7F0
 ku1OxPAz9ii7K/jlnGr9wYYPxsIL3lbFeqFE7q5q5socXufaN8MUj9sVCmN7ScmR
 zO84aTHtxrJJhKIPM6HkUTbVl5KrQaud3F/J56CCjuKPsJWy9iuCGnKtfKK38bx+
 qJEfVKVm95Bv0NSEvqvci3DKKPYjzpKzuuttHXQ8Z80zG94FEkwj0JwZzttIjLl1
 rgRMgWTH7+3tQCMnZEfXG8xBxbXS9i3hKyr/v5QTNgIICyXGquPkf5MiwjJFS2Xb
 wpPqNh8HTo5kUJstYygRjcftatU7K72h2Rz/CoUkN2roNYlvRAhdBaBMwN0cGaG8
 pPhnLHHHRYZjl4fiROgRwVV3A6LcAHSrIcKzwGrvpCSpqyVozPGsmD/e8ZG1JYpC
 vxESTZbCDywng2Ls8jqQBut+dFGElvopXl1s004bCak89IFR4p15qojMJK2MSsqu
 BxhjIoqp8/f1fsAX+1p0RBEYnEr1KFtWa+nY8aVKL6bEx+Y7Qyq0ypMGtKavP06X
 VMcPMm1gYeXoGpLaTLYBRL5t7Rmm7i+xufuDQKUJetenfh2YS4aQ9lfV+rsQH1YE
 wavQrbwThfBZ9K1XkEmOkSqONysZ2YAtK9slKzciQIZvY3V8NbKAmBudCgqTgarp
 xqeW9NFfeFc=
 =Rr2n
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Ingo Molnar:
 "Misc cleanups in the SMP hotplug and cross-call code"

* tag 'smp-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Remove __freeze_secondary_cpus()
  cpu/hotplug: Remove disable_nonboot_cpus()
  cpu/hotplug: Fix a typo in comment "broadacasted"->"broadcasted"
  smp: Use smp_call_func_t in on_each_cpu()
2020-06-01 13:38:55 -07:00
Linus Torvalds
a7092c8204 Kernel side changes:
- Add AMD Fam17h RAPL support
   - Introduce CAP_PERFMON to kernel and user space
   - Add Zhaoxin CPU support
   - Misc fixes and cleanups
 
 Tooling changes:
 
   perf record:
 
     - Introduce --switch-output-event to use arbitrary events to be setup
       and read from a side band thread and, when they take place a signal
       be sent to the main 'perf record' thread, reusing the --switch-output
       code to take perf.data snapshots from the --overwrite ring buffer, e.g.:
 
 	# perf record --overwrite -e sched:* \
 		      --switch-output-event syscalls:*connect* \
 		      workload
 
       will take perf.data.YYYYMMDDHHMMSS snapshots up to around the
       connect syscalls.
 
     - Add --num-synthesize-threads option to control degree of parallelism of the
       synthesize_mmap() code which is scanning /proc/PID/task/PID/maps and can be
       time consuming. This mimics pre-existing behaviour in 'perf top'.
 
   perf bench:
 
     - Add a multi-threaded synthesize benchmark.
     - Add kallsyms parsing benchmark.
 
   Intel PT support:
 
     - Stitch LBR records from multiple samples to get deeper backtraces,
       there are caveats, see the csets for details.
     - Allow using Intel PT to synthesize callchains for regular events.
     - Add support for synthesizing branch stacks for regular events (cycles,
       instructions, etc) from Intel PT data.
 
   Misc changes:
 
     - Updated perf vendor events for power9 and Coresight.
     - Add flamegraph.py script via 'perf flamegraph'
     - Misc other changes, fixes and cleanups - see the Git log for details.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VJAcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hAYw/8DFtzGkMaaWkrDSj62LXtWQiqr1l01ZFt
 9GzV4aN4/go+K4BQtsQN8cUjOkRHFnOryLuD9LfSBfqsdjuiyTynV/cJkeUGQBck
 TT/GgWf3XKJzTUBRQRk367Gbqs9UKwBP8CdFhOXcNzGEQpjhbwwIDPmem94U4L1N
 XLsysgC45ejWL1kMTZKmk6hDIidlFeDg9j70WDPX1nNfCeisk25rxwTpdgvjsjcj
 3RzPRt2EGS+IkuF4QSCT5leYSGaCpVDHCQrVpHj57UoADfWAyC71uopTLG4OgYSx
 PVd9gvloMeeqWmroirIxM67rMd/TBTfVekNolhnQDjqp60Huxm+gGUYmhsyjNqdx
 Pb8HRZCBAudei9Ue4jNMfhCRK2Ug1oL5wNvN1xcSteAqrwMlwBMGHWns6l12x0ks
 BxYhyLvfREvnKijXc1o8D5paRgqohJgfnHlrUZeacyaw5hQCbiVRpwg0T1mWAF53
 u9hfWLY0Oy+Qs2C7EInNsWSYXRw8oPQNTFVx2I968GZqsEn4DC6Pt3ovWrDKIDnz
 ugoZJQkJ3/O8stYSMiyENehdWlo575NkapCTDwhLWnYztrw4skqqHE8ighU/e8ug
 o/Kx7ANWN9OjjjQpq2GVUeT0jCaFO+OMiGMNEkKoniYgYjogt3Gw5PeedBMtY07p
 OcWTiQZamjU=
 =i27M
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Add AMD Fam17h RAPL support

   - Introduce CAP_PERFMON to kernel and user space

   - Add Zhaoxin CPU support

   - Misc fixes and cleanups

  Tooling changes:

   - perf record:

     Introduce '--switch-output-event' to use arbitrary events to be
     setup and read from a side band thread and, when they take place a
     signal be sent to the main 'perf record' thread, reusing the core
     for '--switch-output' to take perf.data snapshots from the ring
     buffer used for '--overwrite', e.g.:

	# perf record --overwrite -e sched:* \
		      --switch-output-event syscalls:*connect* \
		      workload

     will take perf.data.YYYYMMDDHHMMSS snapshots up to around the
     connect syscalls.

     Add '--num-synthesize-threads' option to control degree of
     parallelism of the synthesize_mmap() code which is scanning
     /proc/PID/task/PID/maps and can be time consuming. This mimics
     pre-existing behaviour in 'perf top'.

   - perf bench:

     Add a multi-threaded synthesize benchmark and kallsyms parsing
     benchmark.

   - Intel PT support:

     Stitch LBR records from multiple samples to get deeper backtraces,
     there are caveats, see the csets for details.

     Allow using Intel PT to synthesize callchains for regular events.

     Add support for synthesizing branch stacks for regular events
     (cycles, instructions, etc) from Intel PT data.

  Misc changes:

   - Updated perf vendor events for power9 and Coresight.

   - Add flamegraph.py script via 'perf flamegraph'

   - Misc other changes, fixes and cleanups - see the Git log for details

  Also, since over the last couple of years perf tooling has matured and
  decoupled from the kernel perf changes to a large degree, going
  forward Arnaldo is going to send perf tooling changes via direct pull
  requests"

* tag 'perf-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (163 commits)
  perf/x86/rapl: Add AMD Fam17h RAPL support
  perf/x86/rapl: Make perf_probe_msr() more robust and flexible
  perf/x86/rapl: Flip logic on default events visibility
  perf/x86/rapl: Refactor to share the RAPL code between Intel and AMD CPUs
  perf/x86/rapl: Move RAPL support to common x86 code
  perf/core: Replace zero-length array with flexible-array
  perf/x86: Replace zero-length array with flexible-array
  perf/x86/intel: Add more available bits for OFFCORE_RESPONSE of Intel Tremont
  perf/x86/rapl: Add Ice Lake RAPL support
  perf flamegraph: Use /bin/bash for report and record scripts
  perf cs-etm: Move definition of 'traceid_list' global variable from header file
  libsymbols kallsyms: Move hex2u64 out of header
  libsymbols kallsyms: Parse using io api
  perf bench: Add kallsyms parsing
  perf: cs-etm: Update to build with latest opencsd version.
  perf symbol: Fix kernel symbol address display
  perf inject: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
  perf annotate: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
  perf trace: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
  perf script: Rename perf_evsel__*() operating on 'struct evsel *' to evsel__*()
  ...
2020-06-01 13:23:59 -07:00
Linus Torvalds
60056060be The biggest change to core locking facilities in this cycle is the introduction
of local_lock_t - this primitive comes from the -rt project and identifies
 CPU-local locking dependencies normally handled opaquely beind preempt_disable()
 or local_irq_save/disable() critical sections.
 
 The generated code on mainline kernels doesn't change as a result, but still there
 are benefits: improved debugging and better documentation of data structure
 accesses.
 
 The new local_lock_t primitives are introduced and then utilized in a couple of
 kernel subsystems. No change in functionality is intended.
 
 There's also other smaller changes and cleanups.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VAogRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h67BAAusYb44jJyZUE74rmaLnJr0c6j7eJ6twT
 8LKRwxb21Y35DMuX6M5ewmvnHiLFYmjL728z+y8O+SP8vb4PSJBX/75X+wsawIJB
 cjHdxonyynVVC4zcbdrc37FsrOiVoKLbbZcpqRzHksKkCq2PHbFVxBNvEaKHZCWW
 1jnq0MRy9wEJtW9EThDWPLD+OPWhBvocUFYJH4fiqCIaDiip/E16fz3i+yMPt545
 Jz4Ibnsq+G5Ehm1N2AkaZuK9V9nYv85E7Z/UNiK4mkDOApE6OMS+q3d86BhqgPg5
 g/HL3HNXAtIY74tBYAac5tAQglT+283LuTpEPt9BEjNM7QxKg/ecXO7lwtn7Boku
 dACMqeuMHbLyru8uhbun/VBx1gca7HIhW1cvXO5OoR7o78fHpEFivjJ0B0OuSYAI
 y+/DsA41OlkWSEnboUs+zTQgFatqxQPke92xpGOJtjVVZRYHRqxcPtw9WFmoVqWA
 HeczDQLcSUhqbKSfr6X9BO2u3qxys5BzmImTKMqXEQ4d8Kk0QXbJgGYGfS8+ASey
 Am/jwUP3Cvzs99NxLH5gECKRSuTx3rY7nRGaIBYa+Ui575bdSF8sVAF13riB2mBp
 NJq2Pw0D36WcX7ecaC2Fk2ezkphbeuAr8E7gh/Mt/oVxjrfwRGfPMrnIwKygUydw
 1W5x+WZ+WsY=
 =TBTY
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "The biggest change to core locking facilities in this cycle is the
  introduction of local_lock_t - this primitive comes from the -rt
  project and identifies CPU-local locking dependencies normally handled
  opaquely beind preempt_disable() or local_irq_save/disable() critical
  sections.

  The generated code on mainline kernels doesn't change as a result, but
  still there are benefits: improved debugging and better documentation
  of data structure accesses.

  The new local_lock_t primitives are introduced and then utilized in a
  couple of kernel subsystems. No change in functionality is intended.

  There's also other smaller changes and cleanups"

* tag 'locking-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  zram: Use local lock to protect per-CPU data
  zram: Allocate struct zcomp_strm as per-CPU memory
  connector/cn_proc: Protect send_msg() with a local lock
  squashfs: Make use of local lock in multi_cpu decompressor
  mm/swap: Use local_lock for protection
  radix-tree: Use local_lock for protection
  locking: Introduce local_lock()
  locking/lockdep: Replace zero-length array with flexible-array
  locking/rtmutex: Remove unused rt_mutex_cmpxchg_relaxed()
2020-06-01 13:03:31 -07:00
Linus Torvalds
2227e5b21a The RCU updates for this cycle were:
- RCU-tasks update, including addition of RCU Tasks Trace for
    BPF use and TASKS_RUDE_RCU
  - kfree_rcu() updates.
  - Remove scheduler locking restriction
  - RCU CPU stall warning updates.
  - Torture-test updates.
  - Miscellaneous fixes and other updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7U/r0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hSNxAAirKhPGBoLI9DW1qde4OFhZg+BlIpS+LD
 IE/0eGB8hGwhb1793RGbzIJfSnRQpSOPxWbWc6DJZ4Zpi5/ZbVkiPKsuXpM1xGxs
 kuBCTOhWy1/p3iCZ1JH/JCrCAdWGZkIzEoaV7ipnHtV/+UrRbCWH5PB7R0fYvcbI
 q5bUcWJyEp/bYMxQn8DhAih6SLPHx+F9qaGAqqloLSHstTYG2HkBhBGKnqcd/Jex
 twkLK53poCkeP/c08V1dyagU2IRWj2jGB1NjYh/Ocm+Sn/vru15CVGspjVjqO5FF
 oq07lad357ddMsZmKoM2F5DhXbOh95A+EqF9VDvIzCvfGMUgqYI1oxWF4eycsGhg
 /aYJgYuN23YeEe2DkDzJB67GvBOwl4WgdoFaxKRzOiCSfrhkM8KqM4G9Fz1JIepG
 abRJCF85iGcLslU9DkrShQiDsd/CRPzu/jz6ybK0I2II2pICo6QRf76T7TdOvKnK
 yXwC6OdL7/dwOht20uT6XfnDXMCWI4MutiUrb8/C1DbaihwEaI2denr3YYL+IwrB
 B38CdP6sfKZ5UFxKh0xb+sOzWrw0KA+ThSAXeJhz3tKdxdyB6nkaw3J9lFg8oi20
 XGeAujjtjMZG5cxt2H+wO9kZY0RRau/nTqNtmmRrCobd5yJjHHPHH8trEd0twZ9A
 X5Wjh11lv3E=
 =Yisx
 -----END PGP SIGNATURE-----

Merge tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Ingo Molnar:
 "The RCU updates for this cycle were:

   - RCU-tasks update, including addition of RCU Tasks Trace for BPF use
     and TASKS_RUDE_RCU

   - kfree_rcu() updates.

   - Remove scheduler locking restriction

   - RCU CPU stall warning updates.

   - Torture-test updates.

   - Miscellaneous fixes and other updates"

* tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
  rcu: Allow for smp_call_function() running callbacks from idle
  rcu: Provide rcu_irq_exit_check_preempt()
  rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
  rcu: Provide __rcu_is_watching()
  rcu: Provide rcu_irq_exit_preempt()
  rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
  rcu/tree: Mark the idle relevant functions noinstr
  x86: Replace ist_enter() with nmi_enter()
  x86/mce: Send #MC singal from task work
  x86/entry: Get rid of ist_begin/end_non_atomic()
  sched,rcu,tracing: Avoid tracing before in_nmi() is correct
  sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
  lockdep: Always inline lockdep_{off,on}()
  hardirq/nmi: Allow nested nmi_enter()
  arm64: Prepare arch_nmi_enter() for recursion
  printk: Disallow instrumenting print_nmi_enter()
  printk: Prepare for nested printk_nmi_enter()
  rcutorture: Convert ULONG_CMP_LT() to time_before()
  torture: Add a --kasan argument
  torture: Save a few lines by using config_override_param initially
  ...
2020-06-01 12:56:29 -07:00
Linus Torvalds
0bd957eb11 Various kprobes updates, mostly centered around cleaning up the no-instrumentation
logic, instead of the current per debug facility blacklist, use the more generic
 .noinstr.text approach, combined with a 'noinstr' marker for functions.
 
 Also add instrumentation_begin()/end() to better manage the exact place in entry
 code where instrumentation may be used.
 
 Also add a kprobes blacklist for modules.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7U/KERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1h6xg//bnWhJzrxlOr89d7c5pEUeZehTscZ4OxU
 HyiWnfgd6bHJGHiB8TRHZInJFys/Y0UG+xzQvCP2YCIHW42tguD3u0wQ1rOrA6im
 VkDxUwHn72avqnBq+knMwtqiKQjxJrPe+YpikWOgb4B+9jQwLARzTArhs+aoWBRn
 a9jRP1jcuS26F/9wxctFoHVvKZ7Vv+HCgtNzequHsd1e0J8ElvDRk+QkfkaZopl5
 cQ44TIfzR8xjJuGqW45hXwOw5PPjhZHwytSoFquSMb57txoWL2devn7S38VaCWv7
 /fqmQAnQqlW5eG5ipJ0zWY1n0uLZLRrIecfA1INY8fdJeFFr6cxaN6FM1GhVZ93I
 GjZZFYwxDv9IftpeSyCaIzF1zISV+as3r9sMKMt89us77XazRiobjWCi1aE9a1rX
 QRv1nTjmypWg65IMV+nfIT26riP6YXSZ3uXQJPwm+kzEjJJl0LSi2AfjWQadcHeZ
 Z8svSIepP4oJBJ9tJlZ3K7kHBV3E0G4SV3fnHaUYGrp9gheqhe33U0VWfILcvq7T
 zIhtZXzqRGaMKuw0IFy2xITCQyEZAXwTedtSSeyXt0CN/hwhaxbrd38HhKOBw8WH
 k+OAmXZ+lgSO5ZvkoxgV6QgHtjsif3ICcHNelJtcbRA80/3oj/QwJ5dAVR61EDZa
 3Jn8mMxvCn0=
 =25Vr
 -----END PGP SIGNATURE-----

Merge tag 'core-kprobes-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull kprobes updates from Ingo Molnar:
 "Various kprobes updates, mostly centered around cleaning up the
  no-instrumentation logic.

  Instead of the current per debug facility blacklist, use the more
  generic .noinstr.text approach, combined with a 'noinstr' marker for
  functions.

  Also add instrumentation_begin()/end() to better manage the exact
  place in entry code where instrumentation may be used.

  And add a kprobes blacklist for modules"

* tag 'core-kprobes-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Prevent probes in .noinstr.text section
  vmlinux.lds.h: Create section for protection against instrumentation
  samples/kprobes: Add __kprobes and NOKPROBE_SYMBOL() for handlers.
  kprobes: Support NOKPROBE_SYMBOL() in modules
  kprobes: Support __kprobes blacklist in modules
  kprobes: Lock kprobe_mutex while showing kprobe_blacklist
2020-06-01 12:45:04 -07:00
Linus Torvalds
ca1f5df23f Printk changes for 5.8
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl7U1TIACgkQUqAMR0iA
 lPI0mQ//TcVlRJgts/iwv0M2Simje28t9tziOHgWmEeiyGwE7vwDPDzt8QFiKzBa
 IrJ5iTRMtCrEF+eapqeH4g+Ve4Npm5Cobl8/h9JEiVu3SNC48TuaiUzU3+Bfl1zV
 vcDfZtN9QD1/CdLGlyKO75xjkCOaJRCFnx5ToXnd3llshMKI2XebUCnEH4TDe6Fz
 NGTjJL4kCPwzmae++UhlMfKwkayBtNbqcLkaTb7d67Tw2DcuuIVixUER77oC9QPN
 SfxdS07s0UVc4C9bCVe3KtYZR5YU/riOjKNJNutzP3JDtQNugywrrtI0qBwEisqM
 puMJ3xLeLssTn10FJxRK9ewRlXy2zT9mmcCuaVU6LtiyGnHOuEwIU+Ewu0itiJGu
 JuFovsNqTvOiZqFP7+pkDktOjffF2hsY/a6NxHr6aof8CrdO2w9dmCAgzGvnwV1N
 /zlmmPSEigVLz47eeivIIdQrPUejrEV7g1wOYYApnIlNCmGjdkGnXKNaUrLGDehQ
 QIlpx1uvmhntjiw1hTSbOV7KOQxLGtuy6DWLYC7uHD3H3aGDQyQO8dZYXSfOY1Qu
 cvQ/K8ykW0Kq2JKEHjkwSRHKDpxhvbQg7N5JCHQA49ahdZD3O5bOfeTofw+7nFO0
 j9g2Qv6nWFjLbAIAFtqjh6wz0UtGNoTl2cqQswsS10wlDcsq8vs=
 =pUBG
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

 - Benjamin Herrenschmidt solved a problem with non-matched console
   aliases by first checking consoles defined on the command line. It is
   a more conservative approach than the previous attempts.

 - Benjamin also made sure that the console accessible via /dev/console
   always has CON_CONSDEV flag.

 - Andy Shevchenko added the %ptT modifier for printing struct time64_t.
   It extends the existing %ptR handling for struct rtc_time.

 - Bruno Meneguele fixed /dev/kmsg error value returned by unsupported
   SEEK_CUR.

 - Tetsuo Handa removed unused pr_cont_once().

... and a few small fixes.

* tag 'printk-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Remove pr_cont_once()
  printk: handle blank console arguments passed in.
  kernel/printk: add kmsg SEEK_CUR handling
  printk: Fix a typo in comment "interator"->"iterator"
  usb: pulse8-cec: Switch to use %ptT
  ARM: bcm2835: Switch to use %ptT
  lib/vsprintf: Print time64_t in human readable format
  lib/vsprintf: update comment about simple_strto<foo>() functions
  printk: Correctly set CON_CONSDEV even when preferred console was not registered
  printk: Fix preferred console selection with multiple matches
  printk: Move console matching logic into a separate function
  printk: Convert a use of sprintf to snprintf in console_unlock
2020-06-01 12:13:30 -07:00
Linus Torvalds
829f3b9401 Fixes and new features for pstore
- refactor pstore locking for safer module unloading (Kees Cook)
 - remove orphaned records from pstorefs when backend unloaded (Kees Cook)
 - refactor dump_oops parameter into max_reason (Pavel Tatashin)
 - introduce pstore/zone for common code for contiguous storage (WeiXiong Liao)
 - introduce pstore/blk for block device backend (WeiXiong Liao)
 - introduce mtd backend (WeiXiong Liao)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl7UbYYWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJpkgD/9/09OkJIWydwk2lr2T89HW5fSF
 5uBT0a309/QDUpnV9yhcRsrESEicnvbtaGxD0kuYIInkiW/2cj1l689EkyRjUmy9
 q3z4GzLqOlC7qvd7LUPFNGHmllBb09H/CxmXDxRP3aynB9oHzdpNQdPcpLBDA00r
 0byp/AE48dFbKIhtT0QxpGUYZFOlyc7XVAaOkED4bmu148gx8q7MU1AxFgbx0Feb
 9iPV0r6XYMgXJZ3sn/3PJsxF0V/giDSJ8ui2xsYRjCE408zVIYLdDs2e8dz+2yW6
 +3Lyankgo+ofZc4XYExTYgn3WjhPFi+pjVRUaj+BcyTk9SLNIj2WmZdmcLMuzanh
 BaUurmED7ffTtlsH4PhQgn8/OY4FX2PO2MwUHwlU+87Y8YDiW0lpzTq5H822OO8p
 QQ8awql/6lLCJuyzuWIciVUsS65MCPxsZ4+LSiMZzyYpWu1sxrEY8ic3agzCgsA0
 0i+4nZFlLG+Aap/oiKpegenkIyAunn2tDXAyFJFH6qLOiZJ78iRuws3XZqjCElhJ
 XqvyDJIfjkJhWUb++ckeqX7ThOR4CPSnwba/7GHv7NrQWuk3Cn+GQ80oxydXUY6b
 2/4eYjq0wtvf9NeuJ4/LYNXotLR/bq9zS0zqwTWG50v+RPmuC3bNJB+RmF7fCiCG
 jo1Sd1LMeTQ7bnULpA==
 =7s1u
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "Fixes and new features for pstore.

  This is a pretty big set of changes (relative to past pstore pulls),
  but it has been in -next for a while. The biggest change here is the
  ability to support a block device as a pstore backend, which has been
  desired for a while. A lot of additional fixes and refactorings are
  also included, mostly in support of the new features.

   - refactor pstore locking for safer module unloading (Kees Cook)

   - remove orphaned records from pstorefs when backend unloaded (Kees
     Cook)

   - refactor dump_oops parameter into max_reason (Pavel Tatashin)

   - introduce pstore/zone for common code for contiguous storage
     (WeiXiong Liao)

   - introduce pstore/blk for block device backend (WeiXiong Liao)

   - introduce mtd backend (WeiXiong Liao)"

* tag 'pstore-v5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (35 commits)
  mtd: Support kmsg dumper based on pstore/blk
  pstore/blk: Introduce "best_effort" mode
  pstore/blk: Support non-block storage devices
  pstore/blk: Provide way to query pstore configuration
  pstore/zone: Provide way to skip "broken" zone for MTD devices
  Documentation: Add details for pstore/blk
  pstore/zone,blk: Add ftrace frontend support
  pstore/zone,blk: Add console frontend support
  pstore/zone,blk: Add support for pmsg frontend
  pstore/blk: Introduce backend for block devices
  pstore/zone: Introduce common layer to manage storage zones
  ramoops: Add "max-reason" optional field to ramoops DT node
  pstore/ram: Introduce max_reason and convert dump_oops
  pstore/platform: Pass max_reason to kmesg dump
  printk: Introduce kmsg_dump_reason_str()
  printk: honor the max_reason field in kmsg_dumper
  printk: Collapse shutdown types into a single dump reason
  pstore/ftrace: Provide ftrace log merging routine
  pstore/ram: Refactor ftrace buffer merging
  pstore/ram: Refactor DT size parsing
  ...
2020-06-01 12:07:34 -07:00
Linus Torvalds
81e8c10dac Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Introduce crypto_shash_tfm_digest() and use it wherever possible.
   - Fix use-after-free and race in crypto_spawn_alg.
   - Add support for parallel and batch requests to crypto_engine.

  Algorithms:
   - Update jitter RNG for SP800-90B compliance.
   - Always use jitter RNG as seed in drbg.

  Drivers:
   - Add Arm CryptoCell driver cctrng.
   - Add support for SEV-ES to the PSP driver in ccp"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
  crypto: hisilicon - fix driver compatibility issue with different versions of devices
  crypto: engine - do not requeue in case of fatal error
  crypto: cavium/nitrox - Fix a typo in a comment
  crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
  crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
  crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
  crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
  crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
  crypto: hisilicon/qm - add debugfs to the QM state machine
  crypto: hisilicon/qm - add debugfs for QM
  crypto: stm32/crc32 - protect from concurrent accesses
  crypto: stm32/crc32 - don't sleep in runtime pm
  crypto: stm32/crc32 - fix multi-instance
  crypto: stm32/crc32 - fix run-time self test issue.
  crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
  crypto: hisilicon/zip - Use temporary sqe when doing work
  crypto: hisilicon - add device error report through abnormal irq
  crypto: hisilicon - remove codes of directly report device errors through MSI
  crypto: hisilicon - QM memory management optimization
  crypto: hisilicon - unify initial value assignment into QM
  ...
2020-06-01 12:00:10 -07:00
Lai Jiangshan
10cdb15759 workqueue: use BUILD_BUG_ON() for compile time test instead of WARN_ON()
Any runtime WARN_ON() has to be fixed, and BUILD_BUG_ON() can
help you nitice it earlier.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-06-01 11:02:42 -04:00
Rafael J. Wysocki
be6018a44c Merge branches 'pm-core' and 'pm-sleep'
* pm-core:
  PM: runtime: Replace pm_runtime_callbacks_present()
  PM: runtime: clk: Fix clk_pm_runtime_get() error path
  PM: runtime: Make clear what we do when conditions are wrong in rpm_suspend()

* pm-sleep:
  PM: hibernate: Restrict writes to the resume device
  PM: hibernate: Split off snapshot dev option
  PM: hibernate: Incorporate concurrency handling
  PM: sleep: Helpful edits for devices.rst documentation
  Documentation: PM: sleep: Update driver flags documentation
  PM: sleep: core: Rename DPM_FLAG_LEAVE_SUSPENDED
  PM: sleep: core: Rename DPM_FLAG_NEVER_SKIP
  PM: sleep: core: Rename dev_pm_smart_suspend_and_suspended()
  PM: sleep: core: Rename dev_pm_may_skip_resume()
  PM: sleep: core: Rework the power.may_skip_resume handling
  PM: sleep: core: Do not skip callbacks in the resume phase
  PM: sleep: core: Fold functions into their callers
  PM: sleep: core: Simplify the SMART_SUSPEND flag handling
2020-06-01 15:19:08 +02:00
Steven Rostedt (VMware)
c200784a08 tracing: Add a trace print when traceoff_on_warning is triggered
When "traceoff_on_warning" is enabled and a warning happens, there can still
be many trace events happening on other CPUs between the time the warning
occurred and the last trace event on that same CPU. This can cause confusion
in examining the trace, as it may not be obvious where the warning happened.
By adding a trace print into the trace just before disabling tracing, it
makes it obvious where the warning occurred, and the developer doesn't have
to look at other means to see what CPU it occurred on.

Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-01 08:23:47 -04:00
Tom Zanussi
726721a518 tracing: Move synthetic events to a separate file
With the addition of the in-kernel synthetic event API, synthetic
events are no longer specifically tied to the histogram triggers.

The synthetic event code is also making trace_event_hist.c very
bloated, so for those reasons, move it to a separate file,
trace_events_synth.c, along with a new trace_synth.h header file.

Because synthetic events are now independent from hist triggers, add a
new CONFIG_SYNTH_EVENTS config option, and have CONFIG_HIST_TRIGGERS
select it, and have CONFIG_SYNTH_EVENT_GEN_TEST depend on it.

Link: http://lkml.kernel.org/r/4d1fa1f85ed5982706ac44844ac92451dcb04715.1590693308.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-01 08:23:22 -04:00
Tom Zanussi
2d19bd79ae tracing: Add hist_debug trace event files for histogram debugging
Add a new "hist_debug" file for each trace event, which when read will
dump out a bunch of internal details about the hist triggers defined
on that event.

This is normally off but can be enabled by saying 'y' to the new
CONFIG_HIST_TRIGGERS_DEBUG config option.

This is in support of the new Documentation file describing histogram
internals, Documentation/trace/histogram-design.rst, which was
requested by developers trying to understand the internals when
extending or making use of the hist triggers for higher-level tools.

The histogram-design.rst documentation refers to the hist_debug files
and demonstrates their use with output in the test examples.

Link: http://lkml.kernel.org/r/77914c22b0ba493d9783c53bbfbc6087d6a7e1b1.1585941485.git.zanussi@kernel.org

Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-01 08:22:30 -04:00
Tom Zanussi
1b94b3aed3 tracing: Check state.disabled in synth event trace functions
Since trace_state.disabled is set in __synth_event_trace_start() at
the same time -ENOENT is returned, don't bother returning -ENOENT -
just have callers check trace_state.disabled instead, and avoid the
extra return val munging.

Link: http://lkml.kernel.org/r/87315c3889af870e8370e82b76cf48b426d70130.1585941485.git.zanussi@kernel.org

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@godmis.org>
2020-06-01 08:21:08 -04:00
Ingo Molnar
cb3cb6733f Merge branch 'WIP.core/rcu' into core/rcu, to pick up two x86/entry dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-06-01 11:47:29 +02:00
Petr Mladek
d053cf0d77 Merge branch 'for-5.8' into for-linus 2020-06-01 10:15:16 +02:00
Petr Mladek
6a0af9fc8c Merge branch 'for-5.7-preferred-console' into for-linus 2020-06-01 10:13:51 +02:00
David S. Miller
1806c13dc2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31 17:48:46 -07:00
Linus Torvalds
3d04282329 A single scheduler fix preventing a crash in NUMA balancing. The
current->mm check is not reliable as the mm might be temporary
 due to use_mm() in a kthread. Check for PF_KTHREAD explictely.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7TtiATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoR52EACe2TGyJ2k5raj86CD7tWqdTXnSJu9+
 Q8njzbfwwJWc9SR9jxJBd7H6VQ5Kyd71Yeyi0RcIjpC0CtxdR2ZzYl/mHSwPSuku
 TKFo1WgwajRnny6daoNuJMmVZKxaZfOVTtf8ekJOjUrWKOWIJyUb5wvcgstdL7Uz
 ZMPIYmL5TpqreRI0gLYIpPoZQoE/Cja0eB45H3JocD0s+o0BZJhUql/DuXB+SUWO
 ANVz2RZ4dNM8LHBfXdQrWTq5G6Ckhr9pm0o0lQPeEcKmypOrG0l9p2qeQV59pFgD
 QS0jsAqjrmtnfHvIdATUThE7oBfimf2NX1Nqmf1l1BmMiQnL0u3+yuvibAQj2/aA
 J2WoQXKsJHTKEcZ2DGmjksOrzeognDMqXgXR7hZztKkAoBvEpmiebqxMgTv6XaoF
 ZhF5NAx+DUngg2uQd65UW8dbFSg9yQh+73+wItRiBSuNB8ePweqgop7OZVv/hOa6
 MiFBaWAzSTugR2E2LBbLuwMsB+4lLyUaXo4TeM2NLUVGurtKO3HL6xgo7bII+l+X
 QNn/PaNweT6OD2kEJb66wYWVWxN7JsryDIHaJq6e/j7u7JmKCYAxd7XD92sKkVoS
 JNsCDzcUNRao/UI0jsirntvWy9bS9J+HOUaSF5N0AcDOuJeZSffm8AVyr0nlUI0y
 HGYFGiSqS0I3EA==
 =GfbN
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix preventing a crash in NUMA balancing.

  The current->mm check is not reliable as the mm might be temporary due
  to use_mm() in a kthread. Check for PF_KTHREAD explictly"

* tag 'sched-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Don't NUMA balance for kthreads
2020-05-31 10:43:17 -07:00
Linus Torvalds
19835b1ba6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Another week, another set of bug fixes:

   1) Fix pskb_pull length in __xfrm_transport_prep(), from Xin Long.

   2) Fix double xfrm_state put in esp{4,6}_gro_receive(), also from Xin
      Long.

   3) Re-arm discovery timer properly in mac80211 mesh code, from Linus
      Lüssing.

   4) Prevent buffer overflows in nf_conntrack_pptp debug code, from
      Pablo Neira Ayuso.

   5) Fix race in ktls code between tls_sw_recvmsg() and
      tls_decrypt_done(), from Vinay Kumar Yadav.

   6) Fix crashes on TCP fallback in MPTCP code, from Paolo Abeni.

   7) More validation is necessary of untrusted GSO packets coming from
      virtualization devices, from Willem de Bruijn.

   8) Fix endianness of bnxt_en firmware message length accesses, from
      Edwin Peer.

   9) Fix infinite loop in sch_fq_pie, from Davide Caratti.

  10) Fix lockdep splat in DSA by setting lockless TX in netdev features
      for slave ports, from Vladimir Oltean.

  11) Fix suspend/resume crashes in mlx5, from Mark Bloch.

  12) Fix use after free in bpf fmod_ret, from Alexei Starovoitov.

  13) ARP retransmit timer guard uses wrong offset, from Hongbin Liu.

  14) Fix leak in inetdev_init(), from Yang Yingliang.

  15) Don't try to use inet hash and unhash in l2tp code, results in
      crashes. From Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
  l2tp: add sk_family checks to l2tp_validate_socket
  l2tp: do not use inet_hash()/inet_unhash()
  net: qrtr: Allocate workqueue before kernel_bind
  mptcp: remove msk from the token container at destruction time.
  mptcp: fix race between MP_JOIN and close
  mptcp: fix unblocking connect()
  net/sched: act_ct: add nat mangle action only for NAT-conntrack
  devinet: fix memleak in inetdev_init()
  virtio_vsock: Fix race condition in virtio_transport_recv_pkt
  drivers/net/ibmvnic: Update VNIC protocol version reporting
  NFC: st21nfca: add missed kfree_skb() in an error path
  neigh: fix ARP retransmit timer guard
  bpf, selftests: Add a verifier test for assigning 32bit reg states to 64bit ones
  bpf, selftests: Verifier bounds tests need to be updated
  bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones
  bpf: Fix use-after-free in fmod_ret check
  net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()
  net/mlx5e: Fix MLX5_TC_CT dependencies
  net/mlx5e: Properly set default values when disabling adaptive moderation
  net/mlx5e: Fix arch depending casting issue in FEC
  ...
2020-05-31 10:16:53 -07:00
Kees Cook
fb13cb8a04 printk: Introduce kmsg_dump_reason_str()
The pstore subsystem already had a private version of this function.
With the coming addition of the pstore/zone driver, this needs to be
shared. As it really should live with printk, move it there instead.

Link: https://lore.kernel.org/lkml/20200515184434.8470-4-keescook@chromium.org/
Acked-by: Petr Mladek <pmladek@suse.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Pavel Tatashin
b1f6f161b2 printk: honor the max_reason field in kmsg_dumper
kmsg_dump() allows to dump kmesg buffer for various system events: oops,
panic, reboot, etc. It provides an interface to register a callback
call for clients, and in that callback interface there is a field
"max_reason", but it was getting ignored when set to any "reason"
higher than KMSG_DUMP_OOPS unless "always_kmsg_dump" was passed as
kernel parameter.

Allow clients to actually control their "max_reason", and keep the
current behavior when "max_reason" is not set.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/lkml/20200515184434.8470-3-keescook@chromium.org/
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Kees Cook
6d3cf962dd printk: Collapse shutdown types into a single dump reason
To turn the KMSG_DUMP_* reasons into a more ordered list, collapse
the redundant KMSG_DUMP_(RESTART|HALT|POWEROFF) reasons into
KMSG_DUMP_SHUTDOWN. The current users already don't meaningfully
distinguish between them, so there's no need to, as discussed here:
https://lore.kernel.org/lkml/CA+CK2bAPv5u1ih5y9t5FUnTyximtFCtDYXJCpuyjOyHNOkRdqw@mail.gmail.com/

Link: https://lore.kernel.org/lkml/20200515184434.8470-2-keescook@chromium.org/
Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-05-30 10:34:03 -07:00
Thomas Gleixner
76fe06c1e6 irqchip updates for Linux 5.8:
- A few new drivers for the Loongson MIPS platform (HTVEC, PIC, MSI)
 - A cleanup of the __irq_domain_add() API
 - A cleanup of the IRQ simulator to actually use some of
   the irq infrastructure
 - Some fixes for the Sifive PLIC when used in a multi-controller
   context
 - Fixes for the GICv3 ITS to spread interrupts according to the
   load of each CPU, and to honor managed interrupts
 - Numerous cleanups and documentation fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl7Q/OQPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpD/vQQAKnqdQ/RD1anm7mHNXzzbEnd9zqoke7er1EM
 kkZioYJ7KAFEc5SMHiyWwoAfxBvJ1hh/9d48N3dXVdNNaBezpS1uSyviUP92ofzL
 ZS+aEEyPA3CLAkYzJKmMUQ2LrLpdMc4QM7G+nHgquk0UBT1aNpduDoMRqToPpJ5C
 ZXfVjhZ8gaM//sueB2aRZZUGmu8lAqXiNBUn+encBFfCVp1iydBlumwD//viTD/g
 BMO956DAfzoVqb/0n/ZqdftbNR5gtb8wb5POH8I36HOfmqUIdF78OuRhzCm/Odqf
 uV/eRQ4tDgMVM5PuHNYACGax2DWRRlHVK1wCXcEgj9Wh1p2a7moPOq0zZuREQaEw
 4DyIi2VeK1MDkWv4NrPSuVupicwAioha9wVeSnfm9SkKAo1GEj5gHs49MUKas11s
 ikVLzYCenRfdviu54P0axI1x5HCTvPTXuosPR8Sn7kMQhxQuZxMrYo+9X8WudVOU
 oEPmK1khRQaU+sf2+wPWvwWt6LMcVzTwWWsXarhWDN1H+budTulpxKsfb0b0rxCp
 2viQJ+ncCks1usFrsmuD0bqxyJShDgqFMz46z0R6PYYO/4bJFU5t4UVjIsW1bJZV
 Vrnkkf9Aw6OiAV5eR7hsRDlYw8tRGm8CZiLRr3szVu4Py9RC4KdoprPf3xn78qqV
 TIujPzK6
 =RX31
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core

Pull irqchip updates from Marc Zyngier:

 - A few new drivers for the Loongson MIPS platform (HTVEC, PIC, MSI)
 - A cleanup of the __irq_domain_add() API
 - A cleanup of the IRQ simulator to actually use some of
   the irq infrastructure
 - Some fixes for the Sifive PLIC when used in a multi-controller
   context
 - Fixes for the GICv3 ITS to spread interrupts according to the
   load of each CPU, and to honor managed interrupts
 - Numerous cleanups and documentation fixes
2020-05-30 09:40:12 +02:00
David S. Miller
f9e0ce3ddc Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-05-29

The following pull-request contains BPF updates for your *net* tree.

We've added 6 non-merge commits during the last 7 day(s) which contain
a total of 4 files changed, 55 insertions(+), 34 deletions(-).

The main changes are:

1) minor verifier fix for fmod_ret progs, from Alexei.

2) af_xdp overflow check, from Bjorn.

3) minor verifier fix for 32bit assignment, from John.

4) powerpc has non-overlapping addr space, from Petr.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 15:59:08 -07:00
John Fastabend
3a71dc366d bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones
With the latest trunk llvm (llvm 11), I hit a verifier issue for
test_prog subtest test_verif_scale1.

The following simplified example illustrate the issue:
    w9 = 0  /* R9_w=inv0 */
    r8 = *(u32 *)(r1 + 80)  /* __sk_buff->data_end */
    r7 = *(u32 *)(r1 + 76)  /* __sk_buff->data */
    ......
    w2 = w9 /* R2_w=inv0 */
    r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */
    r6 += r2 /* R6_w=inv(id=0) */
    r3 = r6 /* R3_w=inv(id=0) */
    r3 += 14 /* R3_w=inv(id=0) */
    if r3 > r8 goto end
    r5 = *(u32 *)(r6 + 0) /* R6_w=inv(id=0) */
       <== error here: R6 invalid mem access 'inv'
    ...
  end:

In real test_verif_scale1 code, "w9 = 0" and "w2 = w9" are in
different basic blocks.

In the above, after "r6 += r2", r6 becomes a scalar, which eventually
caused the memory access error. The correct register state should be
a pkt pointer.

The inprecise register state starts at "w2 = w9".
The 32bit register w9 is 0, in __reg_assign_32_into_64(),
the 64bit reg->smax_value is assigned to be U32_MAX.
The 64bit reg->smin_value is 0 and the 64bit register
itself remains constant based on reg->var_off.

In adjust_ptr_min_max_vals(), the verifier checks for a known constant,
smin_val must be equal to smax_val. Since they are not equal,
the verifier decides r6 is a unknown scalar, which caused later failure.

The llvm10 does not have this issue as it generates different code:
    w9 = 0  /* R9_w=inv0 */
    r8 = *(u32 *)(r1 + 80)  /* __sk_buff->data_end */
    r7 = *(u32 *)(r1 + 76)  /* __sk_buff->data */
    ......
    r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */
    r6 += r9 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */
    r3 = r6 /* R3_w=pkt(id=0,off=0,r=0,imm=0) */
    r3 += 14 /* R3_w=pkt(id=0,off=14,r=0,imm=0) */
    if r3 > r8 goto end
    ...

To fix the above issue, we can include zero in the test condition for
assigning the s32_max_value and s32_min_value to their 64-bit equivalents
smax_value and smin_value.

Further, fix the condition to avoid doing zero extension bounds checks
when s32_min_value <= 0. This could allow for the case where bounds
32-bit bounds (-1,1) get incorrectly translated to (0,1) 64-bit bounds.
When in-fact the -1 min value needs to force U32_MAX bound.

Fixes: 3f50f132d8 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159077331983.6014.5758956193749002737.stgit@john-Precision-5820-Tower
2020-05-29 13:34:06 -07:00
Alexei Starovoitov
18644cec71 bpf: Fix use-after-free in fmod_ret check
Fix the following issue:
[  436.749342] BUG: KASAN: use-after-free in bpf_trampoline_put+0x39/0x2a0
[  436.749995] Write of size 4 at addr ffff8881ef38b8a0 by task kworker/3:5/2243
[  436.750712]
[  436.752677] Workqueue: events bpf_prog_free_deferred
[  436.753183] Call Trace:
[  436.756483]  bpf_trampoline_put+0x39/0x2a0
[  436.756904]  bpf_prog_free_deferred+0x16d/0x3d0
[  436.757377]  process_one_work+0x94a/0x15b0
[  436.761969]
[  436.762130] Allocated by task 2529:
[  436.763323]  bpf_trampoline_lookup+0x136/0x540
[  436.763776]  bpf_check+0x2872/0xa0a8
[  436.764144]  bpf_prog_load+0xb6f/0x1350
[  436.764539]  __do_sys_bpf+0x16d7/0x3720
[  436.765825]
[  436.765988] Freed by task 2529:
[  436.767084]  kfree+0xc6/0x280
[  436.767397]  bpf_trampoline_put+0x1fd/0x2a0
[  436.767826]  bpf_check+0x6832/0xa0a8
[  436.768197]  bpf_prog_load+0xb6f/0x1350
[  436.768594]  __do_sys_bpf+0x16d7/0x3720

prog->aux->trampoline = tr should be set only when prog is valid.
Otherwise prog freeing will try to put trampoline via prog->aux->trampoline,
but it may not point to a valid trampoline.

Fixes: 6ba43b761c ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200529043839.15824-2-alexei.starovoitov@gmail.com
2020-05-29 22:25:58 +02:00
Lai Jiangshan
b8f06b0444 workqueue: remove useless unlock() and lock() in series
This is no point to unlock() and then lock() the same mutex
back to back.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:25:23 -04:00
Lai Jiangshan
4f3f4cf388 workqueue: void unneeded requeuing the pwq in rescuer thread
008847f66c ("workqueue: allow rescuer thread to do more work.") made
the rescuer worker requeue the pwq immediately if there may be more
work items which need rescuing instead of waiting for the next mayday
timer expiration.  Unfortunately, it checks only whether the pool needs
help from rescuers, but it doesn't check whether the pwq has work items
in the pool (the real reason that this rescuer can help for the pool).

The patch adds the check and void unneeded requeuing.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:22:10 -04:00
Sebastian Andrzej Siewior
a9b8a98529 workqueue: Convert the pool::lock and wq_mayday_lock to raw_spinlock_t
The workqueue code has it's internal spinlocks (pool::lock), which
are acquired on most workqueue operations. These spinlocks are
converted to 'sleeping' spinlocks on a RT-kernel.

Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.

The pool::lock hold times are bound and the code sections are
relatively short, which allows to convert pool::lock and as a
consequence wq_mayday_lock to raw spinlocks which are truly spinning
locks even on a PREEMPT_RT kernel.

With the previous conversion of the manager waitqueue to a simple
waitqueue workqueues are now fully RT compliant.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:03:47 -04:00
Sebastian Andrzej Siewior
d8bb65ab70 workqueue: Use rcuwait for wq_manager_wait
The workqueue code has it's internal spinlock (pool::lock) and also
implicit spinlock usage in the wq_manager waitqueue. These spinlocks
are converted to 'sleeping' spinlocks on a RT-kernel.

Workqueue functions can be invoked from contexts which are truly atomic
even on a PREEMPT_RT enabled kernel. Taking sleeping locks from such
contexts is forbidden.

pool::lock can be converted to a raw spinlock as the lock held times
are short. But the workqueue manager waitqueue is handled inside of
pool::lock held regions which again violates the lock nesting rules
of raw and regular spinlocks.

The manager waitqueue has no special requirements like custom wakeup
callbacks or mass wakeups. While it does not use exclusive wait mode
explicitly there is no strict requirement to queue the waiters in a
particular order as there is only one waiter at a time.

This allows to replace the waitqueue with rcuwait which solves the
locking problem because rcuwait relies on existing locking.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-29 10:00:35 -04:00
Boris Burkov
936f2a70f2 cgroup: add cpu.stat file to root cgroup
Currently, the root cgroup does not have a cpu.stat file. Add one which
is consistent with /proc/stat to capture global cpu statistics that
might not fall under cgroup accounting.

We haven't done this in the past because the data are already presented
in /proc/stat and we didn't want to add overhead from collecting root
cgroup stats when cgroups are configured, but no cgroups have been
created.

By keeping the data consistent with /proc/stat, I think we avoid the
first problem, while improving the usability of cgroups stats.
We avoid the second problem by computing the contents of cpu.stat from
existing data collected for /proc/stat anyway.

Signed-off-by: Boris Burkov <boris@bur.io>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-28 10:06:35 -04:00
Marek Vasut
1d0326f352 genirq: Check irq_data_get_irq_chip() return value before use
irq_data_get_irq_chip() can return NULL, however it is expected that this
never happens. If a buggy driver leads to NULL being returned from
irq_data_get_irq_chip(), warn about it instead of crashing the machine.

Signed-off-by: Marek Vasut <marex@denx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

To: linux-arm-kernel@lists.infradead.org
2020-05-28 15:58:04 +02:00
Ingo Molnar
1f8db41505 sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
Move the prototypes for sched_ttwu_pending() and send_call_function_single_ipi()
into the newly created kernel/sched/smp.h header, to make sure they are all
the same, and to architectures happy that use -Wmissing-prototypes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 11:03:20 +02:00
Peter Zijlstra
a148866489 sched: Replace rq::wake_list
The recent commit: 90b5363acd ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in ttwu_queue_remote() got this wrong.

While on first reading ttwu_queue_remote() has an atomic test-and-set
that appears to serialize the use, the matching 'release' is not in
the right place to actually guarantee this serialization.

The actual race is vs the sched_ttwu_pending() call in the idle loop;
that can run the wakeup-list without consuming the CSD.

Instead of trying to chain the lists, merge them.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.129371594@infradead.org
2020-05-28 10:54:16 +02:00
Peter Zijlstra
126c2092e5 sched: Add rq::ttwu_pending
In preparation of removing rq->wake_list, replace the
!list_empty(rq->wake_list) with rq->ttwu_pending. This is not fully
equivalent as this new variable is racy.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.070399698@infradead.org
2020-05-28 10:54:16 +02:00
Peter Zijlstra
4b44a21dd6 irq_work, smp: Allow irq_work on call_single_queue
Currently irq_work_queue_on() will issue an unconditional
arch_send_call_function_single_ipi() and has the handler do
irq_work_run().

This is unfortunate in that it makes the IPI handler look at a second
cacheline and it misses the opportunity to avoid the IPI. Instead note
that struct irq_work and struct __call_single_data are very similar in
layout, so use a few bits in the flags word to encode a type and stick
the irq_work on the call_single_queue list.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161908.011635912@infradead.org
2020-05-28 10:54:15 +02:00
Peter Zijlstra
b2a02fc43a smp: Optimize send_call_function_single_ipi()
Just like the ttwu_queue_remote() IPI, make use of _TIF_POLLING_NRFLAG
to avoid sending IPIs to idle CPUs.

[ mingo: Fix UP build bug. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161907.953304789@infradead.org
2020-05-28 10:54:15 +02:00
Peter Zijlstra
afaa653c56 smp: Move irq_work_run() out of flush_smp_call_function_queue()
This ensures flush_smp_call_function_queue() is strictly about
call_single_queue.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161907.895109676@infradead.org
2020-05-28 10:54:15 +02:00
Peter Zijlstra
52103be07d smp: Optimize flush_smp_call_function_queue()
The call_single_queue can contain (two) different callbacks,
synchronous and asynchronous. The current interrupt handler runs them
in-order, which means that remote CPUs that are waiting for their
synchronous call can be delayed by running asynchronous callbacks.

Rework the interrupt handler to first run the synchonous callbacks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200526161907.836818381@infradead.org
2020-05-28 10:54:15 +02:00
Peter Zijlstra
19a1f5ec69 sched: Fix smp_call_function_single_async() usage for ILB
The recent commit: 90b5363acd ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in kick_ilb() got this wrong.

While on first reading kick_ilb() has an atomic test-and-set that
appears to serialize the use, the matching 'release' is not in the
right place to actually guarantee this serialization.

Rework the nohz_idle_balance() trigger so that the release is in the
IPI callback and thus guarantees the required serialization for the
CSD.

Fixes: 90b5363acd ("sched: Clean up scheduler_ipi()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20200526161907.778543557@infradead.org
2020-05-28 10:54:15 +02:00
Ingo Molnar
58ef57b16d Merge branch 'core/rcu' into sched/core, to pick up dependency
We are going to rely on the loosening of RCU callback semantics,
introduced by this commit:

  806f04e9fd: ("rcu: Allow for smp_call_function() running callbacks from idle")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 10:52:53 +02:00
Ingo Molnar
498bdcdb94 Merge branch 'sched/urgent' into sched/core, to pick up fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 10:52:37 +02:00
Peter Zijlstra
806f04e9fd rcu: Allow for smp_call_function() running callbacks from idle
Current RCU hard relies on smp_call_function() callbacks running from
interrupt context. A pending optimization is going to break that, it
will allow idle CPUs to run the callbacks from the idle loop. This
avoids raising the IPI on the requesting CPU and avoids handling an
exception on the receiving CPU.

Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
provided it is the idle task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20200527171236.GC706495@hirez.programming.kicks-ass.net
2020-05-28 10:50:12 +02:00
Ingo Molnar
4f470fff67 Linux 5.7-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7K9iEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGzTAH/0ifZEG4BQ8x/WlB
 8YLSLE6QQTSXYi25nyExuJbFkkKY5Tik8M2HD/36xwY/HnZOlH9jH6m0ntqZxpaA
 3EU9lr1ct79nCBMYhiJssvz8d9AOZXlyogFW9y2y9pmPjlmUtseZ7yGh1xD465cj
 B5Ty2w2W34cs7zF3og2xn5agOJMtWWXLXZ5mRa9EOquKC5zeYyRicmd0T+plYQD6
 hbRYmxFfDfppVnBCBARPNN0+NU5JJD94H+8bOuf1tl48XNrLiZMOicmtohKNQ6+W
 rZNpJNEGEp7KMtqWH0Nl3hmy3yfZHMwe1DXM/AZDqR7jTHZY4mZ0GEpLyfI9AU4n
 34jVHwU=
 =SmJ9
 -----END PGP SIGNATURE-----

Merge tag 'v5.7-rc7' into WIP.locking/core, to refresh the tree

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 10:30:40 +02:00
Ingo Molnar
0bffedbce9 Linux 5.7-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7K9iEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGzTAH/0ifZEG4BQ8x/WlB
 8YLSLE6QQTSXYi25nyExuJbFkkKY5Tik8M2HD/36xwY/HnZOlH9jH6m0ntqZxpaA
 3EU9lr1ct79nCBMYhiJssvz8d9AOZXlyogFW9y2y9pmPjlmUtseZ7yGh1xD465cj
 B5Ty2w2W34cs7zF3og2xn5agOJMtWWXLXZ5mRa9EOquKC5zeYyRicmd0T+plYQD6
 hbRYmxFfDfppVnBCBARPNN0+NU5JJD94H+8bOuf1tl48XNrLiZMOicmtohKNQ6+W
 rZNpJNEGEp7KMtqWH0Nl3hmy3yfZHMwe1DXM/AZDqR7jTHZY4mZ0GEpLyfI9AU4n
 34jVHwU=
 =SmJ9
 -----END PGP SIGNATURE-----

Merge tag 'v5.7-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 07:58:12 +02:00
Linus Torvalds
3301f6ae2d Merge branch 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Reverted stricter synchronization for cgroup recursive stats which
   was prepping it for event counter usage which never got merged. The
   change was causing performation regressions in some cases.

 - Restore bpf-based device-cgroup operation even when cgroup1 device
   cgroup is disabled.

 - An out-param init fix.

* 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  device_cgroup: Cleanup cgroup eBPF device filter code
  xattr: fix uninitialized out-param
  Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"
2020-05-27 10:58:19 -07:00
Domenico Andreoli
ad1e4f74c0 PM: hibernate: Restrict writes to the resume device
Hibernation via snapshot device requires write permission to the swap
block device, the one that more often (but not necessarily) is used to
store the hibernation image.

With this patch, such permissions are granted iff:

 1) snapshot device config option is enabled
 2) swap partition is used as resume device

In other circumstances the swap device is not writable from userspace.

In order to achieve this, every write attempt to a swap device is
checked against the device configured as part of the uswsusp API [0]
using a pointer to the inode struct in memory. If the swap device being
written was not configured for resuming, the write request is denied.

NOTE: this implementation works only for swap block devices, where the
inode configured by swapon (which sets S_SWAPFILE) is the same used
by SNAPSHOT_SET_SWAP_AREA.

In case of swap file, SNAPSHOT_SET_SWAP_AREA indeed receives the inode
of the block device containing the filesystem where the swap file is
located (+ offset in it) which is never passed to swapon and then has
not set S_SWAPFILE.

As result, the swap file itself (as a file) has never an option to be
written from userspace. Instead it remains writable if accessed directly
from the containing block device, which is always writeable from root.

[0] Documentation/power/userland-swsusp.rst

v2:
 - rename is_hibernate_snapshot_dev() to is_hibernate_resume_dev()
 - fix description so to correctly refer to the resume device

Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-27 17:55:59 +02:00
Zhang Qiang
342ed2400b workqueue: Remove unnecessary kfree() call in rcu_free_wq()
The data structure member "wq->rescuer" was reset to a null pointer
in one if branch. It was passed to a call of the function "kfree"
in the callback function "rcu_free_wq" (which was eventually executed).
The function "kfree" does not perform more meaningful data processing
for a passed null pointer (besides immediately returning from such a call).
Thus delete this function call which became unnecessary with the referenced
software update.

Fixes: def98c84b6 ("workqueue: Fix spurious sanity check failures in destroy_workqueue()")

Suggested-by: Markus Elfring <Markus.Elfring@web.de>
Signed-off-by: Zhang Qiang <qiang.zhang@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-27 09:52:41 -04:00
Dan Williams
3234ac664a /dev/mem: Revoke mappings when a driver claims the region
Close the hole of holding a mapping over kernel driver takeover event of
a given address range.

Commit 90a545e981 ("restrict /dev/mem to idle io memory ranges")
introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
kernel against scenarios where a /dev/mem user tramples memory that a
kernel driver owns. However, this protection only prevents *new* read(),
write() and mmap() requests. Established mappings prior to the driver
calling request_mem_region() are left alone.

Especially with persistent memory, and the core kernel metadata that is
stored there, there are plentiful scenarios for a /dev/mem user to
violate the expectations of the driver and cause amplified damage.

Teach request_mem_region() to find and shoot down active /dev/mem
mappings that it believes it has successfully claimed for the exclusive
use of the driver. Effectively a driver call to request_mem_region()
becomes a hole-punch on the /dev/mem device.

The typical usage of unmap_mapping_range() is part of
truncate_pagecache() to punch a hole in a file, but in this case the
implementation is only doing the "first half" of a hole punch. Namely it
is just evacuating current established mappings of the "hole", and it
relies on the fact that /dev/mem establishes mappings in terms of
absolute physical address offsets. Once existing mmap users are
invalidated they can attempt to re-establish the mapping, or attempt to
continue issuing read(2) / write(2) to the invalidated extent, but they
will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
block those subsequent accesses.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 90a545e981 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27 11:10:05 +02:00
Zefan Li
6b6ebb3474 cgroup: Remove stale comments
- The default root is where we can create v2 cgroups.
- The __DEVEL__sane_behavior mount option has been removed long long ago.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-26 13:20:24 -04:00
Thomas Gleixner
07325d4a90 rcu: Provide rcu_irq_exit_check_preempt()
Provide a debug check which can be invoked from exception return to kernel
mode before an attempt is made to schedule. Warn if RCU is not ready for
this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20200521202117.089709607@linutronix.de
2020-05-26 19:05:11 +02:00
Paul E. McKenney
aaf2bc50df rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
There will likely be exception handlers that can sleep, which rules
out the usual approach of invoking rcu_nmi_enter() on entry and also
rcu_nmi_exit() on all exit paths.  However, the alternative approach of
just not calling anything can prevent RCU from coaxing quiescent states
from nohz_full CPUs that are looping in the kernel:  RCU must instead
IPI them explicitly.  It would be better to enable the scheduler tick
on such CPUs to interact with RCU in a lighter-weight manner, and this
enabling is one of the things that rcu_nmi_enter() currently does.

What is needed is something that helps RCU coax quiescent states while
not preventing subsequent sleeps.  This commit therefore splits out the
nohz_full scheduler-tick enabling from the rest of the rcu_nmi_enter()
logic into a new function named rcu_irq_enter_check_tick().

[ tglx: Renamed the function and made it a nop when context tracking is off ]
[ mingo: Fixed a CONFIG_NO_HZ_FULL assumption, harmonized and fixed all the
         comment blocks and cleaned up rcu_nmi_enter()/exit() definitions. ]

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200521202116.996113173@linutronix.de
2020-05-26 19:04:18 +02:00
Jens Axboe
18f855e574 sched/fair: Don't NUMA balance for kthreads
Stefano reported a crash with using SQPOLL with io_uring:

  BUG: kernel NULL pointer dereference, address: 00000000000003b0
  CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
  RIP: 0010:task_numa_work+0x4f/0x2c0
  Call Trace:
   task_work_run+0x68/0xa0
   io_sq_thread+0x252/0x3d0
   kthread+0xf9/0x130
   ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL.

The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.

Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt ot balance for kthreads anyway.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
2020-05-26 18:34:58 +02:00
Arnd Bergmann
502afe7f04 Qualcomm driver updates for v5.8
This contains a large set of cleanups, bug fixes, general improvements
 and documentation fixes for the RPMH driver. It adds a debugfs mechanism
 for inspecting Command DB. Socinfo got the "soc_id" attribute defines
 and definitions for a various variants of MSM8939.
 
 RPMH, RPMPD and RPMHPD where made possible to build as modules, but RPMH
 had to be reverted due to a compilation issue when tracing is enabled.
 
 RPMHPD gained power-domains for the SM8250 voltage corners.
 
 The SCM driver gained fixes for two build warnings and the SMP2P had an
 unnecessary error print removed.
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAl7DZcEbHGJqb3JuLmFu
 ZGVyc3NvbkBsaW5hcm8ub3JnAAoJEAsfOT8Nma3FEwIQAK1TrENzsRjB23fY4pEW
 +hN/SkfMjNPsinmyNOHCo03MQzdFIKUl40aNvaHh3foQXaSG4TW12iot9Ul5nsxn
 /u6dCSzl15FK7pHYj/VQPWTz2WvpANVqsm6G5tf43hBg2TnStbK1AsxgJ6aq47fp
 QHehMbfeKpF/gltEowv1b+H7xwNFY7eqlQ9O9umYm3hUQh3Bl5mI6PdbkazDdO1j
 l/vHuQKkZXRdHtD1BxGBfvhPtM4NWDbOPeWWrw8HRFM5muDPgK9mXMRGDcv+fpHq
 4I3670xiTWK1Mfz9+FRBMoxLkIWT6zXILg9aNzuZMTOLoNRt3s6VNdne6uf3naND
 2Nrcu7t0b+4xWGbdwwpiAFZm8C6l+R1WTCiY4iOJXDychhVvyVwOLzPdZ4i8Pzx9
 a4UJDElu8xj2g++oCcjK816IdNwkA46bM7qVz4mkHRUBMkmK9AU7okBA9HQfbP64
 MrKPaUljyZuy7zonpJJBBDKLDEFa9a1pI8sehU8p+MUldxYB/4f/iGC3KzyDvv4Q
 Uzj6AqdP5pjgIxz98YgpOPPl8pwg1c5OgblNLWoHj4yTaqUzT0ZHIRyQO/O/+3Lg
 HwCrSy+f3+tgzts3DhyVRmOkOXFHfflSdf4rfKtbiq435yB2dz60dt77dJ0jQuBV
 M9MiF+SE9oASUupPhs47kNte
 =tkhb
 -----END PGP SIGNATURE-----

Merge tag 'qcom-drivers-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/drivers

Qualcomm driver updates for v5.8

This contains a large set of cleanups, bug fixes, general improvements
and documentation fixes for the RPMH driver. It adds a debugfs mechanism
for inspecting Command DB. Socinfo got the "soc_id" attribute defines
and definitions for a various variants of MSM8939.

RPMH, RPMPD and RPMHPD where made possible to build as modules, but RPMH
had to be reverted due to a compilation issue when tracing is enabled.

RPMHPD gained power-domains for the SM8250 voltage corners.

The SCM driver gained fixes for two build warnings and the SMP2P had an
unnecessary error print removed.

* tag 'qcom-drivers-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (42 commits)
  Revert "soc: qcom: rpmh: Allow RPMH driver to be loaded as a module"
  soc: qcom: rpmh-rsc: Remove the pm_lock
  soc: qcom: rpmh-rsc: Simplify locking by eliminating the per-TCS lock
  kernel/cpu_pm: Fix uninitted local in cpu_pm
  soc: qcom: rpmh-rsc: We aren't notified of our own failure w/ NOTIFY_BAD
  soc: qcom: rpmh-rsc: Correctly ignore CPU_CLUSTER_PM notifications
  firmware: qcom_scm-legacy: Replace zero-length array with flexible-array
  soc: qcom: rpmh-rsc: Timeout after 1 second in write_tcs_reg_sync()
  soc: qcom: rpmh-rsc: Factor "tcs_reg_addr" and "tcs_cmd_addr" calculation
  soc: qcom: socinfo: add msm8936/39 and apq8036/39 soc ids
  soc: qcom: aoss: Add SM8250 compatible
  soc: qcom: pdr: Remove impossible error condition
  soc: qcom: rpmh: Dirt can only make you dirtier, not cleaner
  soc: qcom: rpmhpd: Add SM8250 power domains
  firmware: qcom_scm: fix bogous abuse of dma-direct internals
  dt-bindings: soc: qcom: apr: Use generic node names for APR services
  firmware: qcom_scm: Remove unneeded conversion to bool
  soc: qcom: cmd-db: Properly endian swap the slv_id for debugfs
  soc: qcom: cmd-db: Use 5 digits for printing address
  soc: qcom: cmd-db: Cast sizeof() to int to silence field width warning
  ...

Link: https://lore.kernel.org/r/20200519052533.1250024-1-bjorn.andersson@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-05-25 23:19:06 +02:00
Greg Kroah-Hartman
344235f557 Merge 5.7-rc7 into tty-next
We need the tty/serial fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-25 13:22:05 +02:00
Mel Gorman
2ebb177175 sched/core: Offload wakee task activation if it the wakee is descheduling
The previous commit:

  c6e7bd7afa: ("sched/core: Optimize ttwu() spinning on p->on_cpu")

avoids spinning on p->on_rq when the task is descheduling, but only if the
wakee is on a CPU that does not share cache with the waker.

This patch offloads the activation of the wakee to the CPU that is about to
go idle if the task is the only one on the runqueue. This potentially allows
the waker task to continue making progress when the wakeup is not strictly
synchronous.

This is very obvious with netperf UDP_STREAM running on localhost. The
waker is sending packets as quickly as possible without waiting for any
reply. It frequently wakes the server for the processing of packets and
when netserver is using local memory, it quickly completes the processing
and goes back to idle. The waker often observes that netserver is on_rq
and spins excessively leading to a drop in throughput.

This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning
on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and
localwakelist-v1r2 respectively.

                                  5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                    vanilla           optttwu-v1r1     localwakelist-v1r2
Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*

Note that this benefit is not universal to all wakeups, it only applies
to the case where the waker often spins on p->on_rq.

The impact can be seen from a "perf sched latency" report generated from
a single iteration of one packet size:

   -----------------------------------------------------------------------------------------------------------------
    Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
   -----------------------------------------------------------------------------------------------------------------

  vanilla
    netperf:4337          |  21709.193 ms |     2932 | avg:    0.002 ms | max:    0.041 ms | max at:    112.154512 s
    netserver:4338        |  14629.459 ms |  5146990 | avg:    0.001 ms | max: 1615.864 ms | max at:    140.134496 s

  localwakelist-v1r2
    netperf:4339          |  29789.717 ms |     2460 | avg:    0.002 ms | max:    0.059 ms | max at:    138.205389 s
    netserver:4340        |  18858.767 ms |  7279005 | avg:    0.001 ms | max:    0.362 ms | max at:    135.709683 s
   -----------------------------------------------------------------------------------------------------------------

Note that the average wakeup delay is quite small on both the vanilla
kernel and with the two patches applied. However, there are significant
outliers with the vanilla kernel with the maximum one measured as 1615
milliseconds with a vanilla kernel but never worse than 0.362 ms with
both patches applied and a much higher rate of context switching.

Similarly a separate profile of cycles showed that 2.83% of all cycles
were spent in try_to_wake_up() with almost half of the cycles spent
on spinning on p->on_rq. With the two patches, the percentage of cycles
spent in try_to_wake_up() drops to 1.13%

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-3-mgorman@techsingularity.net
2020-05-25 07:04:10 +02:00
Peter Zijlstra
c6e7bd7afa sched/core: Optimize ttwu() spinning on p->on_cpu
Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared, this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net
2020-05-25 07:01:44 +02:00
David S. Miller
13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
Linus Torvalds
9e61d12bac A set of fixes for the scheduler:
- Fix handling of throttled parents in enqueue_task_fair() completely. The
    recent fix overlooked a corner case where the first iteration terminates
    do a entiry being on rq which makes the list management incomplete and
    later triggers the assertion which checks for completeness.
 
  - Fix a similar problem in unthrottle_cfs_rq().
 
  - Show the correct uclamp values in procfs which prints the effective
    value twice instead of requested and effective.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7Ki3QTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVA5D/9d3ajxbUD7Qzyr9LW4dQ/GLdVrLC+h
 rav3KskFzF+v9qyEz5j6ZUEe6ZtGyqk3n+xlVCoBjEZonygaZ3A9rjn5p6p+junR
 ueUHQM9BYOQ2qX/rnQubpgzphfT2BdNXtrT0aWCQdXjbDwTJcDhS8AMjDAOnPntc
 Pj7brhP/lPtFF2JubBshlJGWCHdriALOJGyFw+FftBq6yotsFel/EH9i/5++JSFX
 3DmcFZXJMIf7cRk9xWzmP2QoOkyV6KU9FD9zlRxpvLOskT7qJ7HXaCgFLHl/HZPD
 Ukqmy8Ua8hGbKseqfuyz9vV/xJazdiEyqPOSnd8wJzk5umw2rplknOk2qpJ+Infv
 MLjnfgrjNtBvgCq4lHnmGeTvdjktLPuPusQIVjZzqis6RryZiNsNspuE5QPZP4Fh
 87/rfYQ7VoAGTRmeC9+t4X0BSnpkT4KkQvWCHtocmzl4nuym0cgfZxOCrnRW+NEN
 LeDzgujT8uRaDyeZcTwg6pysfxR2Kod6mWomnT9t17ZD93EaY/FEPUL+g8+VbDyD
 mAu1BiENX3DL5erZKBHcvHbniYVkcZ/l/cgX6o2wcFLuREaEEbHgsXYHdLqZd9QY
 eNwbldtPlagy+f2jla84O1/fFcM1za/R6V9Mbb+/86xJeNu2R/+Vv0FiOqpmHW24
 G9UW+hQcQ+ZxgA==
 =akes
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "A set of fixes for the scheduler:

   - Fix handling of throttled parents in enqueue_task_fair() completely.

     The recent fix overlooked a corner case where the first iteration
     terminates due to an entity already being on the runqueue which
     makes the list management incomplete and later triggers the
     assertion which checks for completeness.

   - Fix a similar problem in unthrottle_cfs_rq().

   - Show the correct uclamp values in procfs which prints the effective
     value twice instead of requested and effective"

* tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
  sched/debug: Fix requested task uclamp values shown in procfs
  sched/fair: Fix enqueue_task_fair() warning some more
2020-05-24 10:14:58 -07:00
Shreyas Joshi
48021f9813 printk: handle blank console arguments passed in.
If uboot passes a blank string to console_setup then it results in
a trashed memory. Ultimately, the kernel crashes during freeing up
the memory.

This fix checks if there is a blank parameter being
passed to console_setup from uboot. In case it detects that
the console parameter is blank then it doesn't setup the serial
device and it gracefully exits.

Link: https://lore.kernel.org/r/20200522065306.83-1-shreyas.joshi@biamp.com
Signed-off-by: Shreyas Joshi <shreyas.joshi@biamp.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: Better format the commit message and code, remove unnecessary brackets.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-22 10:34:34 +02:00
John Fastabend
cac616db39 bpf: Verifier track null pointer branch_taken with JNE and JEQ
Currently, when considering the branches that may be taken for a jump
instruction if the register being compared is a pointer the verifier
assumes both branches may be taken. But, if the jump instruction
is comparing if a pointer is NULL we have this information in the
verifier encoded in the reg->type so we can do better in these cases.
Specifically, these two common cases can be handled.

 * If the instruction is BPF_JEQ and we are comparing against a
   zero value. This test is 'if ptr == 0 goto +X' then using the
   type information in reg->type we can decide if the ptr is not
   null. This allows us to avoid pushing both branches onto the
   stack and instead only use the != 0 case. For example
   PTR_TO_SOCK and PTR_TO_SOCK_OR_NULL encode the null pointer.
   Note if the type is PTR_TO_SOCK_OR_NULL we can not learn anything.
   And also if the value is non-zero we learn nothing because it
   could be any arbitrary value a different pointer for example

 * If the instruction is BPF_JNE and ware comparing against a zero
   value then a similar analysis as above can be done. The test in
   asm looks like 'if ptr != 0 goto +X'. Again using the type
   information if the non null type is set (from above PTR_TO_SOCK)
   we know the jump is taken.

In this patch we extend is_branch_taken() to consider this extra
information and to return only the branch that will be taken. This
resolves a verifier issue reported with C code like the following.
See progs/test_sk_lookup_kern.c in selftests.

 sk = bpf_sk_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
 bpf_printk("sk=%d\n", sk ? 1 : 0);
 if (sk)
   bpf_sk_release(sk);
 return sk ? TC_ACT_OK : TC_ACT_UNSPEC;

In the above the bpf_printk() will resolve the pointer from
PTR_TO_SOCK_OR_NULL to PTR_TO_SOCK. Then the second test guarding
the release will cause the verifier to walk both paths resulting
in the an unreleased sock reference. See verifier/ref_tracking.c
in selftests for an assembly version of the above.

After the above additional logic is added the C code above passes
as expected.

Reported-by: Andrey Ignatov <rdna@fb.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159009164651.6313.380418298578070501.stgit@john-Precision-5820-Tower
2020-05-21 17:44:25 -07:00
Björn Töpel
d20a1676df xsk: Move xskmap.c to net/xdp/
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
code. Also, move AF_XDP struct definitions, and function declarations
only used by AF_XDP internals into net/xdp/xsk.h.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
J. Bruce Fields
6670ee2ef2 Merge branch 'nfsd-5.8' of git://linux-nfs.org/~cel/cel-2.6 into for-5.8-incoming
Highlights of this series:
* Remove serialization of sending RPC/RDMA Replies
* Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for
RPC-on-TLS)
* Fix svcrdma backchannel sendto return code
* Convert a number of dprintk call sites to use tracepoints
* Fix the "suggest braces around empty body in an 'else' statement" warning
2020-05-21 10:58:15 -04:00
Bruno Meneguele
8ece3b3eb5 kernel/printk: add kmsg SEEK_CUR handling
Userspace libraries, e.g. glibc's dprintf(), perform a SEEK_CUR operation
over any file descriptor requested to make sure the current position isn't
pointing to junk due to previous manipulation of that same fd. And whenever
that fd doesn't have support for such operation, the userspace code expects
-ESPIPE to be returned.

However, when the fd in question references the /dev/kmsg interface, the
current kernel code state returns -EINVAL instead, causing an unexpected
behavior in userspace: in the case of glibc, when -ESPIPE is returned it
gets ignored and the call completes successfully, while returning -EINVAL
forces dprintf to fail without performing any action over that fd:

  if (_IO_SEEKOFF (fp, (off64_t)0, _IO_seek_cur, _IOS_INPUT|_IOS_OUTPUT) ==
  _IO_pos_BAD && errno != ESPIPE)
    return NULL;

With this patch we make sure to return the correct value when SEEK_CUR is
requested over kmsg and also add some kernel doc information to formalize
this behavior.

Link: https://lore.kernel.org/r/20200317103344.574277-1-bmeneg@redhat.com
Cc: linux-kernel@vger.kernel.org
Cc: rostedt@goodmis.org,
Cc: David.Laight@ACULAB.COM
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-21 13:32:25 +02:00
Ethon Paul
325606af57 printk: Fix a typo in comment "interator"->"iterator"
There is a typo in comment, fix it.

Signed-off-by: Ethon Paul <ethp@qq.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-21 13:31:33 +02:00
Andy Shevchenko
9ed78b05f9 irqdomain: Allow software nodes for IRQ domain creation
In some cases we need to have an IRQ domain created out of software node.

One of such cases is DesignWare GPIO driver when it's instantiated from
half-baked ACPI table (alas, we can't fix it for devices which are few years
on market) and thus using software nodes to quirk this. But the driver
is using IRQ domains based on per GPIO port firmware nodes, which are in
the above case software ones. This brings a warning message to be printed

  [   73.957183] irq: Invalid fwnode type for irqdomain

and creates an anonymous IRQ domain without a debugfs entry.

Allowing software nodes to be valid for IRQ domains rids us of the warning
and debugs gets correctly populated.

  % ls -1 /sys/kernel/debug/irq/domains/
  ...
  intel-quark-dw-apb-gpio:portA

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[maz: refactored commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-3-andriy.shevchenko@linux.intel.com
2020-05-21 10:53:17 +01:00
Andy Shevchenko
87526603c8 irqdomain: Get rid of special treatment for ACPI in __irq_domain_add()
Now that __irq_domain_add() is able to better deals with generic
fwnodes, there is no need to special-case ACPI anymore.

Get rid of the special treatment for ACPI.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-2-andriy.shevchenko@linux.intel.com
2020-05-21 10:51:50 +01:00
Andy Shevchenko
181e9d4efa irqdomain: Make __irq_domain_add() less OF-dependent
__irq_domain_add() relies in some places on the fact that the fwnode
can be only of type OF. This prevents refactoring of the code to support
other types of fwnode. Make it less OF-dependent by switching it
to use the fwnode directly where it makes sense.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-1-andriy.shevchenko@linux.intel.com
2020-05-21 10:50:30 +01:00
Andrii Nakryiko
dfeb376dd4 bpf: Prevent mmap()'ing read-only maps as writable
As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to
be frozen and is read-only on BPF program side, because that allows user-space
to actually store a writable view to the page even after it is frozen. This is
exacerbated by BPF verifier making a strong assumption that contents of such
frozen map will remain unchanged. To prevent this, disallow mapping
BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever.

  [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/

Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
2020-05-20 20:21:53 -07:00
Richard Guy Briggs
9d44a121c5 audit: add subj creds to NETFILTER_CFG record to
Some table unregister actions seem to be initiated by the kernel to
garbage collect unused tables that are not initiated by any userspace
actions.  It was found to be necessary to add the subject credentials to
cover this case to reveal the source of these actions.  A sample record:

The uid, auid, tty, ses and exe fields have not been included since they
are in the SYSCALL record and contain nothing useful in the non-user
context.

Here are two sample orphaned records:

  type=NETFILTER_CFG msg=audit(2020-05-20 12:14:36.505:5) : table=filter family=ipv4 entries=0 op=register pid=1 subj=kernel comm=swapper/0

  type=NETFILTER_CFG msg=audit(2020-05-20 12:15:27.701:301) : table=nat family=bridge entries=0 op=unregister pid=30 subj=system_u:system_r:kernel_t:s0 comm=kworker/u4:1

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-05-20 18:09:19 -04:00
Eric W. Biederman
87b047d2be exec: Teach prepare_exec_creds how exec treats uids & gids
It is almost possible to use the result of prepare_exec_creds with no
modifications during exec.  Update prepare_exec_creds to initialize
the suid and the fsuid to the euid, and the sgid and the fsgid to the
egid.  This is all that is needed to handle the common case of exec
when nothing special like a setuid exec is happening.

That this preserves the existing behavior of exec can be verified
by examing bprm_fill_uid and cap_bprm_set_creds.

This change makes it clear that the later parts of exec that
update bprm->cred are just need to handle special cases such
as setuid exec and change of domains.

Link: https://lkml.kernel.org/r/871rng22dm.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-20 14:44:21 -05:00
Christoph Hellwig
c928f642c2 fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool
where true means success and false means failure.

[AV: braino fix folded in]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:14:10 -04:00
Christoph Hellwig
b8d9e7f241 fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
76887c2567 fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
6797d97ab9 trace: remove tracing_pipe_buf_ops
tracing_pipe_buf_ops has identical ops to default_pipe_buf_ops, so use
that instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Michael Ellerman
217ba7dcce Merge branch 'topic/uaccess-ppc' into next
Merge our uaccess-ppc topic branch. It is based on the uaccess topic
branch that we're sharing with Viro.

This includes the addition of user_[read|write]_access_begin(), as
well as some powerpc specific changes to our uaccess routines that
would conflict badly if merged separately.
2020-05-20 23:37:33 +10:00
Paolo Bonzini
9d5272f5e3 Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD 2020-05-20 03:40:09 -04:00
Julia Lawall
fc9d276f22 tracing/probe: reverse arguments to list_add
Elsewhere in the file, the function trace_kprobe_has_same_kprobe uses
a trace_probe_event.probes object as the second argument of
list_for_each_entry, ie as a list head, while the list_for_each_entry
iterates over the list fields of the trace_probe structures, making
them the list elements.  So, exchange the arguments on the list_add
call to put the list head in the second argument.

Since both list_head structures were just initialized, this problem
did not cause any loss of information.

Link: https://lkml.kernel.org/r/1588879808-24488-1-git-send-email-Julia.Lawall@inria.fr

Fixes: 60d53e2c3b ("tracing/probe: Split trace_event related data from trace_probe")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-19 21:10:50 -04:00
Cheng Jian
c143b7753b ftrace: show debugging information when panic_on_warn set
When an anomaly is detected in the function call modification
code, ftrace_bug() is called to disable function tracing as well
as give some warn and information that may help debug the problem.

But currently, we call FTRACE_WARN_ON_ONCE() first in ftrace_bug(),
so when panic_on_warn is set, we can't see the debugging information
here. Call FTRACE_WARN_ON_ONCE() at the end of ftrace_bug() to ensure
that the debugging information is displayed first.

after this patch, the dmesg looks like:

	------------[ ftrace bug ]------------
	ftrace failed to modify
	[<ffff800010081004>] bcm2835_handle_irq+0x4/0x58
	 actual:   1f:20:03:d5
	Setting ftrace call site to call ftrace function
	ftrace record flags: 80000001
	 (1)
	 expected tramp: ffff80001009d6f0
	------------[ cut here ]------------
	WARNING: CPU: 2 PID: 1635 at kernel/trace/ftrace.c:2078 ftrace_bug+0x204/0x238
	Kernel panic - not syncing: panic_on_warn set ...
	CPU: 2 PID: 1635 Comm: sh Not tainted 5.7.0-rc5-00033-gb922183867f5 #14
	Hardware name: linux,dummy-virt (DT)
	Call trace:
	 dump_backtrace+0x0/0x1b0
	 show_stack+0x20/0x30
	 dump_stack+0xc0/0x10c
	 panic+0x16c/0x368
	 __warn+0x120/0x160
	 report_bug+0xc8/0x160
	 bug_handler+0x28/0x98
	 brk_handler+0x70/0xd0
	 do_debug_exception+0xcc/0x1ac
	 el1_sync_handler+0xe4/0x120
	 el1_sync+0x7c/0x100
	 ftrace_bug+0x204/0x238

Link: https://lkml.kernel.org/r/20200515100828.7091-1-cj.chengjian@huawei.com

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-19 21:08:01 -04:00
Gustavo A. R. Silva
db78538c75 locking/lockdep: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507185804.GA15036@embeddedor
2020-05-19 20:34:18 +02:00
Gustavo A. R. Silva
c50c75e9b8 perf/core: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
2020-05-19 20:34:16 +02:00
Huaixin Chang
d505b8af58 sched: Defend cfs and rt bandwidth quota against overflow
When users write some huge number into cpu.cfs_quota_us or
cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of
schedulable checks.

to_ratio() could be altered to avoid unnecessary internal overflow, but
min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still
be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to
prevent overflow.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
2020-05-19 20:34:14 +02:00
Muchun Song
dbe9337109 sched/cpuacct: Fix charge cpuacct.usage_sys
The user_mode(task_pt_regs(tsk)) always return true for
user thread, and false for kernel thread. So it means that
the cpuacct.usage_sys is the time that kernel thread uses
not the time that thread uses in the kernel mode. We can
try get_irq_regs() first, if it is NULL, then we can fall
back to task_pt_regs().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200420070453.76815-1-songmuchun@bytedance.com
2020-05-19 20:34:14 +02:00
Gustavo A. R. Silva
04f5c362ec sched/fair: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507192141.GA16183@embeddedor
2020-05-19 20:34:14 +02:00
Vincent Guittot
95d685935a sched/pelt: Sync util/runnable_sum with PELT window when propagating
update_tg_cfs_*() propagate the impact of the attach/detach of an entity
down into the cfs_rq hierarchy and must keep the sync with the current pelt
window.

Even if we can't sync child cfs_rq and its group se, we can sync the group
se and its parent cfs_rq with current position in the PELT window. In fact,
we must keep them sync in order to stay also synced with others entities
and group entities that are already attached to the cfs_rq.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506155301.14288-1-vincent.guittot@linaro.org
2020-05-19 20:34:14 +02:00
Muchun Song
12aa258738 sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
The cpuacct_charge() and cpuacct_account_field() are called with
rq->lock held, and this means preemption(and IRQs) are indeed
disabled, so it is safe to use __this_cpu_*() to allow for better
code-generation.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507031039.32615-1-songmuchun@bytedance.com
2020-05-19 20:34:13 +02:00
Vincent Guittot
7d148be69e sched/fair: Optimize enqueue_task_fair()
enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) is
throttled which means that se can't be NULL in such case and we can move
the label after the if (!se) statement. Futhermore, the latter can be
removed because se is always NULL when reaching this point.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200513135502.4672-1-vincent.guittot@linaro.org
2020-05-19 20:34:13 +02:00
Peter Zijlstra
9013196a46 Merge branch 'sched/urgent' 2020-05-19 20:34:12 +02:00
Vincent Guittot
39f23ce07b sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
are quite close and follow the same sequence for enqueuing an entity in the
cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
enqueue_task_fair(). This fixes a problem already faced with the latter and
add an optimization in the last for_each_sched_entity loop.

Fixes: fe61468b2c (sched/fair: Fix enqueue_task_fair warning)
Reported-by Tao Zhou <zohooouoto@zoho.com.cn>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
2020-05-19 20:34:10 +02:00
Pavankumar Kondeti
ad32bb41fc sched/debug: Fix requested task uclamp values shown in procfs
The intention of commit 96e74ebf8d ("sched/debug: Add task uclamp
values to SCHED_DEBUG procfs") was to print requested and effective
task uclamp values. The requested values printed are read from p->uclamp,
which holds the last effective values. Fix this by printing the values
from p->uclamp_req.

Fixes: 96e74ebf8d ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs")
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org
2020-05-19 20:34:10 +02:00
Phil Auld
b34cb07dde sched/fair: Fix enqueue_task_fair() warning some more
sched/fair: Fix enqueue_task_fair warning some more

The recent patch, fe61468b2c (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list.  In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se does not point to the sched_entity which broke out
of the first loop. The list is not fixed up because the throttled parent was
already added back to the list by a task enqueue in a parallel child hierarchy.

Address this by calling list_add_leaf_cfs_rq if there are throttled parents
while doing the second for_each_sched_entity loop.

Fixes: fe61468b2c ("sched/fair: Fix enqueue_task_fair warning")
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com
2020-05-19 20:34:10 +02:00
Daniel Borkmann
1b66d25361 bpf: Add get{peer, sock}name attach types for sock_addr
As stated in 983695fa67 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.

This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.

The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.

Simple example:

  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80

Before; curl's verbose output example, no getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

After; with getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
  > GET / HTTP/1.1
  >  Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.

  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19 11:32:04 -07:00
Domenico Andreoli
c4f39a6c74 PM: hibernate: Split off snapshot dev option
Make it possible to reduce the attack surface in case the snapshot
device is not to be used from userspace.

Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19 17:48:08 +02:00
Domenico Andreoli
ab7e9b067f PM: hibernate: Incorporate concurrency handling
Hibernation concurrency handling is currently delegated to user.c,
where it's also used for regulating the access to the snapshot device.

In the prospective of making user.c a separate configuration option,
such mutual exclusion is brought into hibernate.c and made available
through accessor helpers hereby introduced.

Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-19 17:48:08 +02:00
David Howells
e7d553d69c pipe: Add notification lossage handling
Add handling for loss of notifications by having read() insert a
loss-notification message after it has read the pipe buffer that was last
in the ring when the loss occurred.

Lossage can come about either by running out of notification descriptors or
by running out of space in the pipe ring.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19 15:40:28 +01:00
David Howells
8cfba76383 pipe: Allow buffers to be marked read-whole-or-error for notifications
Allow a buffer to be marked such that read() must return the entire buffer
in one go or return ENOBUFS.  Multiple buffers can be amalgamated into a
single read, but a short read will occur if the next "whole" buffer won't
fit.

This is useful for watch queue notifications to make sure we don't split a
notification across multiple reads, especially given that we need to
fabricate an overrun record under some circumstances - and that isn't in
the buffers.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19 15:38:18 +01:00
David Howells
c73be61ced pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe.  Notifications are 'spliced' into the pipe and then read
out.  splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex.  This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.

The way the notification queue is used is:

 (1) An application opens a pipe with a special flag and indicates the
     number of messages it wishes to be able to queue at once (this can
     only be set once):

	pipe2(fds, O_NOTIFICATION_PIPE);
	ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);

 (2) The application then uses poll() and read() as normal to extract data
     from the pipe.  read() will return multiple notifications if the
     buffer is big enough, but it will not split a notification across
     buffers - rather it will return a short read or EMSGSIZE.

     Notification messages include a length in the header so that the
     caller can split them up.

Each message has a header that describes it:

	struct watch_notification {
		__u32	type:24;
		__u32	subtype:8;
		__u32	info;
	};

The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink).  The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.

Supplementary data, such as the key ID that generated an event, can be
attached in additional slots.  The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19 15:08:24 +01:00
Thomas Gleixner
66e9b07171 kprobes: Prevent probes in .noinstr.text section
Instrumentation is forbidden in the .noinstr.text section. Make kprobes
respect this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200505134100.179862032@linutronix.de
2020-05-19 15:56:20 +02:00
Thomas Gleixner
b1fcf9b83c rcu: Provide __rcu_is_watching()
Same as rcu_is_watching() but without the preempt_disable/enable() pair
inside the function. It is merked noinstr so it ends up in the
non-instrumentable text section.

This is useful for non-preemptible code especially in the low level entry
section. Using rcu_is_watching() there results in a call to the
preempt_schedule_notrace() thunk which triggers noinstr section warnings in
objtool.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512213810.518709291@linutronix.de
2020-05-19 15:51:21 +02:00
Thomas Gleixner
8ae0ae6737 rcu: Provide rcu_irq_exit_preempt()
Interrupts and exceptions invoke rcu_irq_enter() on entry and need to
invoke rcu_irq_exit() before they either return to the interrupted code or
invoke the scheduler due to preemption.

The general assumption is that RCU idle code has to have preemption
disabled so that a return from interrupt cannot schedule. So the return
from interrupt code invokes rcu_irq_exit() and preempt_schedule_irq().

If there is any imbalance in the rcu_irq/nmi* invocations or RCU idle code
had preemption enabled then this goes unnoticed until the CPU goes idle or
some other RCU check is executed.

Provide rcu_irq_exit_preempt() which can be invoked from the
interrupt/exception return code in case that preemption is enabled. It
invokes rcu_irq_exit() and contains a few sanity checks in case that
CONFIG_PROVE_RCU is enabled to catch such issues directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134904.364456424@linutronix.de
2020-05-19 15:51:21 +02:00
Paul E. McKenney
9ea366f669 rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
"irq" parameter that indicates whether these functions have been invoked from
an irq handler (irq==true) or an NMI handler (irq==false).

However, recent changes have applied notrace to a few critical functions
such that rcu_nmi_enter_common() and rcu_nmi_exit_common() many now rely on
in_nmi().  Note that in_nmi() works no differently than before, but rather
that tracing is now prohibited in code regions where in_nmi() would
incorrectly report NMI state.

Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and
rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(),
respectively.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.617130349@linutronix.de
2020-05-19 15:51:21 +02:00
Thomas Gleixner
ff5c4f5cad rcu/tree: Mark the idle relevant functions noinstr
These functions are invoked from context tracking and other places in the
low level entry code. Move them into the .noinstr.text section to exclude
them from instrumentation.

Mark the places which are safe to invoke traceable functions with
instrumentation_begin/end() so objtool won't complain.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de
2020-05-19 15:51:20 +02:00
Peter Zijlstra
178ba00c35 sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(),
remove it from the generic code and into the SuperH code.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de
2020-05-19 15:51:18 +02:00
Peter Zijlstra
e616cb8daa lockdep: Always inline lockdep_{off,on}()
These functions are called {early,late} in nmi_{enter,exit} and should
not be traced or probed. They are also puny, so 'inline' them.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.048523500@linutronix.de
2020-05-19 15:51:18 +02:00
Peter Zijlstra
b0f51883f5 printk: Disallow instrumenting print_nmi_enter()
It happens early in nmi_enter(), no tracing, probing or other funnies
allowed. Specifically as nmi_enter() will be used in do_debug(), which
would cause recursive exceptions when kprobed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.139720912@linutronix.de
2020-05-19 15:51:16 +02:00
Petr Mladek
8c4e93c362 printk: Prepare for nested printk_nmi_enter()
There is plenty of space in the printk_context variable. Reserve one byte
there for the NMI context to be on the safe side.

It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter()
will trigger much earlier.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.681374113@linutronix.de
2020-05-19 15:51:16 +02:00
Thomas Gleixner
1ed0948eea Merge tag 'noinstr-lds-2020-05-19' into core/rcu
Get the noinstr section and annotation markers to base the RCU parts on.
2020-05-19 15:50:34 +02:00
Peter Zijlstra
c86e9b987c lockdep: Prepare for noinstr sections
Force inlining and prevent instrumentation of all sorts by marking the
functions which are invoked from low level entry code with 'noinstr'.

Split the irqflags tracking into two parts. One which does the heavy
lifting while RCU is watching and the final one which can be invoked after
RCU is turned off.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de
2020-05-19 15:47:21 +02:00
Thomas Gleixner
0995a5dfbe tracing: Provide lockdep less trace_hardirqs_on/off() variants
trace_hardirqs_on/off() is only partially safe vs. RCU idle. The tracer
core itself is safe, but the resulting tracepoints can be utilized by
e.g. BPF which is unsafe.

Provide variants which do not contain the lockdep invocation so the lockdep
and tracer invocations can be split at the call site and placed
properly. This is required because lockdep needs to be aware of the state
before switching away from RCU idle and after switching to RCU idle because
these transitions can take locks.

As these code pathes are going to be non-instrumentable the tracer can be
invoked after RCU is turned on and before the switch to RCU idle. So for
these new variants there is no need to invoke the rcuidle aware tracer
functions.

Name them so they match the lockdep counterparts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134100.270771162@linutronix.de
2020-05-19 15:47:21 +02:00
Alexey Gladkov
9d78edeaec proc: proc_pid_ns takes super_block as an argument
syzbot found that

  touch /proc/testfile

causes NULL pointer dereference at tomoyo_get_local_path()
because inode of the dentry is NULL.

Before c59f415a7c, Tomoyo received pid_ns from proc's s_fs_info
directly. Since proc_pid_ns() can only work with inode, using it in
the tomoyo_get_local_path() was wrong.

To avoid creating more functions for getting proc_ns, change the
argument type of the proc_pid_ns() function. Then, Tomoyo can use
the existing super_block to get pid_ns.

Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com
Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com
Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com
Fixes: c59f415a7c ("Use proc_pid_ns() to get pid_namespace from the proc superblock")
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-05-19 07:07:50 -05:00
Vincent Whitchurch
2318976619 ARM: 8976/1: module: allow arch overrides for .init section names
ARM stores unwind information for .init.text in sections named
.ARM.extab.init.text and .ARM.exidx.init.text.  Since those aren't
currently recognized as init sections, they're allocated along with the
core section, and relocation fails if the core and the init section are
allocated from different regions and can't reach other.

  final section addresses:
        ...
        0x7f800000 .init.text
        ..
        0xcbb54078 .ARM.exidx.init.text
        ..

 section 16 reloc 0 sym '': relocation 42 out of range (0xcbb54078 ->
 0x7f800000)

Allow architectures to override the section name so that ARM can fix
this.

Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
2020-05-19 11:42:16 +01:00
Vincent Chen
f83b04d36e
kgdb: Add kgdb_has_hit_break function
The break instruction in RISC-V does not have an immediate value field, so
the kernel cannot identify the purpose of each trap exception through the
opcode. This makes the existing identification schemes in other
architecture unsuitable for the RISC-V kernel. To solve this problem, this
patch adds kgdb_has_hit_break(), which can help RISC-V kernel identify
the KGDB trap exception.

Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2020-05-18 11:38:09 -07:00
Douglas Anderson
220995622d kgdboc: Add kgdboc_earlycon to support early kgdb using boot consoles
We want to enable kgdb to debug the early parts of the kernel.
Unfortunately kgdb normally is a client of the tty API in the kernel
and serial drivers don't register to the tty layer until fairly late
in the boot process.

Serial drivers do, however, commonly register a boot console.  Let's
enable the kgdboc driver to work with boot consoles to provide early
debugging.

This change co-opts the existing read() function pointer that's part
of "struct console".  It's assumed that if a boot console (with the
flag CON_BOOT) has implemented read() that both the read() and write()
function are polling functions.  That means they work without
interrupts and read() will return immediately (with 0 bytes read) if
there's nothing to read.  This should be a safe assumption since it
appears that no current boot consoles implement read() right now and
there seems no reason to do so unless they wanted to support
"kgdboc_earlycon".

The normal/expected way to make all this work is to use
"kgdboc_earlycon" and "kgdboc" together.  You should point them both
to the same physical serial connection.  At boot time, as the system
transitions from the boot console to the normal console (and registers
a tty), kgdb will switch over.

One awkward part of all this, though, is that there can be a window
where the boot console goes away and we can't quite transtion over to
the main kgdboc that uses the tty layer.  There are two main problems:

1. The act of registering the tty doesn't cause any call into kgdboc
   so there is a window of time when the tty is there but kgdboc's
   init code hasn't been called so we can't transition to it.

2. On some serial drivers the normal console inits (and replaces the
   boot console) quite early in the system.  Presumably these drivers
   were coded up before earlycon worked as well as it does today and
   probably they don't need to do this anymore, but it causes us
   problems nontheless.

Problem #1 is not too big of a deal somewhat due to the luck of probe
ordering.  kgdboc is last in the tty/serial/Makefile so its probe gets
right after all other tty devices.  It's not fun to rely on this, but
it does work for the most part.

Problem #2 is a big deal, but only for some serial drivers.  Other
serial drivers end up registering the console (which gets rid of the
boot console) and tty at nearly the same time.

The way we'll deal with the window when the system has stopped using
the boot console and the time when we're setup using the tty is to
keep using the boot console.  This may sound surprising, but it has
been found to work well in practice.  If it doesn't work, it shouldn't
be too hard for a given serial driver to make it keep working.
Specifically, it's expected that the read()/write() function provided
in the boot console should be the same (or nearly the same) as the
normal kgdb polling functions.  That means continuing to use them
should work just fine.  To make things even more likely to work work
we'll also trap the recently added exit() function in the boot console
we're using and delay any calls to it until we're all done with the
boot console.

NOTE: there could be ways to use all this in weird / unexpected ways.
If you do something like this, it's a bit of a buyer beware situation.
Specifically:
- If you specify only "kgdboc_earlycon" but not "kgdboc" then
  (depending on your serial driver) things will probably work OK, but
  you'll get a warning printed the first time you use kgdb after the
  boot console is gone.  You'd only be able to do this, of course, if
  the serial driver you're running atop provided an early boot console.
- If your "kgdboc_earlycon" and "kgdboc" devices are not the same
  device things should work OK, but it'll be your job to switch over
  which device you're monitoring (including figuring out how to switch
  over gdb in-flight if you're using it).

When trying to enable "kgdboc_earlycon" it should be noted that the
names that are registered through the boot console layer and the tty
layer are not the same for the same port.  For example when debugging
on one board I'd need to pass "kgdboc_earlycon=qcom_geni
kgdboc=ttyMSM0" to enable things properly.  Since digging up the boot
console name is a pain and there will rarely be more than one boot
console enabled, you can provide the "kgdboc_earlycon" parameter
without specifying the name of the boot console.  In this case we'll
just pick the first boot that implements read() that we find.

This new "kgdboc_earlycon" parameter should be contrasted to the
existing "ekgdboc" parameter.  While both provide a way to debug very
early, the usage and mechanisms are quite different.  Specifically
"kgdboc_earlycon" is meant to be used in tandem with "kgdboc" and
there is a transition from one to the other.  The "ekgdboc" parameter,
on the other hand, replaces the "kgdboc" parameter.  It runs the same
logic as the "kgdboc" parameter but just relies on your TTY driver
being present super early.  The only known usage of the old "ekgdboc"
parameter is documented as "ekgdboc=kbd earlyprintk=vga".  It should
be noted that "kbd" has special treatment allowing it to init early as
a tty device.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/20200507130644.v4.8.I8fba5961bf452ab92350654aa61957f23ecf0100@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18 17:49:27 +01:00
Douglas Anderson
3ca676e4ca kgdb: Prevent infinite recursive entries to the debugger
If we detect that we recursively entered the debugger we should hack
our I/O ops to NULL so that the panic() in the next line won't
actually cause another recursion into the debugger.  The first line of
kgdb_panic() will check this and return.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200507130644.v4.6.I89de39f68736c9de610e6f241e68d8dbc44bc266@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18 17:49:27 +01:00
Douglas Anderson
b1a57bbfcc kgdb: Delay "kgdbwait" to dbg_late_init() by default
Using kgdb requires at least some level of architecture-level
initialization.  If nothing else, it relies on the architecture to
pass breakpoints / crashes onto kgdb.

On some architectures this all works super early, specifically it
starts working at some point in time before Linux parses
early_params's.  On other architectures it doesn't.  A survey of a few
platforms:

a) x86: Presumably it all works early since "ekgdboc" is documented to
   work here.
b) arm64: Catching crashes works; with a simple patch breakpoints can
   also be made to work.
c) arm: Nothing in kgdb works until
   paging_init() -> devicemaps_init() -> early_trap_init()

Let's be conservative and, by default, process "kgdbwait" (which tells
the kernel to drop into the debugger ASAP at boot) a bit later at
dbg_late_init() time.  If an architecture has tested it and wants to
re-enable super early debugging, they can select the
ARCH_HAS_EARLY_DEBUG KConfig option.  We'll do this for x86 to start.
It should be noted that dbg_late_init() is still called quite early in
the system.

Note that this patch doesn't affect when kgdb runs its init.  If kgdb
is set to initialize early it will still initialize when parsing
early_param's.  This patch _only_ inhibits the initial breakpoint from
"kgdbwait".  This means:

* Without any extra patches arm64 platforms will at least catch
  crashes after kgdb inits.
* arm platforms will catch crashes (and could handle a hardcoded
  kgdb_breakpoint()) any time after early_trap_init() runs, even
  before dbg_late_init().

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200507130644.v4.4.I3113aea1b08d8ce36dc3720209392ae8b815201b@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18 17:49:27 +01:00
Will Deacon
aa7a65ae5b scs: Remove references to asm/scs.h from core code
asm/scs.h is no longer needed by the core code, so remove a redundant
header inclusion and update the stale Kconfig text.

Tested-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18 17:47:45 +01:00
Will Deacon
88485be531 scs: Move scs_overflow_check() out of architecture code
There is nothing architecture-specific about scs_overflow_check() as
it's just a trivial wrapper around scs_corrupted().

For parity with task_stack_end_corrupted(), rename scs_corrupted() to
task_scs_end_corrupted() and call it from schedule_debug() when
CONFIG_SCHED_STACK_END_CHECK_is enabled, which better reflects its
purpose as a debug feature to catch inadvertent overflow of the SCS.
Finally, remove the unused scs_overflow_check() function entirely.

This has absolutely no impact on architectures that do not support SCS
(currently arm64 only).

Tested-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18 17:47:40 +01:00
Will Deacon
bee348fab0 scs: Move accounting into alloc/free functions
There's no need to perform the shadow stack page accounting independently
of the lifetime of the underlying allocation, so call the accounting code
from the {alloc,free}() functions and simplify the code in the process.

Tested-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18 17:47:33 +01:00
Will Deacon
51189c7a7e arm64: scs: Store absolute SCS stack pointer value in thread_info
Storing the SCS information in thread_info as a {base,offset} pair
introduces an additional load instruction on the ret-to-user path,
since the SCS stack pointer in x18 has to be converted back to an offset
by subtracting the base.

Replace the offset with the absolute SCS stack pointer value instead
and avoid the redundant load.

Tested-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-18 17:47:22 +01:00
Douglas Anderson
202164fbfa kgdb: Disable WARN_CONSOLE_UNLOCKED for all kgdb
In commit 81eaadcae8 ("kgdboc: disable the console lock when in
kgdb") we avoided the WARN_CONSOLE_UNLOCKED() yell when we were in
kgdboc.  That still works fine, but it turns out that we get a similar
yell when using other I/O drivers.  One example is the "I/O driver"
for the kgdb test suite (kgdbts).  When I enabled that I again got the
same yells.

Even though "kgdbts" doesn't actually interact with the user over the
console, using it still causes kgdb to print to the consoles.  That
trips the same warning:
  con_is_visible+0x60/0x68
  con_scroll+0x110/0x1b8
  lf+0x4c/0xc8
  vt_console_print+0x1b8/0x348
  vkdb_printf+0x320/0x89c
  kdb_printf+0x68/0x90
  kdb_main_loop+0x190/0x860
  kdb_stub+0x2cc/0x3ec
  kgdb_cpu_enter+0x268/0x744
  kgdb_handle_exception+0x1a4/0x200
  kgdb_compiled_brk_fn+0x34/0x44
  brk_handler+0x7c/0xb8
  do_debug_exception+0x1b4/0x228

Let's increment/decrement the "ignore_console_lock_warning" variable
all the time when we enter the debugger.

This will allow us to later revert commit 81eaadcae8 ("kgdboc:
disable the console lock when in kgdb").

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20200507130644.v4.1.Ied2b058357152ebcc8bf68edd6f20a11d98d7d4e@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-18 15:28:16 +01:00
Ravi Bangoria
29da4f91c0 powerpc/watchpoint: Don't allow concurrent perf and ptrace events
With Book3s DAWR, ptrace and perf watchpoints on powerpc behaves
differently. Ptrace watchpoint works in one-shot mode and generates
signal before executing instruction. It's ptrace user's job to
single-step the instruction and re-enable the watchpoint. OTOH, in
case of perf watchpoint, kernel emulates/single-steps the instruction
and then generates event. If perf and ptrace creates two events with
same or overlapping address ranges, it's ambiguous to decide who
should single-step the instruction. Because of this issue, don't
allow perf and ptrace watchpoint at the same time if their address
range overlaps.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Michael Neuling <mikey@neuling.org>
Link: https://lore.kernel.org/r/20200514111741.97993-15-ravi.bangoria@linux.ibm.com
2020-05-19 00:14:45 +10:00
Jonathan Corbet
fdb1b5e089 Revert "docs: sysctl/kernel: document ngroups_max"
This reverts commit 2f4c33063a.

The changes here were fine, but there's a non-documentation change to
sysctl.c that makes messes elsewhere; those changes should have been done
independently.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-05-18 06:19:25 -06:00
Randy Dunlap
c1a371cf80 printk: fix global comment
Fix typo/spello.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-18 12:41:32 +02:00
Bartosz Golaszewski
337cbeb2c1 genirq/irq_sim: Simplify the API
The interrupt simulator API exposes a lot of custom data structures and
functions and doesn't reuse the interfaces already exposed by the irq
subsystem. This patch tries to address it.

We hide all the simulator-related data structures from users and instead
rely on the well-known irq domain. When creating the interrupt simulator
the user receives a pointer to a newly created irq_domain and can use it
to create mappings for simulated interrupts.

It is also possible to pass a handle to fwnode when creating the simulator
domain and retrieve it using irq_find_matching_fwnode().

The irq_sim_fire() function is dropped as well. Instead we implement the
irq_get/set_irqchip_state interface.

We modify the two modules that use the simulator at the same time as
adding these changes in order to reduce the intermediate bloat that would
result when trying to migrate the drivers in separate patches.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> #for IIO
Link: https://lore.kernel.org/r/20200514083901.23445-3-brgl@bgdev.pl
2020-05-18 10:30:21 +01:00
Bartosz Golaszewski
5c8f77a278 irqdomain: Make irq_domain_reset_irq_data() available to non-hierarchical users
irq_domain_reset_irq_data() doesn't modify the parent data, so it can be
made available even if irq domain hierarchy is not being built. We'll
subsequently use it in irq_sim code.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20200514083901.23445-2-brgl@bgdev.pl
2020-05-18 10:29:26 +01:00
Jan Kara
870c153cf0 blktrace: Report pid with note messages
Currently informational messages within block trace do not have PID
information of the process reporting the message included. With BFQ it
is sometimes useful to have the information and there's no good reason
to omit the information from the trace. So just fill in pid information
when generating note message.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-16 14:29:39 -06:00
Daniel Borkmann
2ec0616e87 bpf: Fix check_return_code to only allow [0,1] in trace_iter progs
As per 15d83c4d7c ("bpf: Allow loading of a bpf_iter program") we only
allow a range of [0,1] for return codes. Therefore BPF_TRACE_ITER relies
on the default tnum_range(0, 1) which is set in range var. On recent merge
of net into net-next commit e92888c72f ("bpf: Enforce returning 0 for
fentry/fexit progs") got pulled in and caused a merge conflict with the
changes from 15d83c4d7c. The resolution had a snall hiccup in that it
removed the [0,1] range restriction again so that BPF_TRACE_ITER would
have no enforcement. Fix it by adding it back.

Fixes: da07f52d3c ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2020-05-16 00:48:02 +02:00
David S. Miller
da07f52d3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 13:48:59 -07:00
Linus Torvalds
f85c1598dd Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix sk_psock reference count leak on receive, from Xiyu Yang.

 2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.

 3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this,
    from Maciej Żenczykowski.

 4) ipv4 route redirect backoff wasn't actually enforced, from Paolo
    Abeni.

 5) Fix netprio cgroup v2 leak, from Zefan Li.

 6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.

 7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.

 8) Various bpf probe handling fixes, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
  selftests: mptcp: pm: rm the right tmp file
  dpaa2-eth: properly handle buffer size restrictions
  bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
  bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
  bpf: Restrict bpf_probe_read{, str}() only to archs where they work
  MAINTAINERS: Mark networking drivers as Maintained.
  ipmr: Add lockdep expression to ipmr_for_each_table macro
  ipmr: Fix RCU list debugging warning
  drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
  net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
  tcp: fix error recovery in tcp_zerocopy_receive()
  MAINTAINERS: Add Jakub to networking drivers.
  MAINTAINERS: another add of Karsten Graul for S390 networking
  drivers: ipa: fix typos for ipa_smp2p structure doc
  pppoe: only process PADT targeted at local interfaces
  selftests/bpf: Enforce returning 0 for fentry/fexit programs
  bpf: Enforce returning 0 for fentry/fexit progs
  net: stmmac: fix num_por initialization
  security: Fix the default value of secid_to_secctx hook
  libbpf: Fix register naming in PT_REGS s390 macros
  ...
2020-05-15 13:10:06 -07:00
Douglas Anderson
b5945214b7 kernel/cpu_pm: Fix uninitted local in cpu_pm
cpu_pm_notify() is basically a wrapper of notifier_call_chain().
notifier_call_chain() doesn't initialize *nr_calls to 0 before it
starts incrementing it--presumably it's up to the callers to do this.

Unfortunately the callers of cpu_pm_notify() don't init *nr_calls.
This potentially means you could get too many or two few calls to
CPU_PM_ENTER_FAILED or CPU_CLUSTER_PM_ENTER_FAILED depending on the
luck of the stack.

Let's fix this.

Fixes: ab10023e00 ("cpu_pm: Add cpu power management notifiers")
Cc: stable@vger.kernel.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200504104917.v6.3.I2d44fc0053d019f239527a4e5829416714b7e299@changeid
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2020-05-15 11:44:34 -07:00
Stephen Kitt
2f4c33063a docs: sysctl/kernel: document ngroups_max
This is a read-only export of NGROUPS_MAX, so this patch also changes
the declarations in kernel/sysctl.c to const.

Signed-off-by: Stephen Kitt <steve@sk2.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20200515160222.7994-1-steve@sk2.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-05-15 11:24:25 -06:00
Sami Tolvanen
5bbaf9d1fc scs: Add support for stack usage debugging
Implements CONFIG_DEBUG_STACK_USAGE for shadow stacks. When enabled,
also prints out the highest shadow stack usage per process.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
[will: rewrote most of scs_check_usage()]
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15 16:35:49 +01:00
Sami Tolvanen
628d06a48f scs: Add page accounting for shadow call stack allocations
This change adds accounting for the memory allocated for shadow stacks.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15 16:35:49 +01:00
Sami Tolvanen
d08b9f0ca6 scs: Add support for Clang's Shadow Call Stack (SCS)
This change adds generic support for Clang's Shadow Call Stack,
which uses a shadow stack to protect return addresses from being
overwritten by an attacker. Details are available here:

  https://clang.llvm.org/docs/ShadowCallStack.html

Note that security guarantees in the kernel differ from the ones
documented for user space. The kernel must store addresses of
shadow stacks in memory, which means an attacker capable reading
and writing arbitrary memory may be able to locate them and hijack
control flow by modifying the stacks.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
[will: Numerous cosmetic changes]
Signed-off-by: Will Deacon <will@kernel.org>
2020-05-15 16:35:45 +01:00
Alexei Starovoitov
2c78ee898d bpf: Implement CAP_BPF
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.

That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
2020-05-15 17:29:41 +02:00
Daniel Borkmann
b2a5212fb6 bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the
very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on
archs with overlapping address ranges.

While the helpers have been addressed through work in 6ae08ae3de ("bpf: Add
probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need
an option for bpf_trace_printk() as well to fix it.

Similarly as with the helpers, force users to make an explicit choice by adding
%pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding
strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS.
The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended
for other objects aside strings that are probed and printed under tracing, and
reused out of other facilities like bpf_seq_printf() or BTF based type printing.

Existing behavior of %s for current users is still kept working for archs where it
is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
For archs not having this property we fall-back to pick probing under KERNEL_DS as
a sensible default.

Fixes: 8d3b7dce86 ("bpf: add support for %s specifier to bpf_trace_printk()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net
2020-05-15 08:10:36 -07:00
Daniel Borkmann
47cc0ed574 bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
Given bpf_probe_read{,str}() BPF helpers are now only available under
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in
replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range()
as well to avoid hitting the same issue as in 849fa50662 ("bpf/verifier:
refine retval R0 state for bpf_get_stack helper").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net
2020-05-15 08:10:36 -07:00
Daniel Borkmann
0ebeea8ca8 bpf: Restrict bpf_probe_read{, str}() only to archs where they work
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
with overlapping address ranges, we should really take the next step to
disable them from BPF use there.

To generally fix the situation, we've recently added new helper variants
bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
For details on them, see 6ae08ae3de ("bpf: Add probe_read_{user, kernel}
and probe_read_{user,kernel}_str helpers").

Given bpf_probe_read{,str}() have been around for ~5 years by now, there
are plenty of users at least on x86 still relying on them today, so we
cannot remove them entirely w/o breaking the BPF tracing ecosystem.

However, their use should be restricted to archs with non-overlapping
address ranges where they are working in their current form. Therefore,
move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
have x86, arm64, arm select it (other archs supporting it can follow-up
on it as well).

For the remaining archs, they can workaround easily by relying on the
feature probe from bpftool which spills out defines that can be used out
of BPF C code to implement the drop-in replacement for old/new kernels
via: bpftool feature probe macro

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
2020-05-15 08:10:36 -07:00
Emil Velikov
0ca650c430 rcu: constify sysrq_key_op
With earlier commits, the API no longer discards the const-ness of the
sysrq_key_op. As such we can add the notation.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: linux-kernel@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: rcu@vger.kernel.org
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Link: https://lore.kernel.org/r/20200513214351.2138580-11-emil.l.velikov@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-15 14:53:20 +02:00
Emil Velikov
6400b5a0f6 kernel/power: constify sysrq_key_op
With earlier commits, the API no longer discards the const-ness of the
sysrq_key_op. As such we can add the notation.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: linux-kernel@vger.kernel.org
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Link: https://lore.kernel.org/r/20200513214351.2138580-10-emil.l.velikov@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-15 14:53:20 +02:00
Emil Velikov
c69b470eb8 kdb: constify sysrq_key_op
With earlier commits, the API no longer discards the const-ness of the
sysrq_key_op. As such we can add the notation.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: linux-kernel@vger.kernel.org
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: kgdb-bugreport@lists.sourceforge.net
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Emil Velikov <emil.l.velikov@gmail.com>
Link: https://lore.kernel.org/r/20200513214351.2138580-9-emil.l.velikov@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-15 14:53:20 +02:00
Jesper Dangaard Brouer
db612f749e xdp: Cpumap redirect use frame_sz and increase skb_tailroom
Knowing the memory size backing the packet/xdp_frame data area, and
knowing it already have reserved room for skb_shared_info, simplifies
using build_skb significantly.

With this change we no-longer lie about the SKB truesize, but more
importantly a significant larger skb_tailroom is now provided, e.g. when
drivers uses a full PAGE_SIZE. This extra tailroom (in linear area) can be
used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce,
see TCP cases where tcp_queue_rcv() can 'eat' skb).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337822.97035.13557959180460986059.stgit@firesoul
2020-05-14 21:21:54 -07:00
David S. Miller
d00f26b623 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.

2) support for narrow loads in bpf_sock_addr progs and additional
   helpers in cg-skb progs, from Andrey.

3) bpf benchmark runner, from Andrii.

4) arm and riscv JIT optimizations, from Luke.

5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 20:31:21 -07:00
Andrii Nakryiko
c70f34a8ac bpf: Fix bpf_iter's task iterator logic
task_seq_get_next might stop prematurely if get_pid_task() fails to get
task_struct. Failure to do so doesn't mean that there are no more tasks with
higher pids. Procfs's iteration algorithm (see next_tgid in fs/proc/base.c)
does a retry in such case. After this fix, instead of stopping prematurely
after about 300 tasks on my server, bpf_iter program now returns >4000, which
sounds much closer to reality.

Fixes: eaaacd2391 ("bpf: Add task and task/file iterator targets")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200514055137.1564581-1-andriin@fb.com
2020-05-14 18:37:32 -07:00
Yonghong Song
e92888c72f bpf: Enforce returning 0 for fentry/fexit progs
Currently, tracing/fentry and tracing/fexit prog
return values are not enforced. In trampoline codes,
the fentry/fexit prog return values are ignored.
Let us enforce it to be 0 to avoid confusion and
allows potential future extension.

This patch also explicitly added return value
checking for tracing/raw_tp, tracing/fmod_ret,
and freplace programs such that these program
return values can be anything. The purpose are
two folds:
 1. to make it explicit about return value expectations
    for these programs in verifier.
 2. for tracing prog_type, if a future attach type
    is added, the default is -ENOTSUPP which will
    enforce to specify return value ranges explicitly.

Fixes: fec56f5890 ("bpf: Introduce BPF trampoline")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200514053206.1298415-1-yhs@fb.com
2020-05-14 12:50:10 -07:00
Andrii Nakryiko
333291ce50 bpf: Fix bug in mmap() implementation for BPF array map
mmap() subsystem allows user-space application to memory-map region with
initial page offset. This wasn't taken into account in initial implementation
of BPF array memory-mapping. This would result in wrong pages, not taking into
account requested page shift, being memory-mmaped into user-space. This patch
fixes this gap and adds a test for such scenario.

Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200512235925.3817805-1-andriin@fb.com
2020-05-14 12:40:04 -07:00
Linus Torvalds
8c1684bb81 for-linus-2020-05-13
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXrvi4AAKCRCRxhvAZXjc
 otubAPsFV2XnZykq94GRZMBqxP3CQepTykXDV4aryfrUDoV04wD/fFisS/i+R4Uq
 XvtMZzsFcm30QVT6IRfg1RY2OlOiMwc=
 =t8HD
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fix from Christian Brauner:
 "This contains a single fix for all exported legacy fork helpers to
  block accidental access to clone3() features in the upper 32 bits of
  their respective flags arguments.

  I got Cced on a glibc issue where someone reported consistent failures
  for the legacy clone() syscall on ppc64le when sign extension was
  performed (since the clone() syscall in glibc exposes the flags
  argument as an int whereas the kernel uses unsigned long).

  The legacy clone() syscall is odd in a bunch of ways and here two
  things interact:

   - First, legacy clone's flag argument is word-size dependent, i.e.
     it's an unsigned long whereas most system calls with flag arguments
     use int or unsigned int.

   - Second, legacy clone() ignores unknown and deprecated flags.

  The two of them taken together means that users on 64bit systems can
  pass garbage for the upper 32bit of the clone() syscall since forever
  and things would just work fine.

  The following program compiled on a 64bit kernel prior to v5.7-rc1
  will succeed and will fail post v5.7-rc1 with EBADF:

    int main(int argc, char *argv[])
    {
          pid_t pid;

          /* Note that legacy clone() has different argument ordering on
           * different architectures so this won't work everywhere.
           *
           * Only set the upper 32 bits.
           */
          pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
                        NULL, NULL, NULL, NULL);
          if (pid < 0)
                  exit(EXIT_FAILURE);
          if (pid == 0)
                  exit(EXIT_SUCCESS);
          if (wait(NULL) != pid)
                  exit(EXIT_FAILURE);

          exit(EXIT_SUCCESS);
    }

  Since legacy clone() couldn't be extended this was not a problem so
  far and nobody really noticed or cared since nothing in the kernel
  ever bothered to look at the upper 32 bits.

  But once we introduced clone3() and expanded the flag argument in
  struct clone_args to 64 bit we opened this can of worms. With the
  first flag-based extension to clone3() making use of the upper 32 bits
  of the flag argument we've effectively made it possible for the legacy
  clone() syscall to reach clone3() only flags on accident. The sign
  extension scenario is just the odd corner-case that we needed to
  figure this out.

  The reason we just realized this now and not already when we
  introduced CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that
  a valid cgroup file descriptor has been given - whereas
  CLONE_CLEAR_SIGHAND doesn't need to verify anything. It just silently
  resets the signal handlers to SIG_DFL.

  So the sign extension (or the user accidently passing garbage for the
  upper 32 bits) caused the CLONE_INTO_CGROUP bit to be raised and the
  kernel to error out when it didn't find a valid cgroup file
  descriptor.

  Note, I'm also capping kernel_thread()'s flag argument mainly because
  none of the new features make sense for kernel_thread() and we
  shouldn't risk them being accidently activated however unlikely. If we
  wanted to, we could even make kernel_thread() yell when an unknown
  flag has been set which it doesn't do right now. But it's not worth
  risking this in a bugfix imho"

* tag 'for-linus-2020-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fork: prevent accidental access to clone3 features
2020-05-14 11:52:28 -07:00
Linus Torvalds
f44d5c4890 Various tracing fixes:
- Fix a crash when having function tracing and function stack tracing on
    the command line. The ftrace trampolines are created as executable and
    read only. But the stack tracer tries to modify them with text_poke()
    which expects all kernel text to still be writable at boot.
    Keep the trampolines writable at boot, and convert them to read-only
    with the rest of the kernel.
 
  - A selftest was triggering in the ring buffer iterator code, that
    is no longer valid with the update of keeping the ring buffer
    writable while a iterator is reading. Just bail after three failed
    attempts to get an event and remove the warning and disabling of the
    ring buffer.
 
  - While modifying the ring buffer code, decided to remove all the
    unnecessary BUG() calls.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXr1CDhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsXcAQCoL229SBrtHsn4DUO7eAQRppUT3hNw
 RuKzvQ56+1GccQEAh8VGCeg89uMSK6imrTujEl6VmOUdbgrD5R96yiKoGQw=
 =vi+k
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull more tracing fixes from Steven Rostedt:
 "Various tracing fixes:

   - Fix a crash when having function tracing and function stack tracing
     on the command line.

     The ftrace trampolines are created as executable and read only. But
     the stack tracer tries to modify them with text_poke() which
     expects all kernel text to still be writable at boot. Keep the
     trampolines writable at boot, and convert them to read-only with
     the rest of the kernel.

   - A selftest was triggering in the ring buffer iterator code, that is
     no longer valid with the update of keeping the ring buffer writable
     while a iterator is reading.

     Just bail after three failed attempts to get an event and remove
     the warning and disabling of the ring buffer.

   - While modifying the ring buffer code, decided to remove all the
     unnecessary BUG() calls"

* tag 'trace-v5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Remove all BUG() calls
  ring-buffer: Don't deactivate the ring buffer on failed iterator reads
  x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
2020-05-14 11:46:52 -07:00
Steven Rostedt (VMware)
da4d401a6b ring-buffer: Remove all BUG() calls
There's a lot of checks to make sure the ring buffer is working, and if an
anomaly is detected, it safely shuts itself down. But there's a few cases
that it will call BUG(), which defeats the point of being safe (it crashes
the kernel when an anomaly is found!). There's no reason for them. Switch
them all to either WARN_ON_ONCE() (when no ring buffer descriptor is present),
or to RB_WARN_ON() (when a ring buffer descriptor is present).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14 08:51:02 -04:00
Steven Rostedt (VMware)
3d2353de81 ring-buffer: Don't deactivate the ring buffer on failed iterator reads
If the function tracer is running and the trace file is read (which uses the
ring buffer iterator), the iterator can get in sync with the writes, and
caues it to fail to find a page with content it can read three times. This
causes a warning and deactivation of the ring buffer code.

Looking at the other cases of failure to get an event, it appears that
there's a chance that the writer could cause them too. Since the iterator is
a "best effort" to read the ring buffer if there's an active writer (the
consumer reader is made for this case "see trace_pipe"), if it fails to get
an event after three tries, simply give up and return NULL. Don't warn, nor
disable the ring buffer on this failure.

Link: https://lore.kernel.org/r/20200429090508.GG5770@shao2-debian

Reported-by: kernel test robot <lkp@intel.com>
Fixes: ff84c50cfb ("ring-buffer: Do not die if rb_iter_peek() fails more than thrice")
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-14 08:50:51 -04:00
Yonghong Song
3c32cc1bce bpf: Enable bpf_iter targets registering ctx argument types
Commit b121b341e5 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
support") adds a field btf_id_or_null_non0_off to
bpf_prog->aux structure to indicate that the
first ctx argument is PTR_TO_BTF_ID reg_type and
all others are PTR_TO_BTF_ID_OR_NULL.
This approach does not really scale if we have
other different reg types in the future, e.g.,
a pointer to a buffer.

This patch enables bpf_iter targets registering ctx argument
reg types which may be different from the default one.
For example, for pointers to structures, the default reg_type
is PTR_TO_BTF_ID for tracing program. The target can register
a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can
be used by the verifier to enforce accesses.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Yonghong Song
ab2ee4fcb9 bpf: Change func bpf_iter_unreg_target() signature
Change func bpf_iter_unreg_target() parameter from target
name to target reg_info, similar to bpf_iter_reg_target().

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180220.2949737-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Yonghong Song
15172a46fa bpf: net: Refactor bpf_iter target registration
Currently bpf_iter_reg_target takes parameters from target
and allocates memory to save them. This is really not
necessary, esp. in the future we may grow information
passed from targets to bpf_iter manager.

The patch refactors the code so target reg_info
becomes static and bpf_iter manager can just take
a reference to it.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Yonghong Song
2e3ed68bfc bpf: Add comments to interpret bpf_prog return values
Add a short comment in bpf_iter_run_prog() function to
explain how bpf_prog return value is converted to
seq_ops->show() return value:
  bpf_prog return           seq_ops()->show() return
     0                         0
     1                         -EAGAIN

When show() return value is -EAGAIN, the current
bpf_seq_read() will end. If the current seq_file buffer
is empty, -EAGAIN will return to user space. Otherwise,
the buffer will be copied to user space.
In both cases, the next bpf_seq_read() call will
try to show the same object which returned -EAGAIN
previously.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180218.2949517-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Davidlohr Bueso
9d9a6ebfea rcuwait: Let rcuwait_wake_up() return whether or not a task was awoken
Propagating the return value of wake_up_process() back to the caller
can come in handy for future users, such as for statistics or
accounting purposes.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Message-Id: <20200424054837.5138-3-dave@stgolabs.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:52 -04:00
Davidlohr Bueso
c9d64a1b2d rcuwait: Fix stale wake call name in comment
The 'trywake' name was renamed to simply 'wake', update the comment.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Message-Id: <20200424054837.5138-2-dave@stgolabs.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:51 -04:00
Christian Brauner
303cc571d1
nsproxy: attach to namespaces via pidfds
For quite a while we have been thinking about using pidfds to attach to
namespaces. This patchset has existed for about a year already but we've
wanted to wait to see how the general api would be received and adopted.
Now that more and more programs in userspace have started using pidfds
for process management it's time to send this one out.

This patch makes it possible to use pidfds to attach to the namespaces
of another process, i.e. they can be passed as the first argument to the
setns() syscall. When only a single namespace type is specified the
semantics are equivalent to passing an nsfd. That means
setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
when a pidfd is passed, multiple namespace flags can be specified in the
second setns() argument and setns() will attach the caller to all the
specified namespaces all at once or to none of them. Specifying 0 is not
valid together with a pidfd.

Here are just two obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing to also attach subsets of namespaces supports various use-cases
where callers setns to a subset of namespaces to retain privilege, perform
an action and then re-attach another subset of namespaces.

If the need arises, as Eric suggested, we can extend this patchset to
assume even more context than just attaching all namespaces. His suggestion
specifically was about assuming the process' root directory when
setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just
keep it flexible in terms of supporting subsets of namespaces but let's
wait until we have users asking for even more context to be assumed. At
that point we can add an extension.

The obvious example where this is useful is a standard container
manager interacting with a running container: pushing and pulling files
or directories, injecting mounts, attaching/execing any kind of process,
managing network devices all these operations require attaching to all
or at least multiple namespaces at the same time. Given that nowadays
most containers are spawned with all namespaces enabled we're currently
looking at at least 14 syscalls, 7 to open the /proc/<pid>/ns/<ns>
nsfds, another 7 to actually perform the namespace switch. With time
namespaces we're looking at about 16 syscalls.
(We could amortize the first 7 or 8 syscalls for opening the nsfds by
 stashing them in each container's monitor process but that would mean
 we need to send around those file descriptors through unix sockets
 everytime we want to interact with the container or keep on-disk
 state. Even in scenarios where a caller wants to join a particular
 namespace in a particular order callers still profit from batching
 other namespaces. That mostly applies to the user namespace but
 all container runtimes I found join the user namespace first no matter
 if it privileges or deprivileges the container similar to how unshare
 behaves.)
With pidfds this becomes a single syscall no matter how many namespaces
are supposed to be attached to.

A decently designed, large-scale container manager usually isn't the
parent of any of the containers it spawns so the containers don't die
when it crashes or needs to update or reinitialize. This means that
for the manager to interact with containers through pids is inherently
racy especially on systems where the maximum pid number is not
significicantly bumped. This is even more problematic since we often spawn
and manage thousands or ten-thousands of containers. Interacting with a
container through a pid thus can become risky quite quickly. Especially
since we allow for an administrator to enable advanced features such as
syscall interception where we're performing syscalls in lieu of the
container. In all of those cases we use pidfds if they are available and
we pass them around as stable references. Using them to setns() to the
target process' namespaces is as reliable as using nsfds. Either the
target process is already dead and we get ESRCH or we manage to attach
to its namespaces but we can't accidently attach to another process'
namespaces. So pidfds lend themselves to be used with this api.
The other main advantage is that with this change the pidfd becomes the
only relevant token for most container interactions and it's the only
token we need to create and send around.

Apart from significiantly reducing the number of syscalls from double
digit to single digit which is a decent reason post-spectre/meltdown
this also allows to switch to a set of namespaces atomically, i.e.
either attaching to all the specified namespaces succeeds or we fail. If
we fail we haven't changed a single namespace. There are currently three
namespaces that can fail (other than for ENOMEM which really is not
very interesting since we then have other problems anyway) for
non-trivial reasons, user, mount, and pid namespaces. We can fail to
attach to a pid namespace if it is not our current active pid namespace
or a descendant of it. We can fail to attach to a user namespace because
we are multi-threaded or because our current mount namespace shares
filesystem state with other tasks, or because we're trying to setns()
to the same user namespace, i.e. the target task has the same user
namespace as we do. We can fail to attach to a mount namespace because
it shares filesystem state with other tasks or because we fail to lookup
the new root for the new mount namespace. In most non-pathological
scenarios these issues can be somewhat mitigated. But there are cases where
we're half-attached to some namespace and failing to attach to another one.
I've talked about some of these problem during the hallway track (something
only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles
in 2018(?). Even if all these issues could be avoided with super careful
userspace coding it would be nicer to have this done in-kernel. Pidfds seem
to lend themselves nicely for this.

The other neat thing about this is that setns() becomes an actual
counterpart to the namespace bits of unshare().

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com
2020-05-13 11:41:22 +02:00
Steven Rostedt (VMware)
59566b0b62 x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up
Booting one of my machines, it triggered the following crash:

 Kernel/User page tables isolation: enabled
 ftrace: allocating 36577 entries in 143 pages
 Starting tracer 'function'
 BUG: unable to handle page fault for address: ffffffffa000005c
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0003) - permissions violation
 PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
 Oops: 0003 [#1] PREEMPT SMP PTI
 CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
 RIP: 0010:text_poke_early+0x4a/0x58
 Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff
0 41 57 49
 RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
 RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
 RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
 RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
 R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
 R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
 FS:  0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
 Call Trace:
  text_poke_bp+0x27/0x64
  ? mutex_lock+0x36/0x5d
  arch_ftrace_update_trampoline+0x287/0x2d5
  ? ftrace_replace_code+0x14b/0x160
  ? ftrace_update_ftrace_func+0x65/0x6c
  __register_ftrace_function+0x6d/0x81
  ftrace_startup+0x23/0xc1
  register_ftrace_function+0x20/0x37
  func_set_flag+0x59/0x77
  __set_tracer_option.isra.19+0x20/0x3e
  trace_set_options+0xd6/0x13e
  apply_trace_boot_options+0x44/0x6d
  register_tracer+0x19e/0x1ac
  early_trace_init+0x21b/0x2c9
  start_kernel+0x241/0x518
  ? load_ucode_intel_bsp+0x21/0x52
  secondary_startup_64+0xa4/0xb0

I was able to trigger it on other machines, when I added to the kernel
command line of both "ftrace=function" and "trace_options=func_stack_trace".

The cause is the "ftrace=function" would register the function tracer
and create a trampoline, and it will set it as executable and
read-only. Then the "trace_options=func_stack_trace" would then update
the same trampoline to include the stack tracer version of the function
tracer. But since the trampoline already exists, it updates it with
text_poke_bp(). The problem is that text_poke_bp() called while
system_state == SYSTEM_BOOTING, it will simply do a memcpy() and not
the page mapping, as it would think that the text is still read-write.
But in this case it is not, and we take a fault and crash.

Instead, lets keep the ftrace trampolines read-write during boot up,
and then when the kernel executable text is set to read-only, the
ftrace trampolines get set to read-only as well.

Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: stable@vger.kernel.org
Fixes: 768ae4406a ("x86/ftrace: Use text_poke()")
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-12 18:24:34 -04:00
Linus Torvalds
24085f70a6 Tracing fixes to previous fixes:
Unfortunately, the last set of fixes introduced some minor bugs:
 
  - The bootconfig apply_xbc() leak fix caused the application to return
    a positive number on success, when it should have returned zero.
 
  - The preempt_irq_delay_thread fix to make the creation code
    wait for the kthread to finish to prevent it from executing after
    module unload, can now cause the kthread to exit before it even
    executes (preventing it to run its tests).
 
  - The fix to the bootconfig that fixed the initrd to remove the
    bootconfig from causing the kernel to panic, now prints a warning
    that the bootconfig is not found, even when bootconfig is not
    on the command line.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXrq2ehQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrdjAQDGNaJa7Ft13KTDTNTioKmOorOi38vF
 ava4E3uBHl3StQD/anJmVq7Kk4WJFKGYemV6usbjDqy510PCFu/VQ1AbGQc=
 =hJvk
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Fixes to previous fixes.

  Unfortunately, the last set of fixes introduced some minor bugs:

   - The bootconfig apply_xbc() leak fix caused the application to
     return a positive number on success, when it should have returned
     zero.

   - The preempt_irq_delay_thread fix to make the creation code wait for
     the kthread to finish to prevent it from executing after module
     unload, can now cause the kthread to exit before it even executes
     (preventing it to run its tests).

   - The fix to the bootconfig that fixed the initrd to remove the
     bootconfig from causing the kernel to panic, now prints a warning
     that the bootconfig is not found, even when bootconfig is not on
     the command line"

* tag 'trace-v5.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  bootconfig: Fix to prevent warning message if no bootconfig option
  tracing: Wait for preempt irq delay thread to execute
  tools/bootconfig: Fix apply_xbc() to return zero on success
2020-05-12 11:06:26 -07:00
Masami Hiramatsu
16db6264c9 kprobes: Support NOKPROBE_SYMBOL() in modules
Support NOKPROBE_SYMBOL() in modules. NOKPROBE_SYMBOL() records only symbol
address in "_kprobe_blacklist" section in the module.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.771170126@linutronix.de
2020-05-12 17:15:32 +02:00
Masami Hiramatsu
1e6769b0ae kprobes: Support __kprobes blacklist in modules
Support __kprobes attribute for blacklist functions in modules.  The
__kprobes attribute functions are stored in .kprobes.text section.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.678201813@linutronix.de
2020-05-12 17:15:32 +02:00
Masami Hiramatsu
4fdd88877e kprobes: Lock kprobe_mutex while showing kprobe_blacklist
Lock kprobe_mutex while showing kprobe_blacklist to prevent updating the
kprobe_blacklist.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.571125195@linutronix.de
2020-05-12 17:15:31 +02:00
Thomas Gleixner
2a0a24ebb4 sched: Make scheduler_ipi inline
Now that the scheduler IPI is trivial and simple again there is no point to
have the little function out of line. This simplifies the effort of
constraining the instrumentation nicely.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134058.453581595@linutronix.de
2020-05-12 17:10:49 +02:00
Peter Zijlstra (Intel)
90b5363acd sched: Clean up scheduler_ipi()
The scheduler IPI has grown weird and wonderful over the years, time
for spring cleaning.

Move all the non-trivial stuff out of it and into a regular smp function
call IPI. This then reduces the schedule_ipi() to most of it's former NOP
glory and ensures to keep the interrupt vector lean and mean.

Aside of that avoiding the full irq_enter() in the x86 IPI implementation
is incorrect as scheduler_ipi() can be instrumented. To work around that
scheduler_ipi() had an irq_enter/exit() hack when heavy work was
pending. This is gone now.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134058.361859938@linutronix.de
2020-05-12 17:10:48 +02:00
Steven Rostedt (VMware)
8b1fac2e73 tracing: Wait for preempt irq delay thread to execute
A bug report was posted that running the preempt irq delay module on a slow
machine, and removing it quickly could lead to the thread created by the
modlue to execute after the module is removed, and this could cause the
kernel to crash. The fix for this was to call kthread_stop() after creating
the thread to make sure it finishes before allowing the module to be
removed.

Now this caused the opposite problem on fast machines. What now happens is
the kthread_stop() can cause the kthread never to execute and the test never
to run. To fix this, add a completion and wait for the kthread to execute,
then wait for it to end.

This issue caused the ftracetest selftests to fail on the preemptirq tests.

Link: https://lore.kernel.org/r/20200510114210.15d9e4af@oasis.local.home

Cc: stable@vger.kernel.org
Fixes: d16a8c3107 ("tracing: Wait for preempt irq delay thread to finish")
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-11 17:00:34 -04:00
Gustavo A. R. Silva
385bbf7b11 bpf, libbpf: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200507185057.GA13981@embeddedor
2020-05-11 16:56:47 +02:00
Dan Carpenter
b92b36eadf workqueue: Fix an use after free in init_rescuer()
We need to preserve error code before freeing "rescuer".

Fixes: f187b6974f ("workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-11 10:25:42 -04:00
Samuel Zou
a4ae16f65c livepatch: Make klp_apply_object_relocs static
Fix the following sparse warning:

kernel/livepatch/core.c:748:5: warning: symbol 'klp_apply_object_relocs' was
not declared.

The klp_apply_object_relocs() has only one call site within core.c;
it should be static

Fixes: 7c8e2bdd5f ("livepatch: Apply vmlinux-specific KLP relocations early")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Samuel Zou <zou_wei@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-11 00:31:38 +02:00
Xiaoming Ni
b6522fa409 parisc: add sysctl file interface panic_on_stackoverflow
The variable sysctl_panic_on_stackoverflow is used in
arch/parisc/kernel/irq.c and arch/x86/kernel/irq_32.c, but the sysctl file
interface panic_on_stackoverflow only exists on x86.

Add sysctl file interface panic_on_stackoverflow for parisc

Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Helge Deller <deller@gmx.de>
2020-05-10 22:47:35 +02:00
Yonghong Song
9c5f8a1008 bpf: Support variable length array in tracing programs
In /proc/net/ipv6_route, we have
  struct fib6_info {
    struct fib6_table *fib6_table;
    ...
    struct fib6_nh fib6_nh[0];
  }
  struct fib6_nh {
    struct fib_nh_common nh_common;
    struct rt6_info **rt6i_pcpu;
    struct rt6_exception_bucket *rt6i_exception_bucket;
  };
  struct fib_nh_common {
    ...
    u8 nhc_gw_family;
    ...
  }

The access:
  struct fib6_nh *fib6_nh = &rt->fib6_nh;
  ... fib6_nh->nh_common.nhc_gw_family ...

This patch ensures such an access is handled properly.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175916.2476853-1-yhs@fb.com
2020-05-09 17:05:27 -07:00
Yonghong Song
1d68f22b3d bpf: Handle spilled PTR_TO_BTF_ID properly when checking stack_boundary
This specifically to handle the case like below:
   // ptr below is a socket ptr identified by PTR_TO_BTF_ID
   u64 param[2] = { ptr, val };
   bpf_seq_printf(seq, fmt, sizeof(fmt), param, sizeof(param));

In this case, the 16 bytes stack for "param" contains:
   8 bytes for ptr with spilled PTR_TO_BTF_ID
   8 bytes for val as STACK_MISC

The current verifier will complain the ptr should not be visible
to the helper.
   ...
   16: (7b) *(u64 *)(r10 -64) = r2
   18: (7b) *(u64 *)(r10 -56) = r1
   19: (bf) r4 = r10
   ;
   20: (07) r4 += -64
   ; BPF_SEQ_PRINTF(seq, fmt1, (long)s, s->sk_protocol);
   21: (bf) r1 = r6
   22: (18) r2 = 0xffffa8d00018605a
   24: (b4) w3 = 10
   25: (b4) w5 = 16
   26: (85) call bpf_seq_printf#125
    R0=inv(id=0) R1_w=ptr_seq_file(id=0,off=0,imm=0)
    R2_w=map_value(id=0,off=90,ks=4,vs=144,imm=0) R3_w=inv10
    R4_w=fp-64 R5_w=inv16 R6=ptr_seq_file(id=0,off=0,imm=0)
    R7=ptr_netlink_sock(id=0,off=0,imm=0) R10=fp0 fp-56_w=mmmmmmmm
    fp-64_w=ptr_
   last_idx 26 first_idx 13
   regs=8 stack=0 before 25: (b4) w5 = 16
   regs=8 stack=0 before 24: (b4) w3 = 10
   invalid indirect read from stack off -64+0 size 16

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175915.2476783-1-yhs@fb.com
2020-05-09 17:05:27 -07:00
Yonghong Song
492e639f0c bpf: Add bpf_seq_printf and bpf_seq_write helpers
Two helpers bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.

bpf_seq_printf supports common format string flag/width/type
fields so at least I can get identical results for
netlink and ipv6_route targets.

For bpf_seq_printf and bpf_seq_write, return value -EOVERFLOW
specifically indicates a write failure due to overflow, which
means the object will be repeated in the next bpf invocation
if object collection stays the same. Note that if the object
collection is changed, depending how collection traversal is
done, even if the object still in the collection, it may not
be visited.

For bpf_seq_printf, format %s, %p{i,I}{4,6} needs to
read kernel memory. Reading kernel memory may fail in
the following two cases:
  - invalid kernel address, or
  - valid kernel address but requiring a major fault
If reading kernel memory failed, the %s string will be
an empty string and %p{i,I}{4,6} will be all 0.
Not returning error to bpf program is consistent with
what bpf_trace_printk() does for now.

bpf_seq_printf may return -EBUSY meaning that internal percpu
buffer for memory copy of strings or other pointees is
not available. Bpf program can return 1 to indicate it
wants the same object to be repeated. Right now, this should not
happen on no-RT kernels since migrate_disable(), which guards
bpf prog call, calls preempt_disable().

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175914.2476661-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
b121b341e5 bpf: Add PTR_TO_BTF_ID_OR_NULL support
Add bpf_reg_type PTR_TO_BTF_ID_OR_NULL support.
For tracing/iter program, the bpf program context
definition, e.g., for previous bpf_map target, looks like
  struct bpf_iter__bpf_map {
    struct bpf_iter_meta *meta;
    struct bpf_map *map;
  };

The kernel guarantees that meta is not NULL, but
map pointer maybe NULL. The NULL map indicates that all
objects have been traversed, so bpf program can take
proper action, e.g., do final aggregation and/or send
final report to user space.

Add btf_id_or_null_non0_off to prog->aux structure, to
indicate that if the context access offset is not 0,
set to PTR_TO_BTF_ID_OR_NULL instead of PTR_TO_BTF_ID.
This bit is set for tracing/iter program.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175912.2476576-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
eaaacd2391 bpf: Add task and task/file iterator targets
Only the tasks belonging to "current" pid namespace
are enumerated.

For task/file target, the bpf program will have access to
  struct task_struct *task
  u32 fd
  struct file *file
where fd/file is an open file for the task.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175911.2476407-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
6086d29def bpf: Add bpf_map iterator
Implement seq_file operations to traverse all bpf_maps.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175909.2476096-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
e5158d987b bpf: Implement common macros/helpers for target iterators
Macro DEFINE_BPF_ITER_FUNC is implemented so target
can define an init function to capture the BTF type
which represents the target.

The bpf_iter_meta is a structure holding meta data, common
to all targets in the bpf program.

Additional marker functions are called before or after
bpf_seq_read() show()/next()/stop() callback functions
to help calculate precise seq_num and whether call bpf_prog
inside stop().

Two functions, bpf_iter_get_info() and bpf_iter_run_prog(),
are implemented so target can get needed information from
bpf_iter infrastructure and can run the program.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175907.2475956-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
367ec3e483 bpf: Create file bpf iterator
To produce a file bpf iterator, the fd must be
corresponding to a link_fd assocciated with a
trace/iter program. When the pinned file is
opened, a seq_file will be generated.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175906.2475893-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
ac51d99bf8 bpf: Create anonymous bpf iterator
A new bpf command BPF_ITER_CREATE is added.

The anonymous bpf iterator is seq_file based.
The seq_file private data are referenced by targets.
The bpf_iter infrastructure allocated additional space
at seq_file->private before the space used by targets
to store some meta data, e.g.,
  prog:       prog to run
  session_id: an unique id for each opened seq_file
  seq_num:    how many times bpf programs are queried in this session
  done_stop:  an internal state to decide whether bpf program
              should be called in seq_ops->stop() or not

The seq_num will start from 0 for valid objects.
The bpf program may see the same seq_num more than once if
 - seq_file buffer overflow happens and the same object
   is retried by bpf_seq_read(), or
 - the bpf program explicitly requests a retry of the
   same object

Since module is not supported for bpf_iter, all target
registeration happens at __init time, so there is no
need to change bpf_iter_unreg_target() as it is used
mostly in error path of the init function at which time
no bpf iterators have been created yet.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175905.2475770-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
fd4f12bc38 bpf: Implement bpf_seq_read() for bpf iterator
bpf iterator uses seq_file to provide a lossless
way to transfer data to user space. But we want to call
bpf program after all objects have been traversed, and
bpf program may write additional data to the
seq_file buffer. The current seq_read() does not work
for this use case.

Besides allowing stop() function to write to the buffer,
the bpf_seq_read() also fixed the buffer size to one page.
If any single call of show() or stop() will emit data
more than one page to cause overflow, -E2BIG error code
will be returned to user space.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175904.2475468-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
2057c92bc9 bpf: Support bpf tracing/iter programs for BPF_LINK_UPDATE
Added BPF_LINK_UPDATE support for tracing/iter programs.
This way, a file based bpf iterator, which holds a reference
to the link, can have its bpf program updated without
creating new files.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175902.2475262-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
de4e05cac4 bpf: Support bpf tracing/iter programs for BPF_LINK_CREATE
Given a bpf program, the step to create an anonymous bpf iterator is:
  - create a bpf_iter_link, which combines bpf program and the target.
    In the future, there could be more information recorded in the link.
    A link_fd will be returned to the user space.
  - create an anonymous bpf iterator with the given link_fd.

The bpf_iter_link can be pinned to bpffs mount file system to
create a file based bpf iterator as well.

The benefit to use of bpf_iter_link:
  - using bpf link simplifies design and implementation as bpf link
    is used for other tracing bpf programs.
  - for file based bpf iterator, bpf_iter_link provides a standard
    way to replace underlying bpf programs.
  - for both anonymous and free based iterators, bpf link query
    capability can be leveraged.

The patch added support of tracing/iter programs for BPF_LINK_CREATE.
A new link type BPF_LINK_TYPE_ITER is added to facilitate link
querying. Currently, only prog_id is needed, so there is no
additional in-kernel show_fdinfo() and fill_link_info() hook
is needed for BPF_LINK_TYPE_ITER link.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175901.2475084-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
15d83c4d7c bpf: Allow loading of a bpf_iter program
A bpf_iter program is a tracing program with attach type
BPF_TRACE_ITER. The load attribute
  attach_btf_id
is used by the verifier against a particular kernel function,
which represents a target, e.g., __bpf_iter__bpf_map
for target bpf_map which is implemented later.

The program return value must be 0 or 1 for now.
  0 : successful, except potential seq_file buffer overflow
      which is handled by seq_file reader.
  1 : request to restart the same object

In the future, other return values may be used for filtering or
teminating the iterator.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175900.2474947-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Yonghong Song
ae24345da5 bpf: Implement an interface to register bpf_iter targets
The target can call bpf_iter_reg_target() to register itself.
The needed information:
  target:           target name
  seq_ops:          the seq_file operations for the target
  init_seq_private  target callback to initialize seq_priv during file open
  fini_seq_private  target callback to clean up seq_priv during file release
  seq_priv_size:    the private_data size needed by the seq_file
                    operations

The target name represents a target which provides a seq_ops
for iterating objects.

The target can provide two callback functions, init_seq_private
and fini_seq_private, called during file open/release time.
For example, /proc/net/{tcp6, ipv6_route, netlink, ...}, net
name space needs to be setup properly during file open and
released properly during file release.

Function bpf_iter_unreg_target() is also implemented to unregister
a particular target.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175859.2474669-1-yhs@fb.com
2020-05-09 17:05:25 -07:00
Linus Torvalds
78a5255ffb Stop the ad-hoc games with -Wno-maybe-initialized
We have some rather random rules about when we accept the
"maybe-initialized" warnings, and when we don't.

For example, we consider it unreliable for gcc versions < 4.9, but also
if -O3 is enabled, or if optimizing for size.  And then various kernel
config options disabled it, because they know that they trigger that
warning by confusing gcc sufficiently (ie PROFILE_ALL_BRANCHES).

And now gcc-10 seems to be introducing a lot of those warnings too, so
it falls under the same heading as 4.9 did.

At the same time, we have a very straightforward way to _enable_ that
warning when wanted: use "W=2" to enable more warnings.

So stop playing these ad-hoc games, and just disable that warning by
default, with the known and straight-forward "if you want to work on the
extra compiler warnings, use W=123".

Would it be great to have code that is always so obvious that it never
confuses the compiler whether a variable is used initialized or not?
Yes, it would.  In a perfect world, the compilers would be smarter, and
our source code would be simpler.

That's currently not the world we live in, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-09 13:57:10 -07:00
Christian Brauner
f2a8d52e0a
nsproxy: add struct nsset
Add a simple struct nsset. It holds all necessary pieces to switch to a new
set of namespaces without leaving a task in a half-switched state which we
will make use of in the next patch. This patch switches the existing setns
logic over without causing a change in setns() behavior. This brings
setns() closer to how unshare() works(). The prepare_ns() function is
responsible to prepare all necessary information. This has two reasons.
First it minimizes dependencies between individual namespaces, i.e. all
install handler can expect that all fields are properly initialized
independent in what order they are called in. Second, this makes the code
easier to maintain and easier to follow if it needs to be changed.

The prepare_ns() helper will only be switched over to use a flags argument
in the next patch. Here it will still use nstype as a simple integer
argument which was argued would be clearer. I'm not particularly
opinionated about this if it really helps or not. The struct nsset itself
already contains the flags field since its name already indicates that it
can contain information required by different namespaces. None of this
should have functional consequences.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
2020-05-09 13:57:12 +02:00
Vincent Minet
db803036ad umh: fix memory leak on execve failure
If a UMH process created by fork_usermode_blob() fails to execute,
a pair of struct file allocated by umh_pipe_setup() will leak.

Under normal conditions, the caller (like bpfilter) needs to manage the
lifetime of the UMH and its two pipes. But when fork_usermode_blob()
fails, the caller doesn't really have a way to know what needs to be
done. It seems better to do the cleanup ourselves in this case.

Fixes: 449325b52b ("umh: introduce fork_usermode_blob() helper")
Signed-off-by: Vincent Minet <v.minet@criteo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 19:06:55 -07:00
Jakub Kicinski
14d8f7486a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-05-09

The following pull-request contains BPF updates for your *net* tree.

We've added 4 non-merge commits during the last 9 day(s) which contain
a total of 4 files changed, 11 insertions(+), 6 deletions(-).

The main changes are:

1) Fix msg_pop_data() helper incorrectly setting an sge length in some
   cases as well as fixing bpf_tcp_ingress() wrongly accounting bytes
   in sg.size, from John Fastabend.

2) Fix to return an -EFAULT error when copy_to_user() of the value
   fails in map_lookup_and_delete_elem(), from Wei Yongjun.

3) Fix sk_psock refcnt leak in tcp_bpf_recvmsg(), from Xiyu Yang.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 18:58:39 -07:00
J. Bruce Fields
52782c92ac kthread: save thread function
It's handy to keep the kthread_fn just as a unique cookie to identify
classes of kthreads.  E.g. if you can verify that a given task is
running your thread_fn, then you may know what sort of type kthread_data
points to.

We'll use this in nfsd to pass some information into the vfs.  Note it
will need kthread_data() exported too.

Original-patch-by: Tejun Heo <tj@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-08 21:23:10 -04:00
Linus Torvalds
c61529f6f5 Driver core fixes for 5.7-rc5
Here are a number of small driver core fixes for 5.7-rc5 to resolve a
 bunch of reported issues with the current tree.
 
 Biggest here are the reverts and patches from John Stultz to resolve a
 bunch of deferred probe regressions we have been seeing in 5.7-rc right
 now.
 
 Along with those are some other smaller fixes:
 	- coredump crash fix
 	- devlink fix for when permissive mode was enabled
 	- amba and platform device dma_parms fixes
 	- component error silenced for when deferred probe happens
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXrVnyg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylWBgCfbwjUbsDsHsrsVgWfOakIaoPUQ8IAmwetMKvS
 ny1Kq7Cia+2y2e+7fDyo
 =UKEM
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are a number of small driver core fixes for 5.7-rc5 to resolve a
  bunch of reported issues with the current tree.

  Biggest here are the reverts and patches from John Stultz to resolve a
  bunch of deferred probe regressions we have been seeing in 5.7-rc
  right now.

  Along with those are some other smaller fixes:

   - coredump crash fix

   - devlink fix for when permissive mode was enabled

   - amba and platform device dma_parms fixes

   - component error silenced for when deferred probe happens

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-5.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  regulator: Revert "Use driver_deferred_probe_timeout for regulator_init_complete_work"
  driver core: Ensure wait_for_device_probe() waits until the deferred_probe_timeout fires
  driver core: Use dev_warn() instead of dev_WARN() for deferred_probe_timeout warnings
  driver core: Revert default driver_deferred_probe_timeout value to 0
  component: Silence bind error on -EPROBE_DEFER
  driver core: Fix handling of fw_devlink=permissive
  coredump: fix crash when umh is disabled
  amba: Initialize dma_parms for amba devices
  driver core: platform: Initialize dma_parms for platform devices
2020-05-08 09:06:34 -07:00
Linus Torvalds
af38553c66 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 fixes and one selftest to verify the ipc fixes herein"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: limit boost_watermark on small zones
  ubsan: disable UBSAN_ALIGNMENT under COMPILE_TEST
  mm/vmscan: remove unnecessary argument description of isolate_lru_pages()
  epoll: atomically remove wait entry on wake up
  kselftests: introduce new epoll60 testcase for catching lost wakeups
  percpu: make pcpu_alloc() aware of current gfp context
  mm/slub: fix incorrect interpretation of s->offset
  scripts/gdb: repair rb_first() and rb_last()
  eventpoll: fix missing wakeup for ovflist in ep_poll_callback
  arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory()
  scripts/decodecode: fix trapping instruction formatting
  kernel/kcov.c: fix typos in kcov_remote_start documentation
  mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
  mm, memcg: fix error return value of mem_cgroup_css_alloc()
  ipc/mqueue.c: change __do_notify() to bypass check_kill_permission()
2020-05-08 08:41:09 -07:00
Christian Brauner
3f2c788a13
fork: prevent accidental access to clone3 features
Jan reported an issue where an interaction between sign-extending clone's
flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes
clone() to consistently fail with EBADF.

The whole story is a little longer. The legacy clone() syscall is odd in a
bunch of ways and here two things interact. First, legacy clone's flag
argument is word-size dependent, i.e. it's an unsigned long whereas most
system calls with flag arguments use int or unsigned int. Second, legacy
clone() ignores unknown and deprecated flags. The two of them taken
together means that users on 64bit systems can pass garbage for the upper
32bit of the clone() syscall since forever and things would just work fine.
Just try this on a 64bit kernel prior to v5.7-rc1 where this will succeed
and on v5.7-rc1 where this will fail with EBADF:

int main(int argc, char *argv[])
{
        pid_t pid;

        /* Note that legacy clone() has different argument ordering on
         * different architectures so this won't work everywhere.
         *
         * Only set the upper 32 bits.
         */
        pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
                      NULL, NULL, NULL, NULL);
        if (pid < 0)
                exit(EXIT_FAILURE);
        if (pid == 0)
                exit(EXIT_SUCCESS);
        if (wait(NULL) != pid)
                exit(EXIT_FAILURE);

        exit(EXIT_SUCCESS);
}

Since legacy clone() couldn't be extended this was not a problem so far and
nobody really noticed or cared since nothing in the kernel ever bothered to
look at the upper 32 bits.

But once we introduced clone3() and expanded the flag argument in struct
clone_args to 64 bit we opened this can of worms. With the first flag-based
extension to clone3() making use of the upper 32 bits of the flag argument
we've effectively made it possible for the legacy clone() syscall to reach
clone3() only flags. The sign extension scenario is just the odd
corner-case that we needed to figure this out.

The reason we just realized this now and not already when we introduced
CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that a valid cgroup
file descriptor has been given. So the sign extension (or the user
accidently passing garbage for the upper 32 bits) caused the
CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it
didn't find a valid cgroup file descriptor.

Let's fix this by always capping the upper 32 bits for all codepaths that
are not aware of clone3() features. This ensures that we can't reach
clone3() only features by accident via legacy clone as with the sign
extension case and also that legacy clone() works exactly like before, i.e.
ignoring any unknown flags.  This solution risks no regressions and is also
pretty clean.

Fixes: 7f192e3cd3 ("fork: add clone3")
Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: libc-alpha@sourceware.org
Cc: stable@vger.kernel.org # 5.3+
Link: https://sourceware.org/pipermail/libc-alpha/2020-May/113596.html
Link: https://lore.kernel.org/r/20200507103214.77218-1-christian.brauner@ubuntu.com
2020-05-08 17:31:50 +02:00
Thomas Gleixner
97a9474aeb Merge branch 'kcsan-for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/kcsan
Pull KCSAN updates from Paul McKenney.
2020-05-08 14:58:28 +02:00
Eric Biggers
6b0b0fa2bc crypto: lib/sha1 - rename "sha" to "sha1"
The library implementation of the SHA-1 compression function is
confusingly called just "sha_transform()".  Alongside it are some "SHA_"
constants and "sha_init()".  Presumably these are left over from a time
when SHA just meant SHA-1.  But now there are also SHA-2 and SHA-3, and
moreover SHA-1 is now considered insecure and thus shouldn't be used.

Therefore, rename these functions and constants to make it very clear
that they are for SHA-1.  Also add a comment to make it clear that these
shouldn't be used.

For the extra-misleadingly named "SHA_MESSAGE_BYTES", rename it to
SHA1_BLOCK_SIZE and define it to just '64' rather than '(512/8)' so that
it matches the same definition in <crypto/sha.h>.  This prepares for
merging <linux/cryptohash.h> into <crypto/sha.h>.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:17 +10:00
Maciej Grochowski
324cfb1956 kernel/kcov.c: fix typos in kcov_remote_start documentation
Signed-off-by: Maciej Grochowski <maciej.grochowski@pm.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Link: http://lkml.kernel.org/r/20200420030259.31674-1-maciek.grochowski@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07 19:27:20 -07:00
Linus Torvalds
192ffb7515 This includes the following tracing fixes:
- Fix bootconfig causing kernels to fail with CONFIG_BLK_DEV_RAM enabled
  - Fix allocation leaks in bootconfig tool
  - Fix a double initialization of a variable
  - Fix API bootconfig usage from kprobe boot time events
  - Reject NULL location for kprobes
  - Fix crash caused by preempt delay module not cleaning up kthread
    correctly
  - Add vmalloc_sync_mappings() to prevent x86_64 page faults from
    recursively faulting from tracing page faults
  - Fix comment in gpu/trace kerneldoc header
  - Fix documentation of how to create a trace event class
  - Make the local tracing_snapshot_instance_cond() function static
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXrRUBhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qveTAP4iNCnMeS/Isb+MXQx2Pnu7OP+0BeRP
 2ahlKG2sBgEdnwEAoUzxQoYWtfC6xoM38YwLuZPRlcScRea/5CRHyW8BFQc=
 =o3pV
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix bootconfig causing kernels to fail with CONFIG_BLK_DEV_RAM
   enabled

 - Fix allocation leaks in bootconfig tool

 - Fix a double initialization of a variable

 - Fix API bootconfig usage from kprobe boot time events

 - Reject NULL location for kprobes

 - Fix crash caused by preempt delay module not cleaning up kthread
   correctly

 - Add vmalloc_sync_mappings() to prevent x86_64 page faults from
   recursively faulting from tracing page faults

 - Fix comment in gpu/trace kerneldoc header

 - Fix documentation of how to create a trace event class

 - Make the local tracing_snapshot_instance_cond() function static

* tag 'trace-v5.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tools/bootconfig: Fix resource leak in apply_xbc()
  tracing: Make tracing_snapshot_instance_cond() static
  tracing: Fix doc mistakes in trace sample
  gpu/trace: Minor comment updates for gpu_mem_total tracepoint
  tracing: Add a vmalloc_sync_mappings() for safe measure
  tracing: Wait for preempt irq delay thread to finish
  tracing/kprobes: Reject new event if loc is NULL
  tracing/boottime: Fix kprobe event API usage
  tracing/kprobes: Fix a double initialization typo
  bootconfig: Fix to remove bootconfig data from initrd while boot
2020-05-07 15:27:11 -07:00
Josh Poimboeuf
e6eff4376e module: Make module_enable_ro() static again
Now that module_enable_ro() has no more external users, make it static
again.

Suggested-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:43 +02:00
Josh Poimboeuf
5b384f9335 x86/module: Use text_mutex in apply_relocate_add()
Now that the livepatch code no longer needs the text_mutex for changing
module permissions, move its usage down to apply_relocate_add().

Note the s390 version of apply_relocate_add() doesn't need to use the
text_mutex because it already uses s390_kernel_write_lock, which
accomplishes the same task.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:43 +02:00
Josh Poimboeuf
0d9fbf78fe module: Remove module_disable_ro()
module_disable_ro() has no more users.  Remove it.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:43 +02:00
Josh Poimboeuf
d556e1be33 livepatch: Remove module_disable_ro() usage
With arch_klp_init_object_loaded() gone, and apply_relocate_add() now
using text_poke(), livepatch no longer needs to use module_disable_ro().

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:43 +02:00
Josh Poimboeuf
ca376a9374 livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols
Prevent module-specific KLP rela sections from referencing vmlinux
symbols.  This helps prevent ordering issues with module special section
initializations.  Presumably such symbols are exported and normal relas
can be used instead.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:42 +02:00
Peter Zijlstra
1d05334d28 livepatch: Remove .klp.arch
After the previous patch, vmlinux-specific KLP relocations are now
applied early during KLP module load.  This means that .klp.arch
sections are no longer needed for *vmlinux-specific* KLP relocations.

One might think they're still needed for *module-specific* KLP
relocations.  If a to-be-patched module is loaded *after* its
corresponding KLP module is loaded, any corresponding KLP relocations
will be delayed until the to-be-patched module is loaded.  If any
special sections (.parainstructions, for example) rely on those
relocations, their initializations (apply_paravirt) need to be done
afterwards.  Thus the apparent need for arch_klp_init_object_loaded()
and its corresponding .klp.arch sections -- it allows some of the
special section initializations to be done at a later time.

But... if you look closer, that dependency between the special sections
and the module-specific KLP relocations doesn't actually exist in
reality.  Looking at the contents of the .altinstructions and
.parainstructions sections, there's not a realistic scenario in which a
KLP module's .altinstructions or .parainstructions section needs to
access a symbol in a to-be-patched module.  It might need to access a
local symbol or even a vmlinux symbol; but not another module's symbol.
When a special section needs to reference a local or vmlinux symbol, a
normal rela can be used instead of a KLP rela.

Since the special section initializations don't actually have any real
dependency on module-specific KLP relocations, .klp.arch and
arch_klp_init_object_loaded() no longer have a reason to exist.  So
remove them.

As Peter said much more succinctly:

  So the reason for .klp.arch was that .klp.rela.* stuff would overwrite
  paravirt instructions. If that happens you're doing it wrong. Those
  RELAs are core kernel, not module, and thus should've happened in
  .rela.* sections at patch-module loading time.

  Reverting this removes the two apply_{paravirt,alternatives}() calls
  from the late patching path, and means we don't have to worry about
  them when removing module_disable_ro().

[ jpoimboe: Rewrote patch description.  Tweaked klp_init_object_loaded()
	    error path. ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:42 +02:00
Josh Poimboeuf
7c8e2bdd5f livepatch: Apply vmlinux-specific KLP relocations early
KLP relocations are livepatch-specific relocations which are applied to
a KLP module's text or data.  They exist for two reasons:

  1) Unexported symbols: replacement functions often need to access
     unexported symbols (e.g. static functions), which "normal"
     relocations don't allow.

  2) Late module patching: this is the ability for a KLP module to
     bypass normal module dependencies, such that the KLP module can be
     loaded *before* a to-be-patched module.  This means that
     relocations which need to access symbols in the to-be-patched
     module might need to be applied to the KLP module well after it has
     been loaded.

Non-late-patched KLP relocations are applied from the KLP module's init
function.  That usually works fine, unless the patched code wants to use
alternatives, paravirt patching, jump tables, or some other special
section which needs relocations.  Then we run into ordering issues and
crashes.

In order for those special sections to work properly, the KLP
relocations should be applied *before* the special section init code
runs, such as apply_paravirt(), apply_alternatives(), or
jump_label_apply_nops().

You might think the obvious solution would be to move the KLP relocation
initialization earlier, but it's not necessarily that simple.  The
problem is the above-mentioned late module patching, for which KLP
relocations can get applied well after the KLP module is loaded.

To "fix" this issue in the past, we created .klp.arch sections:

  .klp.arch.{module}..altinstructions
  .klp.arch.{module}..parainstructions

Those sections allow KLP late module patching code to call
apply_paravirt() and apply_alternatives() after the module-specific KLP
relocations (.klp.rela.{module}.{section}) have been applied.

But that has a lot of drawbacks, including code complexity, the need for
arch-specific code, and the (per-arch) danger that we missed some
special section -- for example the __jump_table section which is used
for jump labels.

It turns out there's a simpler and more functional approach.  There are
two kinds of KLP relocation sections:

  1) vmlinux-specific KLP relocation sections

     .klp.rela.vmlinux.{sec}

     These are relocations (applied to the KLP module) which reference
     unexported vmlinux symbols.

  2) module-specific KLP relocation sections

     .klp.rela.{module}.{sec}:

     These are relocations (applied to the KLP module) which reference
     unexported or exported module symbols.

Up until now, these have been treated the same.  However, they're
inherently different.

Because of late module patching, module-specific KLP relocations can be
applied very late, thus they can create the ordering headaches described
above.

But vmlinux-specific KLP relocations don't have that problem.  There's
nothing to prevent them from being applied earlier.  So apply them at
the same time as normal relocations, when the KLP module is being
loaded.

This means that for vmlinux-specific KLP relocations, we no longer have
any ordering issues.  vmlinux-referencing jump labels, alternatives, and
paravirt patching will work automatically, without the need for the
.klp.arch hacks.

All that said, for module-specific KLP relocations, the ordering
problems still exist and we *do* still need .klp.arch.  Or do we?  Stay
tuned.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:42 +02:00
Josh Poimboeuf
dcf550e52f livepatch: Disallow vmlinux.ko
This is purely a theoretical issue, but if there were a module named
vmlinux.ko, the livepatch relocation code wouldn't be able to
distinguish between vmlinux-specific and vmlinux.o-specific KLP
relocations.

If CONFIG_LIVEPATCH is enabled, don't allow a module named vmlinux.ko.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2020-05-08 00:12:42 +02:00
Eric W. Biederman
96ecee29b0 exec: Merge install_exec_creds into setup_new_exec
The two functions are now always called one right after the
other so merge them together to make future maintenance easier.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Ungerer <gerg@linux-m68k.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-07 16:55:47 -05:00
Zou Wei
192b7993b3 tracing: Make tracing_snapshot_instance_cond() static
Fix the following sparse warning:

kernel/trace/trace.c:950:6: warning: symbol 'tracing_snapshot_instance_cond'
was not declared. Should it be static?

Link: http://lkml.kernel.org/r/1587614905-48692-1-git-send-email-zou_wei@huawei.com

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-07 13:32:58 -04:00
Steven Rostedt (VMware)
11f5efc3ab tracing: Add a vmalloc_sync_mappings() for safe measure
x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu
areas can be complex, to say the least. Mappings may happen at boot up, and
if nothing synchronizes the page tables, those page mappings may not be
synced till they are used. This causes issues for anything that might touch
one of those mappings in the path of the page fault handler. When one of
those unmapped mappings is touched in the page fault handler, it will cause
another page fault, which in turn will cause a page fault, and leave us in
a loop of page faults.

Commit 763802b53a ("x86/mm: split vmalloc_sync_all()") split
vmalloc_sync_all() into vmalloc_sync_unmappings() and
vmalloc_sync_mappings(), as on system exit, it did not need to do a full
sync on x86_64 (although it still needed to be done on x86_32). By chance,
the vmalloc_sync_all() would synchronize the page mappings done at boot up
and prevent the per cpu area from being a problem for tracing in the page
fault handler. But when that synchronization in the exit of a task became a
nop, it caused the problem to appear.

Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home

Cc: stable@vger.kernel.org
Fixes: 737223fbca ("tracing: Consolidate buffer allocation code")
Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Suggested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-07 13:32:57 -04:00
Steven Rostedt (VMware)
d16a8c3107 tracing: Wait for preempt irq delay thread to finish
Running on a slower machine, it is possible that the preempt delay kernel
thread may still be executing if the module was immediately removed after
added, and this can cause the kernel to crash as the kernel thread might be
executing after its code has been removed.

There's no reason that the caller of the code shouldn't just wait for the
delay thread to finish, as the thread can also be created by a trigger in
the sysfs code, which also has the same issues.

Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com

Cc: stable@vger.kernel.org
Fixes: 793937236d ("lib: Add module for testing preemptoff/irqsoff latency tracers")
Reported-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-07 13:32:40 -04:00
Paul E. McKenney
f736e0f1a5 Merge branches 'fixes.2020.04.27a', 'kfree_rcu.2020.04.27a', 'rcu-tasks.2020.04.27a', 'stall.2020.04.27a' and 'torture.2020.05.07a' into HEAD
fixes.2020.04.27a:  Miscellaneous fixes.
kfree_rcu.2020.04.27a:  Changes related to kfree_rcu().
rcu-tasks.2020.04.27a:  Addition of new RCU-tasks flavors.
stall.2020.04.27a:  RCU CPU stall-warning updates.
torture.2020.05.07a:  Torture-test updates.
2020-05-07 10:18:32 -07:00
Paul E. McKenney
3c80b40245 rcutorture: Convert ULONG_CMP_LT() to time_before()
This commit converts three ULONG_CMP_LT() invocations in rcutorture to
time_before() to reflect the fact that they are comparing timestamps to
the jiffies counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-07 10:15:29 -07:00
Jason Yan
afbc1574f1 rcutorture: Make rcu_fwds and rcu_fwd_emergency_stop static
This commit fixes the following sparse warning:

kernel/rcu/rcutorture.c:1695:16: warning: symbol 'rcu_fwds' was not declared. Should it be static?
kernel/rcu/rcutorture.c:1696:6: warning: symbol 'rcu_fwd_emergency_stop' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-07 10:15:29 -07:00
Paul E. McKenney
55b2dcf587 rcu: Allow rcutorture to starve grace-period kthread
This commit provides an rcutorture.stall_gp_kthread module parameter
to allow rcutorture to starve the grace-period kthread.  This allows
testing the code that detects such starvation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-07 10:15:28 -07:00
Paul E. McKenney
19a8ff956c rcutorture: Add flag to produce non-busy-wait task stalls
This commit aids testing of RCU task stall warning messages by adding
an rcutorture.stall_cpu_block module parameter that results in the
induced stall sleeping within the RCU read-side critical section.
Spinning with interrupts disabled is still available via the
rcutorture.stall_cpu_irqsoff module parameter, and specifying neither
of these two module parameters will spin with preemption disabled.

Note that sleeping (as opposed to preemption) results in additional
complaints from RCU at context-switch time, so yet more testing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-07 10:15:28 -07:00
Andy Shevchenko
a135020736 kgdb: Drop malformed kernel doc comment
Kernel doc does not understand POD variables to be referred to.

.../debug_core.c:73: warning: cannot understand function prototype:
'int                             kgdb_connected; '

Convert kernel doc to pure comment.

Fixes: dc7d552705 ("kgdb: core")
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-05-07 15:16:19 +01:00
Qais Yousef
fb7fb84a0c cpu/hotplug: Remove __freeze_secondary_cpus()
The refactored function is no longer required as the codepaths that call
freeze_secondary_cpus() are all suspend/resume related now.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-2-qais.yousef@arm.com
2020-05-07 15:18:41 +02:00
Qais Yousef
5655585589 cpu/hotplug: Remove disable_nonboot_cpus()
The single user could have called freeze_secondary_cpus() directly.

Since this function was a source of confusion, remove it as it's
just a pointless wrapper.

While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to
preserve the naming symmetry.

Done automatically via:

	git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g'

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-1-qais.yousef@arm.com
2020-05-07 15:18:40 +02:00
David S. Miller
3793faad7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts were all overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 22:10:13 -07:00
Marco Elver
19acd03d95 kcsan: Add __kcsan_{enable,disable}_current() variants
The __kcsan_{enable,disable}_current() variants only call into KCSAN if
KCSAN is enabled for the current compilation unit. Note: This is
typically not what we want, as we usually want to ensure that even calls
into other functions still have KCSAN disabled.

These variants may safely be used in header files that are shared
between regular kernel code and code that does not link the KCSAN
runtime.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-05-06 10:58:46 -07:00
Masami Hiramatsu
5b4dcd2d20 tracing/kprobes: Reject new event if loc is NULL
Reject the new event which has NULL location for kprobes.
For kprobes, user must specify at least the location.

Link: http://lkml.kernel.org/r/158779376597.6082.1411212055469099461.stgit@devnote2

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 2a588dd1d5 ("tracing: Add kprobe event command generation functions")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-06 09:04:11 -04:00
Masami Hiramatsu
da0f1f4167 tracing/boottime: Fix kprobe event API usage
Fix boottime kprobe events to use API correctly for
multiple events.

For example, when we set a multiprobe kprobe events in
bootconfig like below,

  ftrace.event.kprobes.myevent {
  	probes = "vfs_read $arg1 $arg2", "vfs_write $arg1 $arg2"
  }

This cause an error;

  trace_boot: Failed to add probe: p:kprobes/myevent (null)  vfs_read $arg1 $arg2  vfs_write $arg1 $arg2

This shows the 1st argument becomes NULL and multiprobes
are merged to 1 probe.

Link: http://lkml.kernel.org/r/158779375766.6082.201939936008972838.stgit@devnote2

Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 29a1548105 ("tracing: Change trace_boot to use kprobe_event interface")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-06 09:04:11 -04:00
Masami Hiramatsu
dcbd21c9fc tracing/kprobes: Fix a double initialization typo
Fix a typo that resulted in an unnecessary double
initialization to addr.

Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote2

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: c7411a1a12 ("tracing/kprobe: Check whether the non-suffixed symbol is notrace")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-06 09:04:11 -04:00
Christoph Hellwig
c3b3f52476 signal: refactor copy_siginfo_to_user32
Factor out a copy_siginfo_to_external32 helper from
copy_siginfo_to_user32 that fills out the compat_siginfo, but does so
on a kernel space data structure.  With that we can let architectures
override copy_siginfo_to_user32 with their own implementations using
copy_siginfo_to_external32.  That allows moving the x32 SIGCHLD purely
to x86 architecture code.

As a nice side effect copy_siginfo_to_external32 also comes in handy
for avoiding a set_fs() call in the coredump code later on.

Contains improvements from Eric W. Biederman <ebiederm@xmission.com>
and Arnd Bergmann <arnd@arndb.de>.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-05 16:46:09 -04:00
Arnd Bergmann
5447e8e01e sysctl: Fix unused function warning
The newly added bpf_stats_handler function has the wrong #ifdef
check around it, leading to an unused-function warning when
CONFIG_SYSCTL is disabled:

kernel/sysctl.c:205:12: error: unused function 'bpf_stats_handler' [-Werror,-Wunused-function]
static int bpf_stats_handler(struct ctl_table *table, int write,

Fix the check to match the reference.

Fixes: d46edd671a ("bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200505140734.503701-1-arnd@arndb.de
2020-05-05 11:59:32 -07:00
Sean Fu
f187b6974f workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.
Replace inline function PTR_ERR_OR_ZERO with IS_ERR and PTR_ERR to
remove redundant parameter definitions and checks.
Reduce code size.
Before:
   text	   data	    bss	    dec	    hex	filename
  47510	   5979	    840	  54329	   d439	kernel/workqueue.o
After:
   text	   data	    bss	    dec	    hex	filename
  47474	   5979	    840	  54293	   d415	kernel/workqueue.o

Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-05 11:56:07 -04:00
Andrii Nakryiko
138c67677f bpf: Fix use-after-free of bpf_link when priming half-fails
If bpf_link_prime() succeeds to allocate new anon file, but then fails to
allocate ID for it, link priming is considered to be failed and user is
supposed ot be able to directly kfree() bpf_link, because it was never exposed
to user-space.

But at that point file already keeps a pointer to bpf_link and will eventually
call bpf_link_release(), so if bpf_link was kfree()'d by caller, that would
lead to use-after-free.

Fix this by first allocating ID and only then allocating file. Adding ID to
link_idr is ok, because link at that point still doesn't have its ID set, so
no user-space process can create a new FD for it.

Fixes: a3b80e1078 ("bpf: Allocate ID for bpf_link")
Reported-by: syzbot+39b64425f91b5aab714d@syzkaller.appspotmail.com
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200501185622.3088964-1-andriin@fb.com
2020-05-01 15:13:05 -07:00
Song Liu
d46edd671a bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS
Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
Typical userspace tools use kernel.bpf_stats_enabled as follows:

  1. Enable kernel.bpf_stats_enabled;
  2. Check program run_time_ns;
  3. Sleep for the monitoring period;
  4. Check program run_time_ns again, calculate the difference;
  5. Disable kernel.bpf_stats_enabled.

The problem with this approach is that only one userspace tool can toggle
this sysctl. If multiple tools toggle the sysctl at the same time, the
measurement may be inaccurate.

To fix this problem while keep backward compatibility, introduce a new
bpf command BPF_ENABLE_STATS. On success, this command enables stats and
returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
command to support other types of stats in the future.

With BPF_ENABLE_STATS, user space tool would have the following flow:

  1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
  2. Check program run_time_ns;
  3. Sleep for the monitoring period;
  4. Check program run_time_ns again, calculate the difference;
  5. Close the fd.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
2020-05-01 10:36:32 -07:00
Zheng Bin
db9ff6ecf6 audit: make symbol 'audit_nfcfgs' static
Fix sparse warnings:

kernel/auditsc.c:138:32: warning: symbol 'audit_nfcfgs' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-05-01 12:24:07 -04:00
Christophe Leroy
41cd780524 uaccess: Selectively open read or write user access
When opening user access to only perform reads, only open read access.
When opening user access to only perform writes, only open write
access.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2e73bc57125c2c6ab12a587586a4eed3a47105fc.1585898438.git.christophe.leroy@c-s.fr
2020-05-01 12:35:21 +10:00
Wei Yang
b1d1779e5e sched/core: Simplify sched_init()
Currently root_task_group.shares and cfs_bandwidth are initialized for
each online cpu, which not necessary.

Let's take it out to do it only once.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200423214443.29994-1-richard.weiyang@gmail.com
2020-04-30 20:14:42 +02:00
Muchun Song
17c891ab34 sched/fair: Use __this_cpu_read() in wake_wide()
The code is executed with preemption(and interrupts) disabled,
so it's safe to use __this_cpu_write().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200421144123.33580-1-songmuchun@bytedance.com
2020-04-30 20:14:41 +02:00
Peter Zijlstra
bf2c59fce4 sched/core: Fix illegal RCU from offline CPUs
In the CPU-offline process, it calls mmdrop() after idle entry and the
subsequent call to cpuhp_report_idle_dead(). Once execution passes the
call to rcu_report_dead(), RCU is ignoring the CPU, which results in
lockdep complaining when mmdrop() uses RCU from either memcg or
debugobjects below.

Fix it by cleaning up the active_mm state from BP instead. Every arch
which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
from AP. The only exception is parisc because it switches them to
&init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
but the patch will still work there because it calls mmgrab(&init_mm) in
smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().

  WARNING: suspicious RCU usage
  -----------------------------
  kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!

  other info that might help us debug this:

  RCU used illegally from offline CPU!
  Call Trace:
   dump_stack+0xf4/0x164 (unreliable)
   lockdep_rcu_suspicious+0x140/0x164
   get_work_pool+0x110/0x150
   __queue_work+0x1bc/0xca0
   queue_work_on+0x114/0x120
   css_release+0x9c/0xc0
   percpu_ref_put_many+0x204/0x230
   free_pcp_prepare+0x264/0x570
   free_unref_page+0x38/0xf0
   __mmdrop+0x21c/0x2c0
   idle_task_exit+0x170/0x1b0
   pnv_smp_cpu_kill_self+0x38/0x2e0
   cpu_die+0x48/0x64
   arch_cpu_idle_dead+0x30/0x50
   do_idle+0x2f4/0x470
   cpu_startup_entry+0x38/0x40
   start_secondary+0x7a8/0xa80
   start_secondary_resume+0x10/0x14

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
2020-04-30 20:14:41 +02:00
Muchun Song
f38f12d1e0 sched/fair: Mark sched_init_granularity __init
Function sched_init_granularity() is only called from __init
functions, so mark it __init as well.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20200406074750.56533-1-songmuchun@bytedance.com
2020-04-30 20:14:41 +02:00
Huaixin Chang
5a6d6a6ccb sched/fair: Refill bandwidth before scaling
In order to prevent possible hardlockup of sched_cfs_period_timer()
loop, loop count is introduced to denote whether to scale quota and
period or not. However, scale is done between forwarding period timer
and refilling cfs bandwidth runtime, which means that period timer is
forwarded with old "period" while runtime is refilled with scaled
"quota".

Move do_sched_cfs_period_timer() before scaling to solve this.

Fixes: 2e8e192263 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200420024421.22442-3-changhuaixin@linux.alibaba.com
2020-04-30 20:14:40 +02:00
Chen Yu
457d1f4657 sched: Extract the task putting code from pick_next_task()
Introduce a new function put_prev_task_balance() to do the balance
when necessary, and then put previous task back to the run queue.
This function is extracted from pick_next_task() to prepare for
future usage by other type of task picking logic.

No functional change.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/5a99860cf66293db58a397d6248bcb2eee326776.1587464698.git.yu.c.chen@intel.com
2020-04-30 20:14:40 +02:00
Chen Yu
d91cecc156 sched: Make newidle_balance() static again
After Commit 6e2df0581f ("sched: Fix pick_next_task() vs 'change'
pattern race"), there is no need to expose newidle_balance() as it
is only used within fair.c file. Change this function back to static again.

No functional change.

Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/83cd3030b031ca5d646cd5e225be10e7a0fdd8f5.1587464698.git.yu.c.chen@intel.com
2020-04-30 20:14:40 +02:00
Valentin Schneider
36c5bdc438 sched/topology: Kill SD_LOAD_BALANCE
That flag is set unconditionally in sd_init(), and no one checks for it
anymore. Remove it.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-5-valentin.schneider@arm.com
2020-04-30 20:14:39 +02:00
Valentin Schneider
e669ac8ab9 sched: Remove checks against SD_LOAD_BALANCE
The SD_LOAD_BALANCE flag is set unconditionally for all domains in
sd_init(). By making the sched_domain->flags syctl interface read-only, we
have removed the last piece of code that could clear that flag - as such,
it will now be always present. Rather than to keep carrying it along, we
can work towards getting rid of it entirely.

cpusets don't need it because they can make CPUs be attached to the NULL
domain (e.g. cpuset with sched_load_balance=0), or to a partitioned
root_domain, i.e. a sched_domain hierarchy that doesn't span the entire
system (e.g. root cpuset with sched_load_balance=0 and sibling cpusets with
sched_load_balance=1).

isolcpus apply the same "trick": isolated CPUs are explicitly taken out of
the sched_domain rebuild (using housekeeping_cpumask()), so they get the
NULL domain treatment as well.

Remove the checks against SD_LOAD_BALANCE.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-4-valentin.schneider@arm.com
2020-04-30 20:14:39 +02:00
Valentin Schneider
9818427c62 sched/debug: Make sd->flags sysctl read-only
Writing to the sysctl of a sched_domain->flags directly updates the value of
the field, and goes nowhere near update_top_cache_domain(). This means that
the cached domain pointers can end up containing stale data (e.g. the
domain pointed to doesn't have the relevant flag set anymore).

Explicit domain walks that check for flags will be affected by
the write, but this won't be in sync with the cached pointers which will
still point to the domains that were cached at the last sched_domain
build.

In other words, writing to this interface is playing a dangerous game. It
could be made to trigger an update of the cached sched_domain pointers when
written to, but this does not seem to be worth the trouble. Make it
read-only.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com
2020-04-30 20:14:39 +02:00
Valentin Schneider
45da27732b sched/fair: find_idlest_group(): Remove unused sd_flag parameter
The last use of that parameter was removed by commit

  57abff067a ("sched/fair: Rework find_idlest_group()")

Get rid of the parameter.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200415210512.805-2-valentin.schneider@arm.com
2020-04-30 20:14:39 +02:00
Jann Horn
586b58cac8 exit: Move preemption fixup up, move blocking operations down
With CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_CGROUPS=y, kernel oopses in
non-preemptible context look untidy; after the main oops, the kernel prints
a "sleeping function called from invalid context" report because
exit_signals() -> cgroup_threadgroup_change_begin() -> percpu_down_read()
can sleep, and that happens before the preempt_count_set(PREEMPT_ENABLED)
fixup.

It looks like the same thing applies to profile_task_exit() and
kcov_task_exit().

Fix it by moving the preemption fixup up and the calls to
profile_task_exit() and kcov_task_exit() down.

Fixes: 1dc0fffc48 ("sched/core: Robustify preemption leak checks")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200305220657.46800-1-jannh@google.com
2020-04-30 20:14:38 +02:00
Peng Wang
64297f2b03 sched/fair: Simplify the code of should_we_balance()
We only consider group_balance_cpu() after there is no idle
cpu. So, just do comparison before return at these two cases.

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/245c792f0e580b3ca342ad61257f4c066ee0f84f.1586594833.git.rocking@linux.alibaba.com
2020-04-30 20:14:38 +02:00
Josh Don
ab93a4bc95 sched/fair: Remove distribute_running from CFS bandwidth
This is mostly a revert of commit:

  baa9be4ffb ("sched/fair: Fix throttle_list starvation with low CFS quota")

The primary use of distribute_running was to determine whether to add
throttled entities to the head or the tail of the throttled list. Now
that we always add to the tail, we can remove this field.

The other use of distribute_running is in the slack_timer, so that we
don't start a distribution while one is already running. However, even
in the event that this race occurs, it is fine to have two distributions
running (especially now that distribute grabs the cfs_b->lock to
determine remaining quota before assigning).

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200410225208.109717-3-joshdon@google.com
2020-04-30 20:14:38 +02:00
Paul Turner
e98fa02c4f sched/fair: Eliminate bandwidth race between throttling and distribution
There is a race window in which an entity begins throttling before quota
is added to the pool, but does not finish throttling until after we have
finished with distribute_cfs_runtime(). This entity is not observed by
distribute_cfs_runtime() because it was not on the throttled list at the
time that distribution was running. This race manifests as rare
period-length statlls for such entities.

Rather than heavy-weight the synchronization with the progress of
distribution, we can fix this by aborting throttling if bandwidth has
become available. Otherwise, we immediately add the entity to the
throttled list so that it can be observed by a subsequent distribution.

Additionally, we can remove the case of adding the throttled entity to
the head of the throttled list, and simply always add to the tail.
Thanks to 26a8b12747, distribute_cfs_runtime() no longer holds onto
its own pool of runtime. This means that if we do hit the !assign and
distribute_running case, we know that distribution is about to end.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200410225208.109717-2-joshdon@google.com
2020-04-30 20:14:38 +02:00
Xie XiuQi
f080d93e1d sched/debug: Fix trival print_task() format
Ensure leave one space between state and task name.

w/o patch:
runnable tasks:
 S           task   PID         tree-key  switches  prio     wait
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200414125721.195801-1-xiexiuqi@huawei.com
2020-04-30 20:14:37 +02:00
Barret Rhoden
2ed6edd33a perf: Add cond_resched() to task_function_call()
Under rare circumstances, task_function_call() can repeatedly fail and
cause a soft lockup.

There is a slight race where the process is no longer running on the cpu
we targeted by the time remote_function() runs.  The code will simply
try again.  If we are very unlucky, this will continue to fail, until a
watchdog fires.  This can happen in a heavily loaded, multi-core virtual
machine.

Reported-by: syzbot+bb4935a5c09b5ff79940@syzkaller.appspotmail.com
Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200414222920.121401-1-brho@google.com
2020-04-30 20:14:36 +02:00
Wei Yongjun
7f645462ca bpf: Fix error return code in map_lookup_and_delete_elem()
Fix to return negative error code -EFAULT from the copy_to_user() error
handling case instead of 0, as done elsewhere in this function.

Fixes: bd513cd08f ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200430081851.166996-1-weiyongjun1@huawei.com
2020-04-30 16:19:08 +02:00
Eric W. Biederman
2dd8083f6d posix-cpu-timers: Use pids not tasks in lookup
The current posix-cpu-timer code uses pids when holding persistent
references in timers.  However the lookups from clock_id_t still return
tasks that need to be converted into pids for use.

This results in usage being pid->task->pid and that can race with
release_task and de_thread.  This can lead to some not wrong but
surprising results.  Surprising enough that Oleg and I both thought
there were some bugs in the code for a while.

This set of changes modifies the code to just lookup, verify, and return
pids from the clockid_t lookups to remove those potentialy troublesome
races.

Eric W. Biederman (3):
      posix-cpu-timers: Extend rcu_read_lock removing task_struct references
      posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
      posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock

 kernel/time/posix-cpu-timers.c | 102 ++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 57 deletions(-)

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-30 06:43:44 -05:00
Oleg Nesterov
1dd694a1b7 remove the no longer needed pid_alive() check in __task_pid_nr_ns()
Starting from 2c4704756c ("pids: Move the pgrp and session pid pointers
from task_struct to signal_struct") __task_pid_nr_ns() doesn't dereference
task->group_leader, we can remove the pid_alive() check.

pid_nr_ns() has to check pid != NULL anyway, pid_alive() just adds the
unnecessary confusion.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-30 06:40:14 -05:00
Daniel Jordan
3c2214b602 padata: add separate cpuhp node for CPUHP_PADATA_DEAD
Removing the pcrypt module triggers this:

  general protection fault, probably for non-canonical
    address 0xdead000000000122
  CPU: 5 PID: 264 Comm: modprobe Not tainted 5.6.0+ #2
  Hardware name: QEMU Standard PC
  RIP: 0010:__cpuhp_state_remove_instance+0xcc/0x120
  Call Trace:
   padata_sysfs_release+0x74/0xce
   kobject_put+0x81/0xd0
   padata_free+0x12/0x20
   pcrypt_exit+0x43/0x8ee [pcrypt]

padata instances wrongly use the same hlist node for the online and dead
states, so __padata_free()'s second cpuhp remove call chokes on the node
that the first poisoned.

cpuhp multi-instance callbacks only walk forward in cpuhp_step->list and
the same node is linked in both the online and dead lists, so the list
corruption that results from padata_alloc() adding the node to a second
list without removing it from the first doesn't cause problems as long
as no instances are freed.

Avoid the issue by giving each state its own node.

Fixes: 894c9ef978 ("padata: validate cpumask without removed CPU during offline")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-04-30 15:19:33 +10:00
Arnd Bergmann
449e14bfdb bpf: Fix unused variable warning
Hiding the only using of bpf_link_type_strs[] in an #ifdef causes
an unused-variable warning:

kernel/bpf/syscall.c:2280:20: error: 'bpf_link_type_strs' defined but not used [-Werror=unused-variable]
 2280 | static const char *bpf_link_type_strs[] = {

Move the definition into the same #ifdef.

Fixes: f2e10bff16 ("bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200429132217.1294289-1-arnd@arndb.de
2020-04-30 00:21:22 +02:00
Jakub Sitnicki
64d85290d7 bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH
White-list map lookup for SOCKMAP/SOCKHASH from BPF. Lookup returns a
pointer to a full socket and acquires a reference if necessary.

To support it we need to extend the verifier to know that:

 (1) register storing the lookup result holds a pointer to socket, if
     lookup was done on SOCKMAP/SOCKHASH, and that

 (2) map lookup on SOCKMAP/SOCKHASH is a reference acquiring operation,
     which needs a corresponding reference release with bpf_sk_release.

On sock_map side, lookup handlers exposed via bpf_map_ops now bump
sk_refcnt if socket is reference counted. In turn, bpf_sk_select_reuseport,
the only in-kernel user of SOCKMAP/SOCKHASH ops->map_lookup_elem, was
updated to release the reference.

Sockets fetched from a map can be used in the same way as ones returned by
BPF socket lookup helpers, such as bpf_sk_lookup_tcp. In particular, they
can be used with bpf_sk_assign to direct packets toward a socket on TC
ingress path.

Suggested-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com
2020-04-29 23:30:59 +02:00
Eric W. Biederman
964987738b posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
Now that the codes store references to pids instead of referendes to
tasks.  Looking up a task for a clock instead of looking up a struct
pid makes the code more difficult to verify it is correct than
necessary.

In posix_cpu_timers_create get_task_pid can race with release_task for
threads and return a NULL pid.  As put_pid and cpu_timer_task_rcu
handle NULL pids just fine the code works without problems but it is
an extra case to consider and keep in mind while verifying and
modifying the code.

There are races with de_thread to consider that only don't apply
because thread clocks are only allowed for threads in the same
thread_group.

So instead of leaving a burden for people making modification to the
code in the future return a rcu protected struct pid for the clock
instead.

The logic for __get_task_for_pid and lookup_task has been folded into
the new function pid_for_clock with the only change being the logic
has been modified from working on a task to working on a pid that
will be returned.

In posix_cpu_clock_get instead of calling pid_for_clock checking the
result and then calling pid_task to get the task.  The result of
pid_for_clock is fed directly into pid_task.  This is safe because
pid_task handles NULL pids.  As such an extra error check was
unnecessary.

Instead of hiding the flag that enables the special clock_gettime
handling, I have made the 3 callers just pass the flag in themselves.
That is less code and seems just as simple to work with as the
wrapper functions.

Historically the clock_gettime special case of allowing a process
clock to be found by the thread id did not even exist [33ab0fec33]
but Thomas Gleixner reports that he has found code that uses that
functionality [55e8c8eb2c].

Link: https://lkml.kernel.org/r/87zhaxqkwa.fsf@nanos.tec.linutronix.de/
Ref: 33ab0fec33 ("posix-timers: Consolidate posix_cpu_clock_get()")
Ref: 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a task")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-29 08:14:41 -05:00
Eric W. Biederman
fece98260f posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
Taking a clock and returning a pid_type is a more general and
a superset of taking a timer and returning a pid_type.

Perform this generalization so that future changes may use
this code on clocks as well as timers.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-29 07:14:51 -05:00
Eric W. Biederman
9bf7c32409 posix-cpu-timers: Extend rcu_read_lock removing task_struct references
Now that the code stores of pid references it is no longer necessary
or desirable to take a reference on task_struct in __get_task_for_clock.

Instead extend the scope of rcu_read_lock and remove the reference
counting on struct task_struct entirely.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-29 07:14:12 -05:00
Andrii Nakryiko
f2e10bff16 bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link
Add ability to fetch bpf_link details through BPF_OBJ_GET_INFO_BY_FD command.
Also enhance show_fdinfo to potentially include bpf_link type-specific
information (similarly to obj_info).

Also introduce enum bpf_link_type stored in bpf_link itself and expose it in
UAPI. bpf_link_tracing also now will store and return bpf_attach_type.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-5-andriin@fb.com
2020-04-28 17:27:08 -07:00
Andrii Nakryiko
2d602c8cf4 bpf: Support GET_FD_BY_ID and GET_NEXT_ID for bpf_link
Add support to look up bpf_link by ID and iterate over all existing bpf_links
in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
done as the very last step in finalizing bpf_link, together with installing
FD. This approach allows users of bpf_link in kernel code to not worry about
races between user-space and kernel code that hasn't finished attaching and
initializing bpf_link.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-4-andriin@fb.com
2020-04-28 17:27:08 -07:00
Andrii Nakryiko
a3b80e1078 bpf: Allocate ID for bpf_link
Generate ID for each bpf_link using IDR, similarly to bpf_map and bpf_prog.
bpf_link creation, initialization, attachment, and exposing to user-space
through FD and ID is a complicated multi-step process, abstract it away
through bpf_link_primer and bpf_link_prime(), bpf_link_settle(), and
bpf_link_cleanup() internal API. They guarantee that until bpf_link is
properly attached, user-space won't be able to access partially-initialized
bpf_link either from FD or ID. All this allows to simplify bpf_link attachment
and error handling code.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-3-andriin@fb.com
2020-04-28 17:27:08 -07:00
Andrii Nakryiko
f9d041271c bpf: Refactor bpf_link update handling
Make bpf_link update support more generic by making it into another
bpf_link_ops methods. This allows generic syscall handling code to be agnostic
to various conditionally compiled features (e.g., the case of
CONFIG_CGROUP_BPF). This also allows to keep link type-specific code to remain
static within respective code base. Refactor existing bpf_cgroup_link code and
take advantage of this.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-2-andriin@fb.com
2020-04-28 17:27:07 -07:00
Richard Guy Briggs
a45d88530b netfilter: add audit table unregister actions
Audit the action of unregistering ebtables and x_tables.

See: https://github.com/linux-audit/audit-kernel/issues/44

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-28 18:11:36 -04:00
Richard Guy Briggs
c4dad0aab3 audit: tidy and extend netfilter_cfg x_tables
NETFILTER_CFG record generation was inconsistent for x_tables and
ebtables configuration changes.  The call was needlessly messy and there
were supporting records missing at times while they were produced when
not requested.  Simplify the logging call into a new audit_log_nfcfg
call.  Honour the audit_enabled setting while more consistently
recording information including supporting records by tidying up dummy
checks.

Add an op= field that indicates the operation being performed (register
or replace).

Here is the enhanced sample record:
  type=NETFILTER_CFG msg=audit(1580905834.919:82970): table=filter family=2 entries=83 op=replace

Generate audit NETFILTER_CFG records on ebtables table registration.
Previously this was being done for x_tables registration and replacement
operations and ebtables table replacement only.

See: https://github.com/linux-audit/audit-kernel/issues/25
See: https://github.com/linux-audit/audit-kernel/issues/35
See: https://github.com/linux-audit/audit-kernel/issues/43

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-28 17:52:42 -04:00
Eric W. Biederman
c7f5194054 posix-cpu-timer: Unify the now redundant code in lookup_task
Now that both !thread paths through lookup_task call
thread_group_leader, unify them into the single test at the end of
lookup_task.

This unification just makes it clear what is happening in the gettime
special case of lookup_task.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-28 16:46:04 -05:00
Eric W. Biederman
8feebc6713 posix-cpu-timer: Tidy up group_leader logic in lookup_task
Replace has_group_leader_pid with thread_group_leader.  Years ago Oleg
suggested changing thread_group_leader to has_group_leader_pid to handle
races.  Looking at the code then and now I don't see how it ever helped.
Especially as then the code really did need to be the
thread_group_leader.

Today it doesn't make a difference if thread_group_leader races with
de_thread as the task returned from lookup_task in the non-thread case is
just used to find values in task->signal.

Since the races with de_thread have never been handled revert
has_group_header_pid to thread_group_leader for clarity.

Update the comment in lookup_task to remove implementation details that
are no longer true and to mention task->signal instead of task->sighand,
as the relevant cpu timer details are all in task->signal.

Ref: 55e8c8eb2c ("posix-cpu-timers: Store a reference to a pid not a task")
Ref: c0deae8c95 ("posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-28 16:42:03 -05:00
Eric W. Biederman
6b03d1304a proc: Ensure we see the exit of each process tid exactly once
When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d49 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization").  Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently.  Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
   has_group_leader_pid is called, and reording some of the assignments
   should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
>	/*
>	 * This can only happen if the caller is de_thread().
>	 * FIXME: this is the temporary hack, we should teach
>	 * posix-cpu-timers to handle this case correctly.
>	 */
>	if (unlikely(has_group_leader_pid(tsk)))
>		posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Link: https://lore.kernel.org/lkml/87h7x3ajll.fsf_-_@x220.int.ebiederm.org/
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 48e6484d49 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-04-28 16:13:58 -05:00
Daniel Borkmann
0b54142e4b Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-28 21:23:38 +02:00
Luis Chamberlain
3740d93e37 coredump: fix crash when umh is disabled
Commit 64e90a8acb ("Introduce STATIC_USERMODEHELPER to mediate
call_usermodehelper()") added the optiont to disable all
call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to
an empty string. When this is done, and crashdump is triggered, it
will crash on null pointer dereference, since we make assumptions
over what call_usermodehelper_exec() did.

This has been reported by Sergey when one triggers a a coredump
with the following configuration:

```
CONFIG_STATIC_USERMODEHELPER=y
CONFIG_STATIC_USERMODEHELPER_PATH=""
kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
```

The way disabling the umh was designed was that call_usermodehelper_exec()
would just return early, without an error. But coredump assumes
certain variables are set up for us when this happens, and calls
ile_start_write(cprm.file) with a NULL file.

[    2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020
[    2.819859] #PF: supervisor read access in kernel mode
[    2.820035] #PF: error_code(0x0000) - not-present page
[    2.820188] PGD 0 P4D 0
[    2.820305] Oops: 0000 [#1] SMP PTI
[    2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7
[    2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
[    2.821150] RIP: 0010:do_coredump+0xd80/0x1060
[    2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff
ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff
ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8
0 0f 84 9c 01 00 00 44
[    2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246
[    2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a
[    2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000
[    2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00
[    2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0
[    2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000
[    2.824285] FS:  00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000
[    2.824767] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0
[    2.825479] Call Trace:
[    2.825790]  get_signal+0x11e/0x720
[    2.826087]  do_signal+0x1d/0x670
[    2.826361]  ? force_sig_info_to_task+0xc1/0xf0
[    2.826691]  ? force_sig_fault+0x3c/0x40
[    2.826996]  ? do_trap+0xc9/0x100
[    2.827179]  exit_to_usermode_loop+0x49/0x90
[    2.827359]  prepare_exit_to_usermode+0x77/0xb0
[    2.827559]  ? invalid_op+0xa/0x30
[    2.827747]  ret_from_intr+0x20/0x20
[    2.827921] RIP: 0033:0x55e2c76d2129
[    2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01
c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89
e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0
0 00 00 00 00 0f 1f 40
[    2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246
[    2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718
[    2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001
[    2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00
[    2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040
[    2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    2.829964] CR2: 0000000000000020
[    2.830149] ---[ end trace ceed83d8c68a1bf1 ]---
```

Cc: <stable@vger.kernel.org> # v4.11+
Fixes: 64e90a8acb ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()")
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795
Reported-by: Tony Vroon <chainsaw@gentoo.org>
Reported-by: Sergey Kvachonok <ravenexp@gmail.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-28 17:54:13 +02:00
Richard Guy Briggs
9d2161bed4 audit: log audit netlink multicast bind and unbind
Log information about programs connecting to and disconnecting from the
audit netlink multicast socket. This is needed so that during
investigations a security officer can tell who or what had access to the
audit trail.  This helps to meet the FAU_SAR.2 requirement for Common
Criteria.

Here is the systemd startup event:
type=PROCTITLE msg=audit(2020-04-22 10:10:21.787:10) : proctitle=/init
type=SYSCALL msg=audit(2020-04-22 10:10:21.787:10) : arch=x86_64 syscall=bind success=yes exit=0 a0=0x19 a1=0x555f4aac7e90 a2=0xc a3=0x7ffcb792ff44 items=0 ppid=0 pid=1 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=systemd exe=/usr/lib/systemd/systemd subj=kernel key=(null)
type=UNKNOWN[1335] msg=audit(2020-04-22 10:10:21.787:10) : pid=1 uid=root auid=unset tty=(none) ses=unset subj=kernel comm=systemd exe=/usr/lib/systemd/systemd nl-mcgrp=1 op=connect res=yes

And events from the test suite that just uses close():
type=PROCTITLE msg=audit(2020-04-22 11:47:08.501:442) : proctitle=/usr/bin/perl -w amcast_joinpart/test
type=SYSCALL msg=audit(2020-04-22 11:47:08.501:442) : arch=x86_64 syscall=bind success=yes exit=0 a0=0x7 a1=0x563004378760 a2=0xc a3=0x0 items=0 ppid=815 pid=818 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
type=UNKNOWN[1335] msg=audit(2020-04-22 11:47:08.501:442) : pid=818 uid=root auid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=perl exe=/usr/bin/perl nl-mcgrp=1 op=connect res=yes

type=UNKNOWN[1335] msg=audit(2020-04-22 11:47:08.501:443) : pid=818 uid=root auid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=perl exe=/usr/bin/perl nl-mcgrp=1 op=disconnect res=yes

And the events from the test suite using setsockopt with NETLINK_DROP_MEMBERSHIP:
type=PROCTITLE msg=audit(2020-04-22 11:39:53.291:439) : proctitle=/usr/bin/perl -w amcast_joinpart/test
type=SYSCALL msg=audit(2020-04-22 11:39:53.291:439) : arch=x86_64 syscall=bind success=yes exit=0 a0=0x7 a1=0x5560877c2d20 a2=0xc a3=0x0 items=0 ppid=772 pid=775 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
type=UNKNOWN[1335] msg=audit(2020-04-22 11:39:53.291:439) : pid=775 uid=root auid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=perl exe=/usr/bin/perl nl-mcgrp=1 op=connect res=yes

type=PROCTITLE msg=audit(2020-04-22 11:39:53.292:440) : proctitle=/usr/bin/perl -w amcast_joinpart/test
type=SYSCALL msg=audit(2020-04-22 11:39:53.292:440) : arch=x86_64 syscall=setsockopt success=yes exit=0 a0=0x7 a1=SOL_NETLINK a2=0x2 a3=0x7ffc8366f000 items=0 ppid=772 pid=775 auid=root uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=ttyS0 ses=1 comm=perl exe=/usr/bin/perl subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
type=UNKNOWN[1335] msg=audit(2020-04-22 11:39:53.292:440) : pid=775 uid=root auid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=perl exe=/usr/bin/perl nl-mcgrp=1 op=disconnect res=yes

Please see the upstream issue tracker at
  https://github.com/linux-audit/audit-kernel/issues/28
With the feature description at
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Multicast-Socket-Join-Part
The testsuite support is at
  https://github.com/rgbriggs/audit-testsuite/compare/ghak28-mcast-part-join
  https://github.com/linux-audit/audit-testsuite/pull/93
And the userspace support patch is at
  https://github.com/linux-audit/audit-userspace/pull/114

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-27 18:49:31 -04:00
Ethon Paul
182e073f68 cpu/hotplug: Fix a typo in comment "broadacasted"->"broadcasted"
Signed-off-by: Ethon Paul <ethp@qq.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200417164008.6541-1-ethp@qq.com
2020-04-27 22:42:04 +02:00
Christoph Hellwig
8c1b2bf16d bpf, cgroup: Remove unused exports
Except for a few of the networking hooks called from modular ipv4
or ipv6 code, all of hooks are just called from guaranteed to be
built-in code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20200424064338.538313-2-hch@lst.de
2020-04-27 22:20:22 +02:00
Wei Yongjun
52785b6ae8 kcsan: Use GFP_ATOMIC under spin lock
A spin lock is held in insert_report_filterlist(), so the krealloc()
should use GFP_ATOMIC.  This commit therefore makes this change.

Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:10:10 -07:00
Paul E. McKenney
c9527bebb0 rcutorture: Mark data-race potential for rcu_barrier() test statistics
The n_barrier_successes, n_barrier_attempts, and
n_rcu_torture_barrier_error variables are updated (without access
markings) by the main rcu_barrier() test kthread, and accessed (also
without access markings) by the rcu_torture_stats() kthread.  This of
course can result in KCSAN complaints.

Because the accesses are in diagnostic prints, this commit uses
data_race() to excuse the diagnostic prints from the data race.  If this
were to ever cause bogus statistics prints (for example, due to store
tearing), any misleading information would be disambiguated by the
presence or absence of an rcutorture splat.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely and due to the mild consequences of the
failure, namely a confusing rcutorture console message.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2020-04-27 11:05:13 -07:00
Paul E. McKenney
3b2a473985 rcutorture: Add KCSAN stubs
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:05:13 -07:00
Paul E. McKenney
33b2b93bd8 rcu: Remove self-stack-trace when all quiescent states seen
When all quiescent states have been seen, it is normally the grace-period
kthread that is in trouble.  Although the existing stack trace from
the current CPU might possibly provide useful information, experience
indicates that there is too much noise for this to be worthwhile.

This commit therefore removes this stack trace from the output.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:04:37 -07:00
Paul E. McKenney
8837582517 rcu: When GP kthread is starved, tag idle threads as false positives
If the grace-period kthread is starved, idle threads' extended quiescent
states are not reported.  These idle threads thus wrongly appear to
be blocking the current grace period.  This commit therefore tags such
idle threads as probable false positives when the grace-period kthread
is being starved.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:04:37 -07:00
Paul E. McKenney
654db05cee rcu: Use data_race() for RCU expedited CPU stall-warning prints
Although the accesses used to determine whether or not an expedited
stall should be printed are an integral part of the concurrency algorithm
governing use of the corresponding variables, the values that are simply
printed are ancillary.  As such, it is best to use data_race() for these
accesses in order to provide the greatest latitude in the use of KCSAN
for the other accesses that are an integral part of the algorithm.  This
commit therefore changes the relevant uses of READ_ONCE() to data_race().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:04:37 -07:00
Paul E. McKenney
e5a971d76d ftrace: Use synchronize_rcu_tasks_rude() instead of ftrace_sync()
This commit replaces the schedule_on_each_cpu(ftrace_sync) instances
with synchronize_rcu_tasks_rude().

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
[ paulmck: Make Kconfig adjustments noted by kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:53 -07:00
Paul E. McKenney
25246fc831 rcu-tasks: Allow standalone use of TASKS_{TRACE_,}RCU
This commit allows TASKS_TRACE_RCU to be used independently of TASKS_RCU
and vice versa.

[ paulmck: Fix conditional compilation per kbuild test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:53 -07:00
Paul E. McKenney
7e0669c3e9 rcu-tasks: Add IPI failure count to statistics
This commit adds a failure-return count for smp_call_function_single(),
and adds this to the console messages for rcutorture writer stalls and at
the end of rcutorture testing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:53 -07:00
Paul E. McKenney
edf3775f0a rcu-tasks: Add count for idle tasks on offline CPUs
This commit adds a counter for the number of times the quiescent state
was an idle task associated with an offline CPU, and prints this count
at the end of rcutorture runs and at stall time.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:53 -07:00
Paul E. McKenney
40471509be rcu-tasks: Add rcu_dynticks_zero_in_eqs() effectiveness statistics
This commit adds counts of the number of calls and number of successful
calls to rcu_dynticks_zero_in_eqs(), which are printed at the end
of rcutorture runs and at stall time.  This allows evaluation of the
effectiveness of rcu_dynticks_zero_in_eqs().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
9796e1ae73 rcu-tasks: Make RCU tasks trace also wait for idle tasks
This commit scans the CPUs, adding each CPU's idle task to the list of
tasks that need quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
7e3b70e070 rcu-tasks: Handle the running-offline idle-task special case
The idle task corresponding to an offline CPU can appear to be running
while that CPU is offline.  This commit therefore adds checks for this
situation, treating it as a quiescent state.  Because the tasklist scan
and the holdout-list scan now exclude CPU-hotplug operations, readers
on the CPU-hotplug paths are still waited for.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
81b4a7bc3b rcu-tasks: Disable CPU hotplug across RCU tasks trace scans
This commit disables CPU hotplug across RCU tasks trace scans, which
is a first step towards correctly recognizing idle tasks "running" on
offline CPUs.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
b38f57c1fe rcu-tasks: Allow rcu_read_unlock_trace() under scheduler locks
The rcu_read_unlock_trace() can invoke rcu_read_unlock_trace_special(),
which in turn can call wake_up().  Therefore, if any scheduler lock is
held across a call to rcu_read_unlock_trace(), self-deadlock can occur.
This commit therefore uses the irq_work facility to defer the wake_up()
to a clean environment where no scheduler locks will be held.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: Update #includes for m68k per kbuild test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
7d0c9c50c5 rcu-tasks: Avoid IPIing userspace/idle tasks if kernel is so built
Systems running CPU-bound real-time task do not want IPIs sent to CPUs
executing nohz_full userspace tasks.  Battery-powered systems don't
want IPIs sent to idle CPUs in low-power mode.  Unfortunately, RCU tasks
trace can and will send such IPIs in some cases.

Both of these situations occur only when the target CPU is in RCU
dyntick-idle mode, in other words, when RCU is not watching the
target CPU.  This suggests that CPUs in dyntick-idle mode should use
memory barriers in outermost invocations of rcu_read_lock_trace()
and rcu_read_unlock_trace(), which would allow the RCU tasks trace
grace period to directly read out the target CPU's read-side state.
One challenge is that RCU tasks trace is not targeting a specific
CPU, but rather a task.  And that task could switch from one CPU to
another at any time.

This commit therefore uses try_invoke_on_locked_down_task()
and checks for task_curr() in trc_inspect_reader_notrunning().
When this condition holds, the target task is running and cannot move.
If CONFIG_TASKS_TRACE_RCU_READ_MB=y, the new rcu_dynticks_zero_in_eqs()
function can be used to check if the specified integer (in this case,
t->trc_reader_nesting) is zero while the target CPU remains in that same
dyntick-idle sojourn.  If so, the target task is in a quiescent state.
If not, trc_read_check_handler() must indicate failure so that the
grace-period kthread can take appropriate action or retry after an
appropriate delay, as the case may be.

With this change, given CONFIG_TASKS_TRACE_RCU_READ_MB=y, if a given
CPU remains idle or a given task continues executing in nohz_full mode,
the RCU tasks trace grace-period kthread will detect this without the
need to send an IPI.

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
9ae58d7bd1 rcu-tasks: Add Kconfig option to mediate smp_mb() vs. IPI
This commit provides a new TASKS_TRACE_RCU_READ_MB Kconfig option that
enables use of read-side memory barriers by both rcu_read_lock_trace()
and rcu_read_unlock_trace() when the are executed with the
current->trc_reader_special.b.need_mb flag set.  This flag is currently
never set.  Doing that is the subject of a later commit.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
238dbce39e rcu-tasks: Add grace-period and IPI counts to statistics
This commit adds a grace-period count and a count of IPIs sent since
boot, which is printed in response to rcutorture writer stalls and at
the end of rcutorture testing.  These counts will be used to evaluate
various schemes to reduce the number of IPIs sent.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
276c410448 rcu-tasks: Split ->trc_reader_need_end
This commit splits ->trc_reader_need_end by using the rcu_special union.
This change permits readers to check to see if a memory barrier is
required without any added overhead in the common case where no such
barrier is required.  This commit also adds the read-side checking.
Later commits will add the machinery to properly set the new
->trc_reader_special.b.need_mb field.

This commit also makes rcu_read_unlock_trace_special() tolerate nested
read-side critical sections within interrupt and NMI handlers.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
b0afa0f056 rcu-tasks: Provide boot parameter to delay IPIs until late in grace period
This commit provides a rcupdate.rcu_task_ipi_delay kernel boot parameter
that specifies how old the RCU tasks trace grace period must be before
the grace-period kthread starts sending IPIs.  This delay allows more
tasks to pass through rcu_tasks_qs() quiescent states, thus reducing
(or even eliminating) the number of IPIs that must be sent.

On a short rcutorture test setting this kernel boot parameter to HZ/2
resulted in zero IPIs for all 877 RCU-tasks trace grace periods that
elapsed during that test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
88092d0c99 rcu-tasks: Add a grace-period start time for throttling and debug
This commit adds a place to record the grace-period start in jiffies.
This will be used by later commits for debugging purposes and to throttle
IPIs early in the grace period.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
43766c3ead rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks
This commit makes the calls to rcu_tasks_qs() detect and report
quiescent states for RCU tasks trace.  If the task is in a quiescent
state and if ->trc_reader_checked is not yet set, the task sets its own
->trc_reader_checked.  This will cause the grace-period kthread to
remove it from the holdout list if it still remains there.

[ paulmck: Fix conditional compilation per kbuild test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
af051ca4e4 rcu-tasks: Make rcutorture writer stall output include GP state
This commit adds grace-period state and time to the rcutorture writer
stall output.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:52 -07:00
Paul E. McKenney
e21408ceec rcu-tasks: Add RCU tasks to rcutorture writer stall output
This commit adds state for each RCU-tasks flavor to the rcutorture
writer stall output.  The initial state is minimal, but you have to
start somewhere.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fixes based on feedback from kbuild test robot. ]
2020-04-27 11:03:51 -07:00
Paul E. McKenney
8fd8ca388c rcu-tasks: Move #ifdef into tasks.h
This commit pushes the #ifdef CONFIG_TASKS_RCU_GENERIC from
kernel/rcu/update.c to kernel/rcu/tasks.h in order to improve
readability as more APIs are added.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
4593e772b5 rcu-tasks: Add stall warnings for RCU Tasks Trace
This commit adds RCU CPU stall warnings for RCU Tasks Trace.  These
dump out any tasks blocking the current grace period, as well as any
CPUs that have not responded to an IPI request.  This happens in two
phases, when initially extracting state from the tasks and later when
waiting for any holdout tasks to check in.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
c1a76c0b6a rcutorture: Add torture tests for RCU Tasks Trace
This commit adds the definitions required to torture the tracing flavor
of RCU tasks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
d5f177d35c rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks
Because RCU does not watch exception early-entry/late-exit, idle-loop,
or CPU-hotplug execution, protection of tracing and BPF operations is
needlessly complicated.  This commit therefore adds a variant of
Tasks RCU that:

o	Has explicit read-side markers to allow finite grace periods in
	the face of in-kernel loops for PREEMPT=n builds.  These markers
	are rcu_read_lock_trace() and rcu_read_unlock_trace().

o	Protects code in the idle loop, exception entry/exit, and
	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
	similar to SRCU, but with lighter-weight readers.

o	Avoids expensive read-side instruction, having overhead similar
	to that of Preemptible RCU.

There are of course downsides:

o	The grace-period code can send IPIs to CPUs, even when those
	CPUs are in the idle loop or in nohz_full userspace.  This is
	mitigated by later commits.

o	It is necessary to scan the full tasklist, much as for Tasks RCU.

o	There is a single callback queue guarded by a single lock,
	again, much as for Tasks RCU.  However, those early use cases
	that request multiple grace periods in quick succession are
	expected to do so from a single task, which makes the single
	lock almost irrelevant.  If needed, multiple callback queues
	can be provided using any number of schemes.

Perhaps most important, this variant of RCU does not affect the vanilla
flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
readers can operate from idle, offline, and exception entry/exit in no
way enables rcu_preempt and rcu_sched readers to do so.

The memory ordering was outlined here:
https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/

This effort benefited greatly from off-list discussions of BPF
requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
some of the on-list discussions are captured in the Link: tags below.
In addition, KCSAN was quite helpful in finding some early bugs.

Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>
[ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
[ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
[ paulmck: Fix locking issue reported by rcutorture. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
d01aa2633b rcu-tasks: Code movement to allow more Tasks RCU variants
This commit does nothing but move rcu_tasks_wait_gp() up to a new section
for common code.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
e4fe5dd6f2 rcu-tasks: Further refactor RCU-tasks to allow adding more variants
This commit refactors RCU tasks to allow variants to be added.  These
variants will share the current Tasks-RCU tasklist scan and the holdout
list processing.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
c97d12a63c rcu-tasks: Use unique names for RCU-Tasks kthreads and messages
This commit causes the flavors of RCU Tasks to use different names
for their kthreads and in their console messages.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
3d6e43c75d rcutorture: Add torture tests for RCU Tasks Rude
This commit adds the definitions required to torture the rude flavor of
RCU tasks.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
c84aad7654 rcu-tasks: Add an RCU-tasks rude variant
This commit adds a "rude" variant of RCU-tasks that has as quiescent
states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
and (in theory, anyway) cond_resched().  In other words, RCU-tasks rude
readers are regions of code with preemption disabled, but excluding code
early in the CPU-online sequence and late in the CPU-offline sequence.
Updates make use of IPIs and force an IPI and a context switch on each
online CPU.  This variant is useful in some situations in tracing.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
[ paulmck: Apply EXPORT_SYMBOL_GPL() feedback from Qiujun Huang. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply review feedback from Steve Rostedt. ]
2020-04-27 11:03:51 -07:00
Paul E. McKenney
5873b8a94e rcu-tasks: Refactor RCU-tasks to allow variants to be added
This commit splits out generic processing from RCU-tasks-specific
processing in order to allow additional flavors to be added.  It also
adds a def_bool TASKS_RCU_GENERIC to enable the common RCU-tasks
infrastructure code.

This is primarily, but not entirely, a code-movement commit.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
9cf8fc6fab rcutorture: Add a test for synchronize_rcu_mult()
This commit adds a crude test for synchronize_rcu_mult().  This is
currently a smoke test rather than a high-quality stress test.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:51 -07:00
Paul E. McKenney
07e105158d rcu-tasks: Create struct to hold state information
This commit creates an rcu_tasks struct to hold state information for
RCU Tasks.  This is a preparation commit for adding additional flavors
of Tasks RCU, each of which would have its own rcu_tasks struct.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
eacd6f04a1 rcu-tasks: Move Tasks RCU to its own file
This code-movement-only commit is in preparation for adding an additional
flavor of Tasks RCU, which relies on workqueues to detect grace periods.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
5bef8da66a rcu: Add per-task state to RCU CPU stall warnings
Currently, an RCU-preempt CPU stall warning simply lists the PIDs of
those tasks holding up the current grace period.  This can be helpful,
but more can be even more helpful.

To this end, this commit adds the nesting level, whether the task
thinks it was preempted in its current RCU read-side critical section,
whether RCU core has asked this task for a quiescent state, whether the
expedited-grace-period hint is set, and whether the task believes that
it is on the blocked-tasks list (it must be, or it would not be printed,
but if things are broken, best not to take too much for granted).

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
2beaf3280e sched/core: Add function to sample state of locked-down task
A running task's state can be sampled in a consistent manner (for example,
for diagnostic purposes) simply by invoking smp_call_function_single()
on its CPU, which may be obtained using task_cpu(), then having the
IPI handler verify that the desired task is in fact still running.
However, if the task is not running, this sampling can in theory be done
immediately and directly.  In practice, the task might start running at
any time, including during the sampling period.  Gaining a consistent
sample of a not-running task therefore requires that something be done
to lock down the target task's state.

This commit therefore adds a try_invoke_on_locked_down_task() function
that invokes a specified function if the specified task can be locked
down, returning true if successful and if the specified function returns
true.  Otherwise this function simply returns false.  Given that the
function passed to try_invoke_on_nonrunning_task() might be invoked with
a runqueue lock held, that function had better be quite lightweight.

The function is passed the target task's task_struct pointer and the
argument passed to try_invoke_on_locked_down_task(), allowing easy access
to task state and to a location for further variables to be passed in
and out.

Note that the specified function will be called even if the specified
task is currently running.  The function can use ->on_rq and task_curr()
to quickly and easily determine the task's state, and can return false
if this state is not to the function's liking.  The caller of the
try_invoke_on_locked_down_task() would then see the false return value,
and could take appropriate action, for example, trying again later or
sending an IPI if matters are more urgent.

It is expected that use cases such as the RCU CPU stall warning code will
simply return false if the task is currently running.  However, there are
use cases involving nohz_full CPUs where the specified function might
instead fall back to an alternative sampling scheme that relies on heavier
synchronization (such as memory barriers) in the target task.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
[ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
[ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
66777e5821 rcu-tasks: Use context-switch hook for PREEMPT=y kernels
Currently, the PREEMPT=y version of rcu_note_context_switch() does not
invoke rcu_tasks_qs(), and we need it to in order to keep RCU Tasks
Trace's IPIs down to a dull roar.  This commit therefore enables this
hook.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
ac3caf8274 rcu: Add comments marking transitions between RCU watching and not
It is not as clear as it might be just where in RCU's idle entry/exit
code RCU stops and starts watching the current CPU.  This commit therefore
adds comments calling out the transitions.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
52b1fc3f79 rcutorture: Add test of holding scheduler locks across rcu_read_unlock()
Now that it should be safe to hold scheduler locks across
rcu_read_unlock(), even in cases where the corresponding RCU read-side
critical section might have been preempted and boosted, the commit adds
a test of this capability to rcutorture.  This has been tested on current
mainline (which can deadlock in this situation), and lockdep duly reported
the expected deadlock.  On -rcu, lockdep is silent, thus far, anyway.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Lai Jiangshan
5f5fa7ea89 rcu: Don't use negative nesting depth in __rcu_read_unlock()
Now that RCU flavors have been consolidated, an RCU-preempt
rcu_read_unlock() in an interrupt or softirq handler cannot possibly
end the RCU read-side critical section.  Consider the old vulnerability
involving rcu_read_unlock() being invoked within such a handler that
interrupted an __rcu_read_unlock_special(), in which a wakeup might be
invoked with a scheduler lock held.  Because rcu_read_unlock_special()
no longer does wakeups in such situations, it is no longer necessary
for __rcu_read_unlock() to set the nesting level negative.

This commit therefore removes this recursion-protection code from
__rcu_read_unlock().

[ paulmck: Let rcu_exp_handler() continue to call rcu_report_exp_rdp(). ]
[ paulmck: Adjust other checks given no more negative nesting. ]
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Lai Jiangshan
f0bdf6d473 rcu: Remove unused ->rcu_read_unlock_special.b.deferred_qs field
The ->rcu_read_unlock_special.b.deferred_qs field is set to true in
rcu_read_unlock_special() but never set to false.  This is not
particularly useful, so this commit removes this field.

The only possible justification for this field is to ease debugging
of RCU deferred quiscent states, but the combination of the other
->rcu_read_unlock_special fields plus ->rcu_blocked_node and of course
->rcu_read_lock_nesting should cover debugging needs.  And if this last
proves incorrect, this patch can always be reverted, along with the
required setting of ->rcu_read_unlock_special.b.deferred_qs to false
in rcu_preempt_deferred_qs_irqrestore().

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Lai Jiangshan
07b4a930fc rcu: Don't set nesting depth negative in rcu_preempt_deferred_qs()
Now that RCU flavors have been consolidated, an RCU-preempt
rcu_read_unlock() in an interrupt or softirq handler cannot possibly
end the RCU read-side critical section.  Consider the old vulnerability
involving rcu_preempt_deferred_qs() being invoked within such a handler
that interrupted an extended RCU read-side critical section, in which
a wakeup might be invoked with a scheduler lock held.  Because
rcu_read_unlock_special() no longer does wakeups in such situations,
it is no longer necessary for rcu_preempt_deferred_qs() to set the
nesting level negative.

This commit therefore removes this recursion-protection code from
rcu_preempt_deferred_qs().

[ paulmck: Fix typo in commit log per Steve Rostedt. ]
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
e4453d8a1c rcu: Make rcu_read_unlock_special() safe for rq/pi locks
The scheduler is currently required to hold rq/pi locks across the entire
RCU read-side critical section or not at all.  This is inconvenient and
leaves traps for the unwary, including the author of this commit.

But now that excessively long grace periods enable scheduling-clock
interrupts for holdout nohz_full CPUs, the nohz_full rescue logic in
rcu_read_unlock_special() can be dispensed with.  In other words, the
rcu_read_unlock_special() function can refrain from doing wakeups unless
such wakeups are guaranteed safe.

This commit therefore avoids unsafe wakeups, freeing the scheduler to
hold rq/pi locks across rcu_read_unlock() even if the corresponding RCU
read-side critical section might have been preempted.  This commit also
updates RCU's requirements documentation.

This commit is inspired by a patch from Lai Jiangshan:
https://lore.kernel.org/lkml/20191102124559.1135-2-laijs@linux.alibaba.com
This commit is further intended to be a step towards his goal of permitting
the inlining of RCU-preempt's rcu_read_lock() and rcu_read_unlock().

Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
c76e7e0bce rcu: Add KCSAN stubs to update.c
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros
to move ahead.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:03:50 -07:00
Paul E. McKenney
6be7436d22 rcu: Add rcu_gp_might_be_stalled()
This commit adds rcu_gp_might_be_stalled(), which returns true if there
is some reason to believe that the RCU grace period is stalled.  The use
case is where an RCU free-memory path needs to allocate memory in order
to free it, a situation that should be avoided where possible.

But where it is necessary, there is always the alternative of using
synchronize_rcu() to wait for a grace period in order to avoid the
allocation.  And if the grace period is stalled, allocating memory to
asynchronously wait for it is a bad idea of epic proportions: Far better
to let others use the memory, because these others might actually be
able to free that memory before the grace period ends.

Thus, rcu_gp_might_be_stalled() can be used to help decide whether
allocating memory on an RCU free path is a semi-reasonable course
of action.

Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:02:50 -07:00
Joel Fernandes (Google)
a6a82ce18b rcu/tree: Count number of batched kfree_rcu() locklessly
We can relax the correctness of counting of number of queued objects in
favor of not hurting performance, by locklessly sampling per-cpu
counters. This should be Ok since under high memory pressure, it should not
matter if we are off by a few objects while counting. The shrinker will
still do the reclaim.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Remove unused "flags" variable. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:02:50 -07:00
Joel Fernandes (Google)
9154244c1a rcu/tree: Add a shrinker to prevent OOM due to kfree_rcu() batching
To reduce grace periods and improve kfree() performance, we have done
batching recently dramatically bringing down the number of grace periods
while giving us the ability to use kfree_bulk() for efficient kfree'ing.

However, this has increased the likelihood of OOM condition under heavy
kfree_rcu() flood on small memory systems. This patch introduces a
shrinker which starts grace periods right away if the system is under
memory pressure due to existence of objects that have still not started
a grace period.

With this patch, I do not observe an OOM anymore on a system with 512MB
RAM and 8 CPUs, with the following rcuperf options:

rcuperf.kfree_loops=20000 rcuperf.kfree_alloc_num=8000
rcuperf.kfree_rcu_test=1 rcuperf.kfree_mult=2

Otherwise it easily OOMs with the above parameters.

NOTE:
1. On systems with no memory pressure, the patch has no effect as intended.
2. In the future, we can use this same mechanism to prevent grace periods
   from happening even more, by relying on shrinkers carefully.

Cc: urezki@gmail.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:02:50 -07:00
Joel Fernandes (Google)
f87dc80800 rcuperf: Add ability to increase object allocation size
This allows us to increase memory pressure dynamically using a new
rcuperf boot command line parameter called 'rcumult'.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:02:50 -07:00
Paul E. McKenney
e2f3ccfa62 rcu: Convert rcu_nohz_full_cpu() ULONG_CMP_LT() to time_before()
This commit converts the ULONG_CMP_LT() in rcu_nohz_full_cpu() to
time_before() to reflect the fact that it is comparing a timestamp to
the jiffies counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:17 -07:00
Paul E. McKenney
7b2413111a rcu: Convert rcu_initiate_boost() ULONG_CMP_GE() to time_after()
This commit converts the ULONG_CMP_GE() in rcu_initiate_boost() to
time_after() to reflect the fact that it is comparing a timestamp to
the jiffies counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
29ffebc5fc rcu: Convert ULONG_CMP_GE() to time_after() for jiffy comparison
This commit converts the ULONG_CMP_GE() in rcu_gp_fqs_loop() to
time_after() to reflect the fact that it is comparing a timestamp to
the jiffies counter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Jules Irenge
da44cd6c8e rcu: Replace 1 by true
Coccinelle reports a warning at use_softirq declaration

WARNING: Assignment of 0/1 to bool variable

The root cause is
use_softirq a variable of bool type is initialised with the integer 1
Replacing 1 with value true solve the issue.

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Jules Irenge
a66dbda789 rcu: Replace assigned pointer ret value by corresponding boolean value
Coccinelle reports warnings at rcu_read_lock_held_common()

WARNING: Assignment of 0/1 to bool variable

To fix this,
the assigned  pointer ret values are replaced by corresponding boolean value.
Given that ret is a pointer of bool type

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
62ae19511f rcu: Mark rcu_state.gp_seq to detect more concurrent writes
The rcu_state structure's gp_seq field is only to be modified by the RCU
grace-period kthread, which is single-threaded.  This commit therefore
enlists KCSAN's help in enforcing this restriction.  This commit applies
KCSAN-specific primitives, so cannot go upstream until KCSAN does.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Mauro Carvalho Chehab
c28d5c09d0 rcu: Get rid of some doc warnings in update.c
This commit escapes *ret, because otherwise the documentation system
thinks that this is an incomplete emphasis block:

	./kernel/rcu/update.c:65: WARNING: Inline emphasis start-string without end-string.
	./kernel/rcu/update.c:65: WARNING: Inline emphasis start-string without end-string.
	./kernel/rcu/update.c:70: WARNING: Inline emphasis start-string without end-string.
	./kernel/rcu/update.c:82: WARNING: Inline emphasis start-string without end-string.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Zhaolong Zhang
fcbcc0e700 rcu: Fix the (t=0 jiffies) false positive
It is possible that an over-long grace period will end while the RCU
CPU stall warning message is printing.  In this case, the estimate of
the offending grace period's duration can be erroneous due to refetching
of rcu_state.gp_start, which will now be the time of the newly started
grace period.  Computation of this duration clearly needs to use the
start time for the old over-long grace period, not the fresh new one.
This commit avoids such errors by causing both print_other_cpu_stall() and
print_cpu_stall() to reuse the value previously fetched by their caller.

Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
1fca4d12f4 rcu: Expedite first two FQS scans under callback-overload conditions
Even if some CPUs have excessive numbers of callbacks, RCU's grace-period
kthread will still wait normally between successive force-quiescent-state
scans.  The first two are the most important, as they are the ones that
enlist aid from the scheduler when overloaded.  This commit therefore
omits the wait before the first and the second force-quiescent-state
scan under callback-overload conditions.

This approach was inspired by a discussion with Jeff Roberson.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
47fbb07453 rcu: Use data_race() for RCU CPU stall-warning prints
Although the accesses used to determine whether or not a stall should
be printed are an integral part of the concurrency algorithm governing
use of the corresponding variables, the values that are simply printed
are ancillary.  As such, it is best to use data_race() for these accesses
in order to provide the greatest latitude in the use of KCSAN for the
other accesses that are an integral part of the algorithm.  This commit
therefore changes the relevant uses of READ_ONCE() to data_race().

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
5822b8126f rcu: Add WRITE_ONCE() to rcu_node ->boost_tasks
The rcu_node structure's ->boost_tasks field is read locklessly, so this
commit adds the WRITE_ONCE() to an update in order to provide proper
documentation and READ_ONCE()/WRITE_ONCE() pairing.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
b68c614651 srcu: Add data_race() to ->srcu_lock_count and ->srcu_unlock_count arrays
The srcu_data structure's ->srcu_lock_count and ->srcu_unlock_count arrays
are read and written locklessly, so this commit adds the data_race()
to the diagnostic-print loads from these arrays in order mark them as
known and approved data-racy accesses.

This data race was reported by KCSAN. Not appropriate for backporting due
to failure being unlikely and due to this being used only by rcutorture.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
065a6db12a rcu: Add READ_ONCE and data_race() to rcu_node ->boost_tasks
The rcu_node structure's ->boost_tasks field is read locklessly, so this
commit adds the READ_ONCE() to one load in order to avoid destructive
compiler optimizations.  The other load is from a diagnostic print,
so data_race() suffices.

This data race was reported by KCSAN.  Not appropriate for backporting
due to failure being unlikely.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:16 -07:00
Paul E. McKenney
314eeb43e5 rcu: Add *_ONCE() and data_race() to rcu_node ->exp_tasks plus locking
There are lockless loads from the rcu_node structure's ->exp_tasks field,
so this commit causes all stores to use WRITE_ONCE() and all lockless
loads to use READ_ONCE() or data_race(), with the latter for debug
prints.  This code also did a unprotected traversal of the linked list
pointed into by ->exp_tasks, so this commit also acquires the rcu_node
structure's ->lock to properly protect this traversal.  This list was
traversed unprotected only when printing an RCU CPU stall warning for
an expedited grace period, so the odds of seeing this in production are
not all that high.

This data race was reported by KCSAN.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:15 -07:00
Paul E. McKenney
2f08469563 rcu: Mark rcu_state.ncpus to detect concurrent writes
The rcu_state structure's ncpus field is only to be modified by the
CPU-hotplug CPU-online code path, which is single-threaded.  This
commit therefore enlists KCSAN's help in enforcing this restriction.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:15 -07:00
Paul E. McKenney
4f58820fd7 srcu: Add KCSAN stubs
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:01:15 -07:00
Paul E. McKenney
353159365e rcu: Add KCSAN stubs
This commit adds stubs for KCSAN's data_race(), ASSERT_EXCLUSIVE_WRITER(),
and ASSERT_EXCLUSIVE_ACCESS() macros to allow code using these macros to
move ahead.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-27 11:00:06 -07:00
Alex Shi
23b5ae2e8e locking/rtmutex: Remove unused rt_mutex_cmpxchg_relaxed()
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1587135032-188866-1-git-send-email-alex.shi@linux.alibaba.com
2020-04-27 12:26:40 +02:00
Dexuan Cui
2351f8d295 PM: hibernate: Freeze kernel threads in software_resume()
Currently the kernel threads are not frozen in software_resume(), so
between dpm_suspend_start(PMSG_QUIESCE) and resume_target_kernel(),
system_freezable_power_efficient_wq can still try to submit SCSI
commands and this can cause a panic since the low level SCSI driver
(e.g. hv_storvsc) has quiesced the SCSI adapter and can not accept
any SCSI commands: https://lkml.org/lkml/2020/4/10/47

At first I posted a fix (https://lkml.org/lkml/2020/4/21/1318) trying
to resolve the issue from hv_storvsc, but with the help of
Bart Van Assche, I realized it's better to fix software_resume(),
since this looks like a generic issue, not only pertaining to SCSI.

Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-04-27 10:30:30 +02:00
Christoph Hellwig
32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Christoph Hellwig
f461d2dcd5 sysctl: avoid forward declarations
Move the sysctl tables to the end of the file to avoid lots of pointless
forward declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:26 -04:00
Christoph Hellwig
2374c09b1c sysctl: remove all extern declaration from sysctl.c
Extern declarations in .c files are a bad style and can lead to
mismatches.  Use existing definitions in headers where they exist,
and otherwise move the external declarations to suitable header
files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:06:53 -04:00
Christoph Hellwig
26363af564 mm: remove watermark_boost_factor_sysctl_handler
watermark_boost_factor_sysctl_handler is just a pointless wrapper for
proc_dointvec_minmax, so remove it and use proc_dointvec_minmax
directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:06:52 -04:00
Andrii Nakryiko
6f8a57ccf8 bpf: Make verifier log more relevant by default
To make BPF verifier verbose log more releavant and easier to use to debug
verification failures, "pop" parts of log that were successfully verified.
This has effect of leaving only verifier logs that correspond to code branches
that lead to verification failure, which in practice should result in much
shorter and more relevant verifier log dumps. This behavior is made the
default behavior and can be overriden to do exhaustive logging by specifying
BPF_LOG_LEVEL2 log level.

Using BPF_LOG_LEVEL2 to disable this behavior is not ideal, because in some
cases it's good to have BPF_LOG_LEVEL2 per-instruction register dump
verbosity, but still have only relevant verifier branches logged. But for this
patch, I didn't want to add any new flags. It might be worth-while to just
rethink how BPF verifier logging is performed and requested and streamline it
a bit. But this trimming of successfully verified branches seems to be useful
and a good default behavior.

To test this, I modified runqslower slightly to introduce read of
uninitialized stack variable. Log (**truncated in the middle** to save many
lines out of this commit message) BEFORE this change:

; int handle__sched_switch(u64 *ctx)
0: (bf) r6 = r1
; struct task_struct *prev = (struct task_struct *)ctx[1];
1: (79) r1 = *(u64 *)(r6 +8)
func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
2: (b7) r2 = 0
; struct event event = {};
3: (7b) *(u64 *)(r10 -24) = r2
last_idx 3 first_idx 0
regs=4 stack=0 before 2: (b7) r2 = 0
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
; if (prev->state == TASK_RUNNING)

[ ... instruction dump from insn #7 through #50 are cut out ... ]

51: (b7) r2 = 16
52: (85) call bpf_get_current_comm#16
last_idx 52 first_idx 42
regs=4 stack=0 before 51: (b7) r2 = 16
; bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
53: (bf) r1 = r6
54: (18) r2 = 0xffff8881f3868800
56: (18) r3 = 0xffffffff
58: (bf) r4 = r7
59: (b7) r5 = 32
60: (85) call bpf_perf_event_output#25
last_idx 60 first_idx 53
regs=20 stack=0 before 59: (b7) r5 = 32
61: (bf) r2 = r10
; event.pid = pid;
62: (07) r2 += -16
; bpf_map_delete_elem(&start, &pid);
63: (18) r1 = 0xffff8881f3868000
65: (85) call bpf_map_delete_elem#3
; }
66: (b7) r0 = 0
67: (95) exit

from 44 to 66: safe

from 34 to 66: safe

from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; bpf_map_update_elem(&start, &pid, &ts, 0);
28: (bf) r2 = r10
;
29: (07) r2 += -16
; tsp = bpf_map_lookup_elem(&start, &pid);
30: (18) r1 = 0xffff8881f3868000
32: (85) call bpf_map_lookup_elem#1
invalid indirect read from stack off -16+0 size 4
processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4

Notice how there is a successful code path from instruction 0 through 67, few
successfully verified jumps (44->66, 34->66), and only after that 11->28 jump
plus error on instruction #32.

AFTER this change (full verifier log, **no truncation**):

; int handle__sched_switch(u64 *ctx)
0: (bf) r6 = r1
; struct task_struct *prev = (struct task_struct *)ctx[1];
1: (79) r1 = *(u64 *)(r6 +8)
func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct'
2: (b7) r2 = 0
; struct event event = {};
3: (7b) *(u64 *)(r10 -24) = r2
last_idx 3 first_idx 0
regs=4 stack=0 before 2: (b7) r2 = 0
4: (7b) *(u64 *)(r10 -32) = r2
5: (7b) *(u64 *)(r10 -40) = r2
6: (7b) *(u64 *)(r10 -48) = r2
; if (prev->state == TASK_RUNNING)
7: (79) r2 = *(u64 *)(r1 +16)
; if (prev->state == TASK_RUNNING)
8: (55) if r2 != 0x0 goto pc+19
 R1_w=ptr_task_struct(id=0,off=0,imm=0) R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; trace_enqueue(prev->tgid, prev->pid);
9: (61) r1 = *(u32 *)(r1 +1184)
10: (63) *(u32 *)(r10 -4) = r1
; if (!pid || (targ_pid && targ_pid != pid))
11: (15) if r1 == 0x0 goto pc+16

from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000
; bpf_map_update_elem(&start, &pid, &ts, 0);
28: (bf) r2 = r10
;
29: (07) r2 += -16
; tsp = bpf_map_lookup_elem(&start, &pid);
30: (18) r1 = 0xffff8881db3ce800
32: (85) call bpf_map_lookup_elem#1
invalid indirect read from stack off -16+0 size 4
processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4

Notice how in this case, there are 0-11 instructions + jump from 11 to
28 is recorded + 28-32 instructions with error on insn #32.

test_verifier test runner was updated to specify BPF_LOG_LEVEL2 for
VERBOSE_ACCEPT expected result due to potentially "incomplete" success verbose
log at BPF_LOG_LEVEL1.

On success, verbose log will only have a summary of number of processed
instructions, etc, but no branch tracing log. Having just a last succesful
branch tracing seemed weird and confusing. Having small and clean summary log
in success case seems quite logical and nice, though.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200423195850.1259827-1-andriin@fb.com
2020-04-26 09:47:37 -07:00
Maciej Żenczykowski
71d1921477 bpf: add bpf_ktime_get_boot_ns()
On a device like a cellphone which is constantly suspending
and resuming CLOCK_MONOTONIC is not particularly useful for
keeping track of or reacting to external network events.
Instead you want to use CLOCK_BOOTTIME.

Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns()
based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-04-26 09:43:05 -07:00
Maciej Żenczykowski
082b57e3eb net: bpf: Make bpf_ktime_get_ns() available to non GPL programs
The entire implementation is in kernel/bpf/helpers.c:

BPF_CALL_0(bpf_ktime_get_ns) {
       /* NMI safe access to clock monotonic */
       return ktime_get_mono_fast_ns();
}

const struct bpf_func_proto bpf_ktime_get_ns_proto = {
       .func           = bpf_ktime_get_ns,
       .gpl_only       = false,
       .ret_type       = RET_INTEGER,
};

and this was presumably marked GPL due to kernel/time/timekeeping.c:
  EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);

and while that may make sense for kernel modules (although even that
is doubtful), there is currently AFAICT no other source of time
available to ebpf.

Furthermore this is really just equivalent to clock_gettime(CLOCK_MONOTONIC)
which is exposed to userspace (via vdso even to make it performant)...

As such, I see no reason to keep the GPL restriction.
(In the future I'd like to have access to time from Apache licensed ebpf code)

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-04-26 09:04:14 -07:00
Stanislav Fomichev
6890896bd7 bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=n
linux-next build bot reported compile issue [1] with one of its
configs. It looks like when we have CONFIG_NET=n and
CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto
definition (from net/core/filter.c) in cgroup_base_func_proto.

I'm reshuffling the code a bit to make it work. The common helpers
are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is
exported from there.
Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing
bpf_user_rnd_u32.

[1] https://lore.kernel.org/linux-next/CAKH8qBsBvKHswiX1nx40LgO+BGeTmb1NX8tiTttt_0uu6T3dCA@mail.gmail.com/T/#mff8b0c083314c68c2e2ef0211cb11bc20dc13c72

Fixes: 0456ea170c ("bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200424235941.58382-1-sdf@google.com
2020-04-26 08:53:13 -07:00
Stanislav Fomichev
0456ea170c bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}
Currently the following prog types don't fall back to bpf_base_func_proto()
(instead they have cgroup_base_func_proto which has a limited set of
helpers from bpf_base_func_proto):
* BPF_PROG_TYPE_CGROUP_DEVICE
* BPF_PROG_TYPE_CGROUP_SYSCTL
* BPF_PROG_TYPE_CGROUP_SOCKOPT

I don't see any specific reason why we shouldn't use bpf_base_func_proto(),
every other type of program (except bpf-lirc and, understandably, tracing)
use it, so let's fall back to bpf_base_func_proto for those prog types
as well.

This basically boils down to adding access to the following helpers:
* BPF_FUNC_get_prandom_u32
* BPF_FUNC_get_smp_processor_id
* BPF_FUNC_get_numa_node_id
* BPF_FUNC_tail_call
* BPF_FUNC_ktime_get_ns
* BPF_FUNC_spin_lock (CAP_SYS_ADMIN)
* BPF_FUNC_spin_unlock (CAP_SYS_ADMIN)
* BPF_FUNC_jiffies64 (CAP_SYS_ADMIN)

I've also added bpf_perf_event_output() because it's really handy for
logging and debugging.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200420174610.77494-1-sdf@google.com
2020-04-26 08:40:01 -07:00
Mao Wenan
b0b3fb6759 bpf: Remove set but not used variable 'dst_known'
Fixes gcc '-Wunused-but-set-variable' warning:

kernel/bpf/verifier.c:5603:18: warning: variable ‘dst_known’
set but not used [-Wunused-but-set-variable], delete this
variable.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200418013735.67882-1-maowenan@huawei.com
2020-04-26 08:40:01 -07:00
David S. Miller
d483389678 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Simple overlapping changes to linux/vermagic.h

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:18:53 -07:00
Al Viro
ce5155c4f8 compat sysinfo(2): don't bother with field-by-field copyout
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-25 18:06:05 -04:00
Linus Torvalds
b2768df24e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull pid leak fix from Eric Biederman:
 "Oleg noticed that put_pid(thread_pid) was not getting called when proc
  was not compiled in.

  Let's get that fixed before 5.7 is released and causes problems for
  anyone"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Put thread_pid in release_task not proc_flush_pid
2020-04-25 12:25:32 -07:00
Linus Torvalds
05db498ad9 Misc fixes:
- an uclamp accounting fix
  - three frequency invariance fixes and a readability improvement
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl6kAS8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g0jg//Su8CnWMzOvs4mzqUvzb3OHV6PeQ8BZva
 yI0h8z8V3S33LchwXjb6FI4VulGaGPwLD7tPZtdAdz+wTCnrhw5Rlgfk2thjROW2
 XnWxUACZrAbcH5H88PF0rVp3Z8oaygwlFZCUhvJxLUOgVi4oipNr+0ZNZVATGwO+
 wpCheVKIty6hlMRmNamUDNOB15xFRvXGSQ+kn0N2h/XIvke5f89AJ7uOgTuZ42Ne
 m1vkgX7J29yieLt4yY6odgxlcqlFNAKegpzaadWkEPNOyqkG29pOKwBX7LpuDRVS
 8jd5vo65snKDEQuBkG/CfActJR3GpNT5CVx8wzft7nDb81sEPEPb5sCCURMv5Ig3
 UpEpvzCqYokC2z+ourjtugvilmHq6odwW9XbD/a8i24X+fo13oPg72EVMF7+PLuL
 xZZfhxuQ3hcUGB+H83COEiA8XsNcFxCk7hcHhPyPZeagbGWUTozrwRn/JItAp5Bv
 xkKmqefOu0bZYDGKwP5fMJZ5BmNiWKNw6PyH7lfzL8Ve6dKXlMWMre5cwEUJXPUe
 scpPjvokvZo7C4FTy8U/7cAZlmVy27Y9Ljyf9nROWLp4KewznP3FBdGEv2+IE8uu
 m90vpYXv3Y/4JsLyxg+MMkhnOb26e8vFh6roWQxtPluhWyFlTilmBYovLmFz2RBO
 eqpS9MVBuRs=
 =5Wor
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Misc fixes:

   - an uclamp accounting fix

   - three frequency invariance fixes and a readability improvement"

* tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix reset-on-fork from RT with uclamp
  x86, sched: Move check for CPU type to caller function
  x86, sched: Don't enable static key when starting secondary CPUs
  x86, sched: Account for CPUs with less than 4 cores in freq. invariance
  x86, sched: Bail out of frequency invariance if base frequency is unknown
2020-04-25 12:11:47 -07:00
Linus Torvalds
e18588005d Two changes:
- fix exit event records
  - extend x86 PMU driver enumeration to add Intel Jasper Lake CPU support.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl6j/jURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hpjg/+L0FTvxUj42qaknksSGhPrIa6aKNSfsHe
 zSG4WsN59E8XUOuB9SV3D/VDJ7rX7w8jEub/TpWqLJll+UiPE3B0oLOwlVhX+8nc
 fDfIqONA6ZoKvfMG9HUXPUo1Lr2+RtcF06hZFWx56g2ijqe5da1CdniJUv1YPb5s
 K4HFU6Xuitih5QrYoLOiATQdp0/W1cjG36irF4svmjrxXotmeWsp7SDlAsAP2hgw
 D7VchYMVs2bWtVL4GyBkq/+EOhvHwp5PTjF3yz7Scy9+CFitL9Bp5DGkaBolpszK
 mUKwstXjbePw28r0jlfLJaptti2ZTFd5r4Ywqyh374ct8JbsdLR77v9Uo6M9VHu8
 9la9cmz+KUnTtz2Bl2dkj2DClh+p4k9VbRjTyrAIobo6WbX8hTYKJmLn/ehvOWqU
 nPJL1bRtkx4s4tIS+oXnVOOSYdcWEvcbduxmuh1eRP3wlDb06AUZCR6JaUErd6bd
 oYFwrZg9vncscKEQM7boQI8f6PhwloZFnGPbPdXgCNarFdCEugCc57s2NvG4BUFo
 WykXYcOGxoANO3F478e/h52cjOoZNk0trdlAk9z957GcI+g+xwoMXWZf8nMUrlWx
 qrtJmEJlde5sN+5j1X75wjJovOIXBmZH9yXWYgZ+kkmASpqeKQvRkbH+eZBBDBuU
 EHNMzUenU0o=
 =pEo5
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "Two changes:

   - fix exit event records

   - extend x86 PMU driver enumeration to add Intel Jasper Lake CPU
     support"

* tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: fix parent pid/tid in task exit events
  perf/x86/cstate: Add Jasper Lake CPU support
2020-04-25 12:08:24 -07:00
Peter Collingbourne
298f3db6ee dma-contiguous: fix comment for dma_release_from_contiguous
Commit 90ae409f9e ("dma-direct: fix zone selection
after an unaddressable CMA allocation") changed the logic in
dma_release_from_contiguous to remove the normal pages fallback path,
but did not update the comment. Fix that.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-25 13:17:06 +02:00
David Rientjes
1d659236fb dma-pool: scale the default DMA coherent pool size with memory capacity
When AMD memory encryption is enabled, some devices may use more than
256KB/sec from the atomic pools.  It would be more appropriate to scale
the default size based on memory capacity unless the coherent_pool
option is used on the kernel command line.

This provides a slight optimization on initial expansion and is deemed
appropriate due to the increased reliance on the atomic pools.  Note that
the default size of 128KB per pool will normally be larger than the
single coherent pool implementation since there are now up to three
coherent pools (DMA, DMA32, and kernel).

Note that even prior to this patch, coherent_pool= for sizes larger than
1 << (PAGE_SHIFT + MAX_ORDER-1) can fail.  With new dynamic expansion
support, this would be trivially extensible to allow even larger initial
sizes.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-25 13:17:06 +02:00
David Rientjes
2edc5bb3c5 dma-pool: add pool sizes to debugfs
The atomic DMA pools can dynamically expand based on non-blocking
allocations that need to use it.

Export the sizes of each of these pools, in bytes, through debugfs for
measurement.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Rientjes <rientjes@google.com>
[hch: remove the !CONFIG_DEBUG_FS stubs]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-25 13:17:05 +02:00
David Rientjes
76a19940bd dma-direct: atomic allocations must come from atomic coherent pools
When a device requires unencrypted memory and the context does not allow
blocking, memory must be returned from the atomic coherent pools.

This avoids the remap when CONFIG_DMA_DIRECT_REMAP is not enabled and the
config only requires CONFIG_DMA_COHERENT_POOL.  This will be used for
CONFIG_AMD_MEM_ENCRYPT in a subsequent patch.

Keep all memory in these pools unencrypted.  When set_memory_decrypted()
fails, this prohibits the memory from being added.  If adding memory to
the genpool fails, and set_memory_encrypted() subsequently fails, there
is no alternative other than leaking the memory.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-25 13:17:05 +02:00
David Rientjes
54adadf9b0 dma-pool: dynamically expanding atomic pools
When an atomic pool becomes fully depleted because it is now relied upon
for all non-blocking allocations through the DMA API, allow background
expansion of each pool by a kworker.

When an atomic pool has less than the default size of memory left, kick
off a kworker to dynamically expand the pool in the background.  The pool
is doubled in size, up to MAX_ORDER-1.  If memory cannot be allocated at
the requested order, smaller allocation(s) are attempted.

This allows the default size to be kept quite low when one or more of the
atomic pools is not used.

Allocations for lowmem should also use GFP_KERNEL for the benefits of
reclaim, so use GFP_KERNEL | GFP_DMA and GFP_KERNEL | GFP_DMA32 for
lowmem allocations.

This also allows __dma_atomic_pool_init() to return a pointer to the pool
to make initialization cleaner.

Also switch over some node ids to the more appropriate NUMA_NO_NODE.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-25 13:17:02 +02:00
Linus Torvalds
ab51cac00e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix memory leak in netfilter flowtable, from Roi Dayan.

 2) Ref-count leaks in netrom and tipc, from Xiyu Yang.

 3) Fix warning when mptcp socket is never accepted before close, from
    Florian Westphal.

 4) Missed locking in ovs_ct_exit(), from Tonghao Zhang.

 5) Fix large delays during PTP synchornization in cxgb4, from Rahul
    Lakkireddy.

 6) team_mode_get() can hang, from Taehee Yoo.

 7) Need to use kvzalloc() when allocating fw tracer in mlx5 driver,
    from Niklas Schnelle.

 8) Fix handling of bpf XADD on BTF memory, from Jann Horn.

 9) Fix BPF_STX/BPF_B encoding in x86 bpf jit, from Luke Nelson.

10) Missing queue memory release in iwlwifi pcie code, from Johannes
    Berg.

11) Fix NULL deref in macvlan device event, from Taehee Yoo.

12) Initialize lan87xx phy correctly, from Yuiko Oshino.

13) Fix looping between VRF and XFRM lookups, from David Ahern.

14) etf packet scheduler assumes all sockets are full sockets, which is
    not necessarily true. From Eric Dumazet.

15) Fix mptcp data_fin handling in RX path, from Paolo Abeni.

16) fib_select_default() needs to handle nexthop objects, from David
    Ahern.

17) Use GFP_ATOMIC under spinlock in mac80211_hwsim, from Wei Yongjun.

18) vxlan and geneve use wrong nlattr array, from Sabrina Dubroca.

19) Correct rx/tx stats in bcmgenet driver, from Doug Berger.

20) BPF_LDX zero-extension is encoded improperly in x86_32 bpf jit, fix
    from Luke Nelson.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (100 commits)
  selftests/bpf: Fix a couple of broken test_btf cases
  tools/runqslower: Ensure own vmlinux.h is picked up first
  bpf: Make bpf_link_fops static
  bpftool: Respect the -d option in struct_ops cmd
  selftests/bpf: Add test for freplace program with expected_attach_type
  bpf: Propagate expected_attach_type when verifying freplace programs
  bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd
  bpf, x86_32: Fix logic error in BPF_LDX zero-extension
  bpf, x86_32: Fix clobbering of dst for BPF_JSET
  bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension
  bpf: Fix reStructuredText markup
  net: systemport: suppress warnings on failed Rx SKB allocations
  net: bcmgenet: suppress warnings on failed Rx SKB allocations
  macsec: avoid to set wrong mtu
  mac80211: sta_info: Add lockdep condition for RCU list usage
  mac80211: populate debugfs only after cfg80211 init
  net: bcmgenet: correct per TX/RX ring statistics
  net: meth: remove spurious copyright text
  net: phy: bcm84881: clear settings on link down
  chcr: Fix CPU hard lockup
  ...
2020-04-24 19:17:30 -07:00
Zou Wei
6f302bfb22 bpf: Make bpf_link_fops static
Fix the following sparse warning:

kernel/bpf/syscall.c:2289:30: warning: symbol 'bpf_link_fops' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1587609160-117806-1-git-send-email-zou_wei@huawei.com
2020-04-24 17:42:44 -07:00
Toke Høiland-Jørgensen
03f87c0b45 bpf: Propagate expected_attach_type when verifying freplace programs
For some program types, the verifier relies on the expected_attach_type of
the program being verified in the verification process. However, for
freplace programs, the attach type was not propagated along with the
verifier ops, so the expected_attach_type would always be zero for freplace
programs.

This in turn caused the verifier to sometimes make the wrong call for
freplace programs. For all existing uses of expected_attach_type for this
purpose, the result of this was only false negatives (i.e., freplace
functions would be rejected by the verifier even though they were valid
programs for the target they were replacing). However, should a false
positive be introduced, this can lead to out-of-bounds accesses and/or
crashes.

The fix introduced in this patch is to propagate the expected_attach_type
to the freplace program during verification, and reset it after that is
done.

Fixes: be8704ff07 ("bpf: Introduce dynamic program extensions")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158773526726.293902.13257293296560360508.stgit@toke.dk
2020-04-24 17:34:30 -07:00
Andrii Nakryiko
4adb7a4a15 bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd
Fix bug of not putting bpf_link in LINK_UPDATE command.
Also enforce zeroed old_prog_fd if no BPF_F_REPLACE flag is specified.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200424052045.4002963-1-andriin@fb.com
2020-04-24 17:27:02 -07:00
Eric W. Biederman
6ade99ec61 proc: Put thread_pid in release_task not proc_flush_pid
Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.

Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.

When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.

Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-24 15:49:00 -05:00
Linus Torvalds
da5de55d17 A few tracing fixes:
- Two fixes that fix memory leaks detected by kmemleak
  - Removal of some dead code
  - A few local functions turned to static
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXqIivBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qswYAQDEH+T80JHD1XBPpqWw6JBKvPph7moz
 AsjasFiX3d5T2AD+JvNMpZntTtZPWz8+V+RqbU7EcBFD9qCNIxaZXaECOAw=
 =Cm2g
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A few tracing fixes:

   - Two fixes for memory leaks detected by kmemleak

   - Removal of some dead code

   - A few local functions turned static"

* tag 'trace-v5.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Convert local functions in tracing_map.c to static
  tracing: Remove DECLARE_TRACE_NOARGS
  ftrace: Fix memory leak caused by not freeing entry in unregister_ftrace_direct()
  tracing: Fix memory leaks in trace_events_hist.c
2020-04-24 12:39:21 -07:00
Linus Torvalds
b4f633221f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull SIGCHLD fix from Eric Biederman:
 "Christof Meerwald reported that do_notify_parent has not been
  successfully populating si_pid and si_uid for multi-threaded
  processes.

  This is the one-liner fix. Strictly speaking a one-liner plus
  comment"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Avoid corrupting si_pid and si_uid in do_notify_parent
2020-04-23 13:30:18 -07:00
Ioana Ciornei
788f87ac60 xdp: export the DEV_MAP_BULK_SIZE macro
Export the DEV_MAP_BULK_SIZE macro to the header file so that drivers
can directly use it as the maximum number of xdp_frames received in the
.ndo_xdp_xmit() callback.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 20:11:29 -07:00
Jason Yan
d013496f99 tracing: Convert local functions in tracing_map.c to static
Fix the following sparse warning:

kernel/trace/tracing_map.c:286:6: warning: symbol
'tracing_map_array_clear' was not declared. Should it be static?
kernel/trace/tracing_map.c:297:6: warning: symbol
'tracing_map_array_free' was not declared. Should it be static?
kernel/trace/tracing_map.c:319:26: warning: symbol
'tracing_map_array_alloc' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20200410073312.38855-1-yanaijie@huawei.com

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-04-22 22:07:26 -04:00
Steven Rostedt (VMware)
353da87921 ftrace: Fix memory leak caused by not freeing entry in unregister_ftrace_direct()
kmemleak reported the following:

unreferenced object 0xffff90d47127a920 (size 32):
  comm "modprobe", pid 1766, jiffies 4294792031 (age 162.568s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 22 01 00 00 00 00 ad de  ........".......
    00 78 12 a7 ff ff ff ff 00 00 b6 c0 ff ff ff ff  .x..............
  backtrace:
    [<00000000bb79e72e>] register_ftrace_direct+0xcb/0x3a0
    [<00000000295e4f79>] do_one_initcall+0x72/0x340
    [<00000000873ead18>] do_init_module+0x5a/0x220
    [<00000000974d9de5>] load_module+0x2235/0x2550
    [<0000000059c3d6ce>] __do_sys_finit_module+0xc0/0x120
    [<000000005a8611b4>] do_syscall_64+0x60/0x230
    [<00000000a0cdc49e>] entry_SYSCALL_64_after_hwframe+0x49/0xb3

The entry used to save the direct descriptor needs to be freed
when unregistering.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-04-22 22:04:27 -04:00
Vamshi K Sthambamkadi
9da73974eb tracing: Fix memory leaks in trace_events_hist.c
kmemleak report 1:
    [<9092c50b>] kmem_cache_alloc_trace+0x138/0x270
    [<05a2c9ed>] create_field_var+0xcf/0x180
    [<528a2d68>] action_create+0xe2/0xc80
    [<63f50b61>] event_hist_trigger_func+0x15b5/0x1920
    [<28ea5d3d>] trigger_process_regex+0x7b/0xc0
    [<3138e86f>] event_trigger_write+0x4d/0xb0
    [<ffd66c19>] __vfs_write+0x30/0x200
    [<4f424a0d>] vfs_write+0x96/0x1b0
    [<da59a290>] ksys_write+0x53/0xc0
    [<3717101a>] __ia32_sys_write+0x15/0x20
    [<c5f23497>] do_fast_syscall_32+0x70/0x250
    [<46e2629c>] entry_SYSENTER_32+0xaf/0x102

This is because save_vars[] of struct hist_trigger_data are
not destroyed

kmemleak report 2:
    [<9092c50b>] kmem_cache_alloc_trace+0x138/0x270
    [<6e5e97c5>] create_var+0x3c/0x110
    [<de82f1b9>] create_field_var+0xaf/0x180
    [<528a2d68>] action_create+0xe2/0xc80
    [<63f50b61>] event_hist_trigger_func+0x15b5/0x1920
    [<28ea5d3d>] trigger_process_regex+0x7b/0xc0
    [<3138e86f>] event_trigger_write+0x4d/0xb0
    [<ffd66c19>] __vfs_write+0x30/0x200
    [<4f424a0d>] vfs_write+0x96/0x1b0
    [<da59a290>] ksys_write+0x53/0xc0
    [<3717101a>] __ia32_sys_write+0x15/0x20
    [<c5f23497>] do_fast_syscall_32+0x70/0x250
    [<46e2629c>] entry_SYSENTER_32+0xaf/0x102

struct hist_field allocated through create_var() do not initialize
"ref" field to 1. The code in __destroy_hist_field() does not destroy
object if "ref" is initialized to zero, the condition
if (--hist_field->ref > 1) always passes since unsigned int wraps.

kmemleak report 3:
    [<f8666fcc>] __kmalloc_track_caller+0x139/0x2b0
    [<bb7f80a5>] kstrdup+0x27/0x50
    [<39d70006>] init_var_ref+0x58/0xd0
    [<8ca76370>] create_var_ref+0x89/0xe0
    [<f045fc39>] action_create+0x38f/0xc80
    [<7c146821>] event_hist_trigger_func+0x15b5/0x1920
    [<07de3f61>] trigger_process_regex+0x7b/0xc0
    [<e87daf8f>] event_trigger_write+0x4d/0xb0
    [<19bf1512>] __vfs_write+0x30/0x200
    [<64ce4d27>] vfs_write+0x96/0x1b0
    [<a6f34170>] ksys_write+0x53/0xc0
    [<7d4230cd>] __ia32_sys_write+0x15/0x20
    [<8eadca00>] do_fast_syscall_32+0x70/0x250
    [<235cf985>] entry_SYSENTER_32+0xaf/0x102

hist_fields (system & event_name) are not freed

Link: http://lkml.kernel.org/r/20200422061503.GA5151@cosmos

Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-04-22 21:24:00 -04:00
Ian Rogers
f3bed55e85 perf/core: fix parent pid/tid in task exit events
Current logic yields the child task as the parent.

Before:
$ perf record bash -c "perf list > /dev/null"
$ perf script -D |grep 'FORK\|EXIT'
4387036190981094 0x5a70 [0x30]: PERF_RECORD_FORK(10472:10472):(10470:10470)
4387036606207580 0xf050 [0x30]: PERF_RECORD_EXIT(10472:10472):(10472:10472)
4387036607103839 0x17150 [0x30]: PERF_RECORD_EXIT(10470:10470):(10470:10470)
                                                   ^
  Note the repeated values here -------------------/

After:
383281514043 0x9d8 [0x30]: PERF_RECORD_FORK(2268:2268):(2266:2266)
383442003996 0x2180 [0x30]: PERF_RECORD_EXIT(2268:2268):(2266:2266)
383451297778 0xb70 [0x30]: PERF_RECORD_EXIT(2266:2266):(2265:2265)

Fixes: 94d5d1b2d8 ("perf_counter: Report the cloning task as parent on perf_counter_fork()")
Reported-by: KP Singh <kpsingh@google.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200417182842.12522-1-irogers@google.com
2020-04-22 23:10:14 +02:00
Quentin Perret
eaf5a92ebd sched/core: Fix reset-on-fork from RT with uclamp
uclamp_fork() resets the uclamp values to their default when the
reset-on-fork flag is set. It also checks whether the task has a RT
policy, and sets its uclamp.min to 1024 accordingly. However, during
reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after,
hence leading to an erroneous uclamp.min setting for the new task if it
was forked from RT.

Fix this by removing the unnecessary check on rt_task() in
uclamp_fork() as this doesn't make sense if the reset-on-fork flag is
set.

Fixes: 1a00d99997 ("sched/uclamp: Set default clamps for RT tasks")
Reported-by: Chitti Babu Theegala <ctheegal@codeaurora.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Patrick Bellasi <patrick.bellasi@matbug.net>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com
2020-04-22 23:10:13 +02:00
Paul Moore
3054d06719 audit: fix a net reference leak in audit_list_rules_send()
If audit_list_rules_send() fails when trying to create a new thread
to send the rules it also fails to cleanup properly, leaking a
reference to a net structure.  This patch fixes the error patch and
renames audit_send_list() to audit_send_list_thread() to better
match its cousin, audit_send_reply_thread().

Reported-by: teroincn@gmail.com
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-22 15:23:10 -04:00
Ingo Molnar
87cfeb1920 perf/core fixes and improvements:
kernel + tools/perf:
 
   Alexey Budankov:
 
   - Introduce CAP_PERFMON to kernel and user space.
 
 callchains:
 
   Adrian Hunter:
 
   - Allow using Intel PT to synthesize callchains for regular events.
 
   Kan Liang:
 
   - Stitch LBR records from multiple samples to get deeper backtraces,
     there are caveats, see the csets for details.
 
 perf script:
 
   Andreas Gerstmayr:
 
   - Add flamegraph.py script
 
 BPF:
 
   Jiri Olsa:
 
   - Synthesize bpf_trampoline/dispatcher ksymbol events.
 
 perf stat:
 
   Arnaldo Carvalho de Melo:
 
   - Honour --timeout for forked workloads.
 
   Stephane Eranian:
 
   - Force error in fallback on :k events, to avoid counting nothing when
     the user asks for kernel events but is not allowed to.
 
 perf bench:
 
   Ian Rogers:
 
   - Add event synthesis benchmark.
 
 tools api fs:
 
   Stephane Eranian:
 
  - Make xxx__mountpoint() more scalable
 
 libtraceevent:
 
   He Zhe:
 
   - Handle return value of asprintf.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXp2LlQAKCRCyPKLppCJ+
 J95oAP0ZihVUhESv/gdeX0IDE5g6Rd2V6LNcRj+jb7gX9NlQkwD/UfS454WV1ftQ
 qTwrkKPzY/5Tm2cLuVE7r7fJ6naDHgU=
 =FHm4
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-5.8-20200420' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

Pull perf/core fixes and improvements from Arnaldo Carvalho de Melo:

kernel + tools/perf:

  Alexey Budankov:

  - Introduce CAP_PERFMON to kernel and user space.

callchains:

  Adrian Hunter:

  - Allow using Intel PT to synthesize callchains for regular events.

  Kan Liang:

  - Stitch LBR records from multiple samples to get deeper backtraces,
    there are caveats, see the csets for details.

perf script:

  Andreas Gerstmayr:

  - Add flamegraph.py script

BPF:

  Jiri Olsa:

  - Synthesize bpf_trampoline/dispatcher ksymbol events.

perf stat:

  Arnaldo Carvalho de Melo:

  - Honour --timeout for forked workloads.

  Stephane Eranian:

  - Force error in fallback on :k events, to avoid counting nothing when
    the user asks for kernel events but is not allowed to.

perf bench:

  Ian Rogers:

  - Add event synthesis benchmark.

tools api fs:

  Stephane Eranian:

 - Make xxx__mountpoint() more scalable

libtraceevent:

  He Zhe:

  - Handle return value of asprintf.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-04-22 14:08:28 +02:00
Leon Romanovsky
51161bfc66 kernel/module: Hide vermagic header file from general use
VERMAGIC* definitions are not supposed to be used by the drivers,
see this [1] bug report, so introduce special define to guard inclusion
of this header file and define it in kernel/modules.h and in internal
script that generates *.mod.c files.

In-tree module build:
➜  kernel git:(vermagic) ✗ make clean
➜  kernel git:(vermagic) ✗ make M=drivers/infiniband/hw/mlx5
➜  kernel git:(vermagic) ✗ modinfo drivers/infiniband/hw/mlx5/mlx5_ib.ko
filename:	/images/leonro/src/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
<...>
vermagic:       5.6.0+ SMP mod_unload modversions

Out-of-tree module build:
➜  mlx5 make -C /images/leonro/src/kernel clean M=/tmp/mlx5
➜  mlx5 make -C /images/leonro/src/kernel M=/tmp/mlx5
➜  mlx5 modinfo /tmp/mlx5/mlx5_ib.ko
filename:       /tmp/mlx5/mlx5_ib.ko
<...>
vermagic:       5.6.0+ SMP mod_unload modversions

[1] https://lore.kernel.org/lkml/20200411155623.GA22175@zn.tnic
Reported-by: Borislav Petkov <bp@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Jessica Yu <jeyu@kernel.org>
Co-developed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21 13:27:37 -07:00
Peter Zijlstra
5c3a7db0c7 module: Harden STRICT_MODULE_RWX
We're very close to enforcing W^X memory, refuse to load modules that
violate this principle per construction.

[jeyu: move module_enforce_rwx_sections under STRICT_MODULE_RWX as per discussion]
Link: http://lore.kernel.org/r/20200403171303.GK20760@hirez.programming.kicks-ass.net
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-04-21 17:20:13 +02:00
Eric W. Biederman
61e713bdca signal: Avoid corrupting si_pid and si_uid in do_notify_parent
Christof Meerwald <cmeerw@cmeerw.org> writes:
> Hi,
>
> this is probably related to commit
> 7a0cf09494 (signal: Correct namespace
> fixups of si_pid and si_uid).
>
> With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a
> properly set si_pid field - this seems to happen for multi-threaded
> child processes.
>
> A simple test program (based on the sample from the signalfd man page):
>
> #include <sys/signalfd.h>
> #include <signal.h>
> #include <unistd.h>
> #include <spawn.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> #define handle_error(msg) \
>     do { perror(msg); exit(EXIT_FAILURE); } while (0)
>
> int main(int argc, char *argv[])
> {
>   sigset_t mask;
>   int sfd;
>   struct signalfd_siginfo fdsi;
>   ssize_t s;
>
>   sigemptyset(&mask);
>   sigaddset(&mask, SIGCHLD);
>
>   if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
>     handle_error("sigprocmask");
>
>   pid_t chldpid;
>   char *chldargv[] = { "./sfdclient", NULL };
>   posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL);
>
>   sfd = signalfd(-1, &mask, 0);
>   if (sfd == -1)
>     handle_error("signalfd");
>
>   for (;;) {
>     s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo));
>     if (s != sizeof(struct signalfd_siginfo))
>       handle_error("read");
>
>     if (fdsi.ssi_signo == SIGCHLD) {
>       printf("Got SIGCHLD %d %d %d %d\n",
>           fdsi.ssi_status, fdsi.ssi_code,
>           fdsi.ssi_uid, fdsi.ssi_pid);
>       return 0;
>     } else {
>       printf("Read unexpected signal\n");
>     }
>   }
> }
>
>
> and a multi-threaded client to test with:
>
> #include <unistd.h>
> #include <pthread.h>
>
> void *f(void *arg)
> {
>   sleep(100);
> }
>
> int main()
> {
>   pthread_t t[8];
>
>   for (int i = 0; i != 8; ++i)
>   {
>     pthread_create(&t[i], NULL, f, NULL);
>   }
> }
>
> I tried to do a bit of debugging and what seems to be happening is
> that
>
>   /* From an ancestor pid namespace? */
>   if (!task_pid_nr_ns(current, task_active_pid_ns(t))) {
>
> fails inside task_pid_nr_ns because the check for "pid_alive" fails.
>
> This code seems to be called from do_notify_parent and there we
> actually have "tsk != current" (I am assuming both are threads of the
> current process?)

I instrumented the code with a warning and received the following backtrace:
> WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15
> Modules linked in:
> CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15
> Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd
> RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046
> RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29
> RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001
> R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140
> FS:  0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0
> Call Trace:
>  send_signal+0x1c8/0x310
>  do_notify_parent+0x50f/0x550
>  release_task.part.21+0x4fd/0x620
>  do_exit+0x6f6/0xaf0
>  do_group_exit+0x42/0xb0
>  get_signal+0x13b/0xbb0
>  do_signal+0x2b/0x670
>  ? __audit_syscall_exit+0x24d/0x2b0
>  ? rcu_read_lock_sched_held+0x4d/0x60
>  ? kfree+0x24c/0x2b0
>  do_syscall_64+0x176/0x640
>  ? trace_hardirqs_off_thunk+0x1a/0x1c
>  entry_SYSCALL_64_after_hwframe+0x49/0xb3

The immediate problem is as Christof noticed that "pid_alive(current) == false".
This happens because do_notify_parent is called from the last thread to exit
in a process after that thread has been reaped.

The bigger issue is that do_notify_parent can be called from any
process that manages to wait on a thread of a multi-threaded process
from wait_task_zombie.  So any logic based upon current for
do_notify_parent is just nonsense, as current can be pretty much
anything.

So change do_notify_parent to call __send_signal directly.

Inspecting the code it appears this problem has existed since the pid
namespace support started handling this case in 2.6.30.  This fix only
backports to 7a0cf09494 ("signal: Correct namespace fixups of si_pid and si_uid")
where the problem logic was moved out of __send_signal and into send_signal.

Cc: stable@vger.kernel.org
Fixes: 6588c1e3ff ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary")
Ref: 921cf9f630 ("signals: protect cinit from unblocked SIG_DFL signals")
Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/
Reported-by: Christof Meerwald <cmeerw@cmeerw.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-21 09:55:30 -05:00
Jann Horn
8ff3571f7e bpf: Fix handling of XADD on BTF memory
check_xadd() can cause check_ptr_to_btf_access() to be executed with
atype==BPF_READ and value_regno==-1 (meaning "just check whether the access
is okay, don't tell me what type it will result in").
Handle that case properly and skip writing type information, instead of
indexing into the registers at index -1 and writing into out-of-bounds
memory.

Note that at least at the moment, you can't actually write through a BTF
pointer, so check_xadd() will reject the program after calling
check_ptr_to_btf_access with atype==BPF_WRITE; but that's after the
verifier has already corrupted memory.

This patch assumes that BTF pointers are not available in unprivileged
programs.

Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200417000007.10734-2-jannh@google.com
2020-04-20 18:41:34 -07:00
Jann Horn
6e7e63cbb0 bpf: Forbid XADD on spilled pointers for unprivileged users
When check_xadd() verifies an XADD operation on a pointer to a stack slot
containing a spilled pointer, check_stack_read() verifies that the read,
which is part of XADD, is valid. However, since the placeholder value -1 is
passed as `value_regno`, check_stack_read() can only return a binary
decision and can't return the type of the value that was read. The intent
here is to verify whether the value read from the stack slot may be used as
a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and
the type information is lost when check_stack_read() returns, this is not
enforced, and a malicious user can abuse XADD to leak spilled kernel
pointers.

Fix it by letting check_stack_read() verify that the value is usable as a
SCALAR_VALUE if no type information is passed to the caller.

To be able to use __is_pointer_value() in check_stack_read(), move it up.

Fix up the expected unprivileged error message for a BPF selftest that,
until now, assumed that unprivileged users can use XADD on stack-spilled
pointers. This also gives us a test for the behavior introduced in this
patch for free.

In theory, this could also be fixed by forbidding XADD on stack spills
entirely, since XADD is a locked operation (for operations on memory with
concurrency) and there can't be any concurrency on the BPF stack; but
Alexei has said that he wants to keep XADD on stack slots working to avoid
changes to the test suite [1].

The following BPF program demonstrates how to leak a BPF map pointer as an
unprivileged user using this bug:

    // r7 = map_pointer
    BPF_LD_MAP_FD(BPF_REG_7, small_map),
    // r8 = launder(map_pointer)
    BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8),
    BPF_MOV64_IMM(BPF_REG_1, 0),
    ((struct bpf_insn) {
      .code  = BPF_STX | BPF_DW | BPF_XADD,
      .dst_reg = BPF_REG_FP,
      .src_reg = BPF_REG_1,
      .off = -8
    }),
    BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8),

    // store r8 into map
    BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7),
    BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
    BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0),
    BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),
    BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0),

    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN()

[1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/

Fixes: 17a5267067 ("bpf: verifier (add verifier core)")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com
2020-04-20 18:41:34 -07:00
Toke Høiland-Jørgensen
bc23d0e3f7 cpumap: Avoid warning when CONFIG_DEBUG_PER_CPU_MAPS is enabled
When the kernel is built with CONFIG_DEBUG_PER_CPU_MAPS, the cpumap code
can trigger a spurious warning if CONFIG_CPUMASK_OFFSTACK is also set. This
happens because in this configuration, NR_CPUS can be larger than
nr_cpumask_bits, so the initial check in cpu_map_alloc() is not sufficient
to guard against hitting the warning in cpumask_check().

Fix this by explicitly checking the supplied key against the
nr_cpumask_bits variable before calling cpu_possible().

Fixes: 6710e11269 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Xiumei Mu <xmu@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200416083120.453718-1-toke@redhat.com
2020-04-20 18:38:04 -07:00
Paul Moore
a48b284b40 audit: fix a net reference leak in audit_send_reply()
If audit_send_reply() fails when trying to create a new thread to
send the reply it also fails to cleanup properly, leaking a reference
to a net structure.  This patch fixes the error path and makes a
handful of other cleanups that came up while fixing the code.

Reported-by: teroincn@gmail.com
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-20 19:33:56 -04:00
Mauro Carvalho Chehab
03c109d668 futex: get rid of a kernel-docs build warning
Adjust whitespaces and blank lines in order to get rid of this:

	./kernel/futex.c:491: WARNING: Definition list ends without a blank line; unexpected unindent.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/57788af7889161483e0c97f91c079cfb3986c4b3.1586881715.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-04-20 15:45:41 -06:00
Mauro Carvalho Chehab
0c1bc6b845 docs: filesystems: fix renamed references
Some filesystem references got broken by a previous patch
series I submitted. Address those.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: David Sterba <dsterba@suse.com> # fs/affs/Kconfig
Link: https://lore.kernel.org/r/57318c53008dbda7f6f4a5a9e5787f4d37e8565a.1586881715.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-04-20 15:45:22 -06:00
Paul Moore
763dafc520 audit: check the length of userspace generated audit records
Commit 7561252892 ("audit: always check the netlink payload length
in audit_receive_msg()") fixed a number of missing message length
checks, but forgot to check the length of userspace generated audit
records.  The good news is that you need CAP_AUDIT_WRITE to submit
userspace audit records, which is generally only given to trusted
processes, so the impact should be limited.

Cc: stable@vger.kernel.org
Fixes: 7561252892 ("audit: always check the netlink payload length in audit_receive_msg()")
Reported-by: syzbot+49e69b4d71a420ceda3e@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-20 17:10:58 -04:00
David Rientjes
c84dc6e68a dma-pool: add additional coherent pools to map to gfp mask
The single atomic pool is allocated from the lowest zone possible since
it is guaranteed to be applicable for any DMA allocation.

Devices may allocate through the DMA API but not have a strict reliance
on GFP_DMA memory.  Since the atomic pool will be used for all
non-blockable allocations, returning all memory from ZONE_DMA may
unnecessarily deplete the zone.

Provision for multiple atomic pools that will map to the optimal gfp
mask of the device.

When allocating non-blockable memory, determine the optimal gfp mask of
the device and use the appropriate atomic pool.

The coherent DMA mask will remain the same between allocation and free
and, thus, memory will be freed to the same atomic pool it was allocated
from.

__dma_atomic_pool_init() will be changed to return struct gen_pool *
later once dynamic expansion is added.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-20 12:09:40 +02:00
David Rientjes
e860c299ac dma-remap: separate DMA atomic pools from direct remap code
DMA atomic pools will be needed beyond only CONFIG_DMA_DIRECT_REMAP so
separate them out into their own file.

This also adds a new Kconfig option that can be subsequently used for
options, such as CONFIG_AMD_MEM_ENCRYPT, that will utilize the coherent
pools but do not have a dependency on direct remapping.

For this patch alone, there is no functional change introduced.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Rientjes <rientjes@google.com>
[hch: fixup copyrights and remove unused includes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-20 12:08:45 +02:00
Jason Yan
05f099a7d0 dma-debug: make __dma_entry_alloc_check_leak() static
Fix the following sparse warning:

kernel/dma/debug.c:659:6: warning: symbol '__dma_entry_alloc_check_leak'
was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-20 12:05:18 +02:00
Linus Torvalds
3e0dea5768 An update for the proc interface of time namespaces: Use symbolic names
instead of clockid numbers. The usability nuisance of numbers was noticed
 by Michael when polishing the man page.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cVQsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWBjEAC0dCUHKDLoG0FeyG4tb4FEBW2iTqM8
 UFirH26K18s8QSePdvfJlaxtN2SdfNZG7UgYN7wz1fDFQy05zTz7Rek8UrDuu3rh
 mVph/UZtUJl+6ypW2Lw9x5RWpT5yzay2iowUyBPnNxU9F/0uRKvXQFju3L83Lo/z
 Z4ni7gVEw87dQi5E74tEv6iaydgPuCBpGxoMahotnHyclqMjA0QuAK6nhN5ZTcAn
 senoorS/VqkSF5qEvIUwe7+F+kkMbwQryT7merJyNwh/F49xTTXRyBmiys1MF8Og
 MTEvldXKy2pCh2UfRa/x84WWwOUVNivTXdIXjhalsblczL0j1z9MsQ8b3AOXOiLf
 S+/Ntbb2dGo4qE22jekMwZ54Pm4x5NzChCU8+3pvd6IrPWZKi6vue74Kd0RNHQg/
 0kWOlZnIP2ArVW0bFqV6jhMYkjmVdK6gm7cUpFV66L2H8zbfFuc4OlxJYEFYivye
 9Yck+rFQmMwA15ZXYIpggkd7Rf/5CGF1CiMBAvP/ILubpgbJqnn6/tGByq8tDKdy
 mqXX+NHF0M/7rJd5vr7wP6p3E5nQ9l/41rh9ii9EDLXf4jsWVO3EyobJ7fFHwprs
 5tTWGxVJymUQLq/LQPXOVVENGK+ZsXXNGn/4n8IOVroeypxADTGyhtSh122kFFhv
 jPcVHqpBUd0g4Q==
 =slEk
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull time namespace fix from Thomas Gleixner:
 "An update for the proc interface of time namespaces: Use symbolic
  names instead of clockid numbers. The usability nuisance of numbers
  was noticed by Michael when polishing the man page"

* tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
2020-04-19 11:46:21 -07:00
Linus Torvalds
80ade29e1e A set of fixes/updates for the interrupt subsystem:
- Remove setup_irq() and remove_irq(). All users have been converted so
    remove them before new users surface.
 
  - A set of bugfixes for various interrupt chip drivers
 
  - Add a few missing static attributes to address sparse warnings.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cUuMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYi7EACOFPrwdOlKqDdgU1FGReEzhJeNSSyH
 yUp1m2nNckz8Y2B+ihnLsfvcktZSXYRuDTZ/u/rmaKqq2wH5Q/h4DNQxEfoMNUep
 IVBlvAFcGsvpdSbrlc+nx6sEo0K2b22AQVHdyPECiQYFZJikstAtEfzEv+ZaUr2S
 Lcds295BIQylbugQpcVZL73j6iUKQ+P5YU0Wlkd/Vhlnxe9UdMd/N1P3GoRaRlOa
 QxYDJCnZJjWkN+cEVRCAZVTat6pd3zaMHvEabI39Lzx4U+nu4vh62TILwk+wdpuA
 DzgA+ENFXzv2zLlnF8gB0wKWw3J99No9gfRpuK/vWBQ68UeZsPlM5PKEr93oD4cC
 To9D70r71UM+LS+Km8ciFlqeT4N+hIMb/x8rpIf5Tcfn5spXjNEhR4U6/d/D2ZYy
 cQiu82th9kSOMGBhlrfkJ0gAT20UfAktDHU1M4JhwI5Y/DLusb6mfg0CRMj8ucOV
 0xrKkgHxhX162oRTKzy5OTMWQRGTvIQZg1QE3xxtrT2qCq4ypu0EHQbh3GdfcIVQ
 8n+s/Dde6etmbSwDDdEuRi///zM+hvaiXi5KOV28LYgRDbU78cAX8uRgX9sq2pg+
 WxK9ulprkW6Ci1yTts9Q6FY+ZBekg7NBKXXDCJdPwXxTLRrdci68pPZip12AaWxP
 2HYxWhE8LvmKAw==
 =jaX5
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
 "A set of fixes/updates for the interrupt subsystem:

   - Remove setup_irq() and remove_irq(). All users have been converted
     so remove them before new users surface.

   - A set of bugfixes for various interrupt chip drivers

   - Add a few missing static attributes to address sparse warnings"

* tag 'irq-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/irq-bcm7038-l1: Make bcm7038_l1_of_init() static
  irqchip/irq-mvebu-icu: Make legacy_bindings static
  irqchip/meson-gpio: Fix HARDIRQ-safe -> HARDIRQ-unsafe lock order
  irqchip/sifive-plic: Fix maximum priority threshold value
  irqchip/ti-sci-inta: Fix processing of masked irqs
  irqchip/mbigen: Free msi_desc on device teardown
  irqchip/gic-v4.1: Update effective affinity of virtual SGIs
  irqchip/gic-v4.1: Add support for VPENDBASER's Dirty+Valid signaling
  genirq: Remove setup_irq() and remove_irq()
2020-04-19 11:23:33 -07:00
Linus Torvalds
08dd387277 Two fixes for the scheduler:
- Work around an uninitializaed variable warning where GCC can't figure it
    out.
 
  - Allow 'isolcpus=' to skip unknown subparameters so that older kernels
    work with the commandline of a newer kernel. Improve the error output
    while at it.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cVFwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZAaD/9i9QgLuj1Ka59kNPAs68i5Kjar72VS
 us1dM2n0Tx6lIUEYsdJsu4GTRi5NEBqLbmwSgsXROnhI6Jd17hHp5JViezk1GZWc
 Zg2uARAj9Jsqh2q5IjriNOwzq47PDC4dmSUzaecJff8PqGkk9Lpry6qvx3A02uSn
 tVVQAXqwCbPTaQzuhEf/q6mbfRaO90tVqGdseD+1wE0FBFfPLwddegLEGhL1vYsA
 55UhpKCGsS9lrfmgkxk1Xb3e0pJBObiV0SXdn2qHqJTpVUaDTZzsWgJHXg+0Fe1V
 0ZsuGfmaaisYPBZmqRo4HALbkgnvVECSbp7FAnhvqiQMyNaciiwkkFv9Ap5+aayf
 c8wXz/emAmuEMNzipovyFUITCIOs6IL1CkESsbO8Bgx9sTHO+pcgNEYrsX1953UC
 45GjhXR3ymnclqsVqfMWIcNRukk0g9W38yp1DgA5IIhVz1rHogEquD9F10qsCGb1
 FgSOnyGlU0I0JR5tEfqR0TeCuqLGKB2NvnEgLU4OVpsdEC5ac87uvzWEZuOmR5Z4
 vQCkps1z1ABW5fB/kFO83OiA5BZfDGnq5Vvh6XDOv6EeWjhIXrolu6VeTYpBSInq
 w0oNpsaA9wsy7WIy1RJ8jtSNsgS8fULCE5lUBtFeSUY/T7IcWd0lwnTlL97A4qzg
 GdYVT/UAHLCzCA==
 =AKgh
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "Two fixes for the scheduler:

   - Work around an uninitialized variable warning where GCC can't
     figure it out.

   - Allow 'isolcpus=' to skip unknown subparameters so that older
     kernels work with the commandline of a newer kernel. Improve the
     error output while at it"

* tag 'sched-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/vtime: Work around an unitialized variable warning
  sched/isolation: Allow "isolcpus=" to skip unknown sub-parameters
2020-04-19 11:18:20 -07:00
Linus Torvalds
5e7de58127 A single bugfix for RCU to prevent taking a lock in NMI context.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6cUf8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofFtD/92Ufh1t+1uSe/txUJ/lTdY98cpcD5i
 UZJYdF102VLkhwA3Ors0Udtrnvuyxb5FFSfJZ3/N7tsNaXgcz1QOQsqoZz4SSgLJ
 pPj+2v+LNI6rIWDXuwszbLM/nT1mJGAK9NQ6+AhvIyBMPbht4/Lajsv7vkFMTzXw
 8o+a5NqLKWDca7/eLcgbcMiSsulRZxRld0D7MSP7RBeEWeylt31q3JIBp7ldzJ77
 0KCdQEd3TAkP+hOZW98CNgcLgGtCxiOJ5EgjUOFJyD/+5mj219czKF53HXnn4amk
 5lmdzSB0RKV2JTNFKNZQOobgMPp8VIIf6R6QlDp5MdrGYWTIbV0p5Hak2u41Cyma
 BfxxkVZJipjC7mgAvZLgy0/Md7n2Eu5uAW0e72AYEmv7IwOGyHh9YL7IYiZQld67
 5q8xEgrIIpaCwscVjZqP3+GHc8KGWYuv8puMCeOk1v7UeWsRlc8j/eGWIXnY4E8v
 wvqWAB91dHlBh+CHaFtdy97buYinVzW/Tv2NTxLFKgxyGJTg82H2hdTpjRVYi5Z4
 DM2NQRLcD1ozvh8tsFzXWP+/uemlE+EUPBZofCjJ0WtzH1GWarf3YNqviFqldRLr
 GyEFoyIc3Ra/hTEzD9yCC0UWwJubhAVLWOuu9pJKuaaei8s1aiusABQGbz5sNG9l
 AcpB9KFMtsLAXA==
 =QjbA
 -----END PGP SIGNATURE-----

Merge tag 'core-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU fix from Thomas Gleixner:
 "A single bugfix for RCU to prevent taking a lock in NMI context"

* tag 'core-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Don't acquire lock in NMI handler in rcu_nmi_enter_common()
2020-04-19 11:16:00 -07:00
Kaitao Cheng
58eb7b77ad smp: Use smp_call_func_t in on_each_cpu()
Use smp_call_func_t instead of the open coded function pointer argument.

Signed-off-by: Kaitao Cheng <pilgrimtao@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20200417162451.91969-1-pilgrimtao@gmail.com
2020-04-19 17:51:48 +02:00
Linus Torvalds
774acb2a09 for-linus-2020-04-18
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXprWIAAKCRCRxhvAZXjc
 omUyAQCQcvJQhilLv0b7FtBAbN7+TkzV8vAQTzEITuHPa6m/HwEA2Gp9ZDTJfQbV
 T6utOrTm/LT0mfBkiDLSnLPtVzh7mgE=
 =Jz3d
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fixes from Christian Brauner:
 "A few fixes and minor improvements:

   - Correctly validate the cgroup file descriptor when clone3() is used
     with CLONE_INTO_CGROUP.

   - Check that a new enough version of struct clone_args is passed
     which supports the cgroup file descriptor argument when
     CLONE_INTO_CGROUP is set in the flags argument.

   - Catch nonsensical struct clone_args layouts at build time.

   - Catch extensions of struct clone_args without updating the uapi
     visible size definitions at build time.

   - Check whether the signal is valid early in kill_pid_usb_asyncio()
     before doing further work.

   - Replace open-coded rcu_read_lock()+kill_pid_info()+rcu_read_unlock()
     sequence in kill_something_info() with kill_proc_info() which is a
     dedicated helper to do just that"

* tag 'for-linus-2020-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  clone3: add build-time CLONE_ARGS_SIZE_VER* validity checks
  clone3: add a check for the user struct size if CLONE_INTO_CGROUP is set
  clone3: fix cgroup argument sanity check
  signal: use kill_proc_info instead of kill_pid_info in kill_something_info
  signal: check sig before setting info in kill_pid_usb_asyncio
2020-04-18 11:38:51 -07:00
Jessica Yu
db991af02f module: break nested ARCH_HAS_STRICT_MODULE_RWX and STRICT_MODULE_RWX #ifdefs
Various frob_* and module_{enable,disable}_* functions are defined in a
CONFIG_ARCH_HAS_STRICT_MODULE_RWX ifdef block which also has a nested
CONFIG_STRICT_MODULE_RWX ifdef block within it. This is unecessary and
makes things hard to read. Not only that, this construction requires
redundant empty stubs for module_enable_nx(). I suspect this was
originally done for cosmetic reasons - to keep all the frob_* functions
in the same place, and all the module_{enable,disable}_* functions right
after, but as a result it made the code harder to read.

Make this more readable by unnesting the ifdef blocks and getting rid of
the redundant empty stubs.

Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-04-17 14:56:35 +02:00
Linus Torvalds
c8372665b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.

 2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.

 3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.

 4) Adjust icmp6 message source address selection when routes have a
    preferred source address set, from Tim Stallard.

 5) Be sure to validate HSR protocol version when creating new links,
    from Taehee Yoo.

 6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
    non-initial namespaces, from Michael Weiß.

 7) Missing release firmware call in mlx5, from Eran Ben Elisha.

 8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
    Taehee Yoo.

 9) Fix pause frame negotiation in marvell phy driver, from Clemens
    Gruber.

10) Record RX queue early enough in tun packet paths such that XDP
    programs will see the correct RX queue index, from Gilberto Bertin.

11) Fix double unlock in mptcp, from Florian Westphal.

12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.

13) marvell10g needs to soft reset PHY when coming out of low power
    mode, from Russell King.

14) Fix MTU setting regression in stmmac for some chip types, from
    Florian Fainelli.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
  amd-xgbe: Use __napi_schedule() in BH context
  mISDN: make dmril and dmrim static
  net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
  net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
  tipc: fix incorrect increasing of link window
  Documentation: Fix tcp_challenge_ack_limit default value
  net: tulip: make early_486_chipsets static
  dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
  ipv6: remove redundant assignment to variable err
  net/rds: Use ERR_PTR for rds_message_alloc_sgs()
  net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
  selftests/bpf: Check for correct program attach/detach in xdp_attach test
  libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
  libbpf: Always specify expected_attach_type on program load if supported
  xsk: Add missing check on user supplied headroom size
  mac80211: fix channel switch trigger from unknown mesh peer
  mac80211: fix race in ieee80211_register_hw()
  net: marvell10g: soft-reset the PHY when coming out of low power
  net: marvell10g: report firmware version
  net/cxgb4: Check the return from t4_query_params properly
  ...
2020-04-16 14:52:29 -07:00
Alexey Budankov
031258da05 trace/bpf_trace: Open access for CAP_PERFMON privileged process
Open access to bpf_trace monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without the
rest of CAP_SYS_ADMIN credentials, excludes chances to misuse the
credentials and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to bpf_trace monitoring
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure bpf_trace monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/c0a0ae47-8b6e-ff3e-416b-3cd1faaf71c0@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-04-16 12:19:08 -03:00
Alexey Budankov
c9e0924e5c perf/core: open access to probes for CAP_PERFMON privileged process
Open access to monitoring via kprobes and uprobes and eBPF tracing for
CAP_PERFMON privileged process. Providing the access under CAP_PERFMON
capability singly, without the rest of CAP_SYS_ADMIN credentials,
excludes chances to misuse the credentials and makes operation more
secure.

perf kprobes and uprobes are used by ftrace and eBPF. perf probe uses
ftrace to define new kprobe events, and those events are treated as
tracepoint events. eBPF defines new probes via perf_event_open interface
and then the probes are used in eBPF tracing.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Cc: linux-man@vger.kernel.org
Link: http://lore.kernel.org/lkml/3c129d9a-ba8a-3483-ecc5-ad6c8e7c203f@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-04-16 12:19:08 -03:00
Alexey Budankov
18aa185662 perf/core: Open access to the core for CAP_PERFMON privileged process
Open access to monitoring of kernel code, CPUs, tracepoints and
namespaces data for a CAP_PERFMON privileged process. Providing the
access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons the access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: linux-man@vger.kernel.org
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/471acaef-bb8a-5ce2-923f-90606b78eef9@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-04-16 12:19:08 -03:00
Will Deacon
10415533a9 gcov: Remove old GCC 3.4 support
The kernel requires at least GCC 4.8 in order to build, and so there is
no need to cater for the pre-4.7 gcov format.

Remove the obsolete code.

Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-04-16 12:28:35 +01:00
Andrei Vagin
94d440d618 proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets
Michael Kerrisk suggested to replace numeric clock IDs with symbolic names.

Now the content of these files looks like this:
$ cat /proc/774/timens_offsets
monotonic      864000         0
boottime      1728000         0

For setting offsets, both representations of clocks (numeric and symbolic)
can be used.

As for compatibility, it is acceptable to change things as long as
userspace doesn't care. The format of timens_offsets files is very new and
there are no userspace tools yet which rely on this format.

But three projects crun, util-linux and criu rely on the interface of
setting time offsets and this is why it's required to continue supporting
the numeric clock IDs on write.

Fixes: 04a8682a71 ("fs/proc: Introduce /proc/pid/timens_offsets")
Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com
2020-04-16 12:10:54 +02:00
Kairui Song
4c5b566c21 crash_dump: Remove no longer used saved_max_pfn
saved_max_pfn was originally introduced in commit

  92aa63a5a1 ("[PATCH] kdump: Retrieve saved max pfn")

It used to make sure that the user does not try to read the physical memory
beyond saved_max_pfn. But since commit

  921d58c0e6 ("vmcore: remove saved_max_pfn check")

it's no longer used for the check. This variable doesn't have any users
anymore so just remove it.

 [ bp: Drop the Calgary IOMMU reference from the commit message. ]

Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/20200330181544.1595733-1-kasong@redhat.com
2020-04-15 11:21:54 +02:00
Borislav Petkov
e0d648f9d8 sched/vtime: Work around an unitialized variable warning
Work around this warning:

  kernel/sched/cputime.c: In function ‘kcpustat_field’:
  kernel/sched/cputime.c:1007:6: warning: ‘val’ may be used uninitialized in this function [-Wmaybe-uninitialized]

because GCC can't see that val is used only when err is 0.

Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200327214334.GF8015@zn.tnic
2020-04-15 11:06:50 +02:00
Peter Xu
3662daf023 sched/isolation: Allow "isolcpus=" to skip unknown sub-parameters
The "isolcpus=" parameter allows sub-parameters before the cpulist is
specified, and if the parser detects an unknown sub-parameters the whole
parameter will be ignored.

This design is incompatible with itself when new sub-parameters are added.
An older kernel will not recognize the new sub-parameter and will
invalidate the whole parameter so the CPU isolation will not take
effect. It emits a warning:

    isolcpus: Error, unknown flag

The better and compatible way is to allow "isolcpus=" to skip unknown
sub-parameters, so that even if new sub-parameters are added an older
kernel will still be able to behave as usual even if with the new
sub-parameter specified on the command line.

Ideally this should have been there when the first sub-parameter for
"isolcpus=" was introduced.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200403223517.406353-1-peterx@redhat.com
2020-04-15 10:38:26 +02:00
Eugene Syromiatnikov
a966dcfe15 clone3: add build-time CLONE_ARGS_SIZE_VER* validity checks
CLONE_ARGS_SIZE_VER* macros are defined explicitly and not via
the offsets of the relevant struct clone_args fields, which makes
it rather error-prone, so it probably makes sense to add some
compile-time checks for them (including the one that breaks
on struct clone_args extension as a reminder to add a relevant
size macro and a similar check).  Function copy_clone_args_from_user
seems to be a good place for such checks.

Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412202658.GA31499@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-04-15 09:56:32 +02:00
Eugene Syromiatnikov
62173872ca clone3: add a check for the user struct size if CLONE_INTO_CGROUP is set
Passing CLONE_INTO_CGROUP with an under-sized structure (that doesn't
properly contain cgroup field) seems like garbage input, especially
considering the fact that fd 0 is a valid descriptor.

Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412203123.GA5869@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-04-15 09:56:25 +02:00
Eugene Syromiatnikov
e82a118f57 clone3: fix cgroup argument sanity check
Checking that cgroup field value of struct clone_args is less than 0
is useless, as it is defined as unsigned 64-bit integer.  Moreover,
it doesn't catch the situations where its higher bits are lost during
the assignment to the cgroup field of the cgroup field of the internal
struct kernel_clone_args (where it is declared as signed 32-bit
integer), so it is still possible to pass garbage there.  A check
against INT_MAX solves both these issues.

Fixes: ef2c41cf38 ("clone3: allow spawning processes into cgroups")
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200412202533.GA29554@asgard.redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-04-15 09:56:12 +02:00
Xiao Yang
0bbe7f7199 tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation
Traced event can trigger 'snapshot' operation(i.e. calls snapshot_trigger()
or snapshot_count_trigger()) when register_snapshot_trigger() has completed
registration but doesn't allocate buffer for 'snapshot' event trigger.  In
the rare case, 'snapshot' operation always detects the lack of allocated
buffer so make register_snapshot_trigger() allocate buffer first.

trigger-snapshot.tc in kselftest reproduces the issue on slow vm:
-----------------------------------------------------------
cat trace
...
ftracetest-3028  [002] ....   236.784290: sched_process_fork: comm=ftracetest pid=3028 child_comm=ftracetest child_pid=3036
     <...>-2875  [003] ....   240.460335: tracing_snapshot_instance_cond: *** SNAPSHOT NOT ALLOCATED ***
     <...>-2875  [003] ....   240.460338: tracing_snapshot_instance_cond: *** stopping trace here!   ***
-----------------------------------------------------------

Link: http://lkml.kernel.org/r/20200414015145.66236-1-yangx.jy@cn.fujitsu.com

Cc: stable@vger.kernel.org
Fixes: 93e31ffbf4 ("tracing: Add 'snapshot' event trigger command")
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-04-14 22:02:10 -04:00
Zou Wei
89f33dcadb bpf: remove unneeded conversion to bool in __mark_reg_unknown
This issue was detected by using the Coccinelle software:

  kernel/bpf/verifier.c:1259:16-21: WARNING: conversion to bool not needed here

The conversion to bool is unneeded, remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zou Wei <zou_wei@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/1586779076-101346-1-git-send-email-zou_wei@huawei.com
2020-04-14 21:40:06 +02:00
Andrii Nakryiko
1f6cb19be2 bpf: Prevent re-mmap()'ing BPF map as writable for initially r/o mapping
VM_MAYWRITE flag during initial memory mapping determines if already mmap()'ed
pages can be later remapped as writable ones through mprotect() call. To
prevent user application to rewrite contents of memory-mapped as read-only and
subsequently frozen BPF map, remove VM_MAYWRITE flag completely on initially
read-only mapping.

Alternatively, we could treat any memory-mapping on unfrozen map as writable
and bump writecnt instead. But there is little legitimate reason to map
BPF map as read-only and then re-mmap() it as writable through mprotect(),
instead of just mmap()'ing it as read/write from the very beginning.

Also, at the suggestion of Jann Horn, drop unnecessary refcounting in mmap
operations. We can just rely on VMA holding reference to BPF map's file
properly.

Fixes: fc9702273e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/bpf/20200410202613.3679837-1-andriin@fb.com
2020-04-14 21:28:57 +02:00
afzal mohammed
07d8350ede genirq: Remove setup_irq() and remove_irq()
Now that all the users of setup_irq() & remove_irq() have been replaced by
request_irq() & free_irq() respectively, delete them.

Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lkml.kernel.org/r/0aa8771ada1ac8e1312f6882980c9c08bd023148.1585320721.git.afzal.mohd.ma@gmail.com
2020-04-14 10:08:50 +02:00
Ingo Molnar
40e7d7bdc1 Merge branch 'urgent-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull RCU fix from Paul E. McKenney.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-04-14 08:36:41 +02:00
Marco Elver
cdb9b07d8c kcsan: Make reporting aware of KCSAN tests
Reporting hides KCSAN runtime functions in the stack trace, with
filtering done based on function names. Currently this included all
functions (or modules) that would match "kcsan_". Make the filter aware
of KCSAN tests, which contain "kcsan_test", and are no longer skipped in
the report.

This is in preparation for adding a KCSAN test module.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-13 17:18:16 -07:00
Marco Elver
f770ed10a9 kcsan: Fix function matching in report
Pass string length as returned by scnprintf() to strnstr(), since
strnstr() searches exactly len bytes in haystack, even if it contains a
NUL-terminator before haystack+len.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-13 17:18:15 -07:00
Marco Elver
d8949ef1d9 kcsan: Introduce scoped ASSERT_EXCLUSIVE macros
Introduce ASSERT_EXCLUSIVE_*_SCOPED(), which provide an intuitive
interface to use the scoped-access feature, without having to explicitly
mark the start and end of the desired scope. Basing duration of the
checks on scope avoids accidental misuse and resulting false positives,
which may be hard to debug. See added comments for usage.

The macros are implemented using __attribute__((__cleanup__(func))),
which is supported by all compilers that currently support KCSAN.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-13 17:18:13 -07:00
Marco Elver
757a4cefde kcsan: Add support for scoped accesses
This adds support for scoped accesses, where the memory range is checked
for the duration of the scope. The feature is implemented by inserting
the relevant access information into a list of scoped accesses for
the current execution context, which are then checked (until removed)
on every call (through instrumentation) into the KCSAN runtime.

An alternative, more complex, implementation could set up a watchpoint for
the scoped access, and keep the watchpoint set up. This, however, would
require first exposing a handle to the watchpoint, as well as dealing
with cases such as accesses by the same thread while the watchpoint is
still set up (and several more cases). It is also doubtful if this would
provide any benefit, since the majority of delay where the watchpoint
is set up is likely due to the injected delays by KCSAN.  Therefore,
the implementation in this patch is simpler and avoids hurting KCSAN's
main use-case (normal data race detection); it also implicitly increases
scoped-access race-detection-ability due to increased probability of
setting up watchpoints by repeatedly calling __kcsan_check_access()
throughout the scope of the access.

The implementation required adding an additional conditional branch to
the fast-path. However, the microbenchmark showed a *speedup* of ~5%
on the fast-path. This appears to be due to subtly improved codegen by
GCC from moving get_ctx() and associated load of preempt_count earlier.

Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-13 17:18:11 -07:00
Marco Elver
6119418f94 kcsan: Avoid blocking producers in prepare_report()
To avoid deadlock in case watchers can be interrupted, we need to ensure
that producers of the struct other_info can never be blocked by an
unrelated consumer. (Likely to occur with KCSAN_INTERRUPT_WATCHER.)

There are several cases that can lead to this scenario, for example:

	1. A watchpoint A was set up by task T1, but interrupted by
	   interrupt I1. Some other thread (task or interrupt) finds
	   watchpoint A consumes it, and sets other_info. Then I1 also
	   finds some unrelated watchpoint B, consumes it, but is blocked
	   because other_info is in use. T1 cannot consume other_info
	   because I1 never returns -> deadlock.

	2. A watchpoint A was set up by task T1, but interrupted by
	   interrupt I1, which also sets up a watchpoint B. Some other
	   thread finds watchpoint A, and consumes it and sets up
	   other_info with its information. Similarly some other thread
	   finds watchpoint B and consumes it, but is then blocked because
	   other_info is in use. When I1 continues it sees its watchpoint
	   was consumed, and that it must wait for other_info, which
	   currently contains information to be consumed by T1. However, T1
	   cannot unblock other_info because I1 never returns -> deadlock.

To avoid this, we need to ensure that producers of struct other_info
always have a usable other_info entry. This is obviously not the case
with only a single instance of struct other_info, as concurrent
producers must wait for the entry to be released by some consumer (which
may be locked up as illustrated above).

While it would be nice if producers could simply call kmalloc() and
append their instance of struct other_info to a list, we are very
limited in this code path: since KCSAN can instrument the allocators
themselves, calling kmalloc() could lead to deadlock or corrupted
allocator state.

Since producers of the struct other_info will always succeed at
try_consume_watchpoint(), preceding the call into kcsan_report(), we
know that the particular watchpoint slot cannot simply be reused or
consumed by another potential other_info producer. If we move removal of
a watchpoint after reporting (by the consumer of struct other_info), we
can see a consumed watchpoint as a held lock on elements of other_info,
if we create a one-to-one mapping of a watchpoint to an other_info
element.

Therefore, the simplest solution is to create an array of struct
other_info that is as large as the watchpoints array in core.c, and pass
the watchpoint index to kcsan_report() for producers and consumers, and
change watchpoints to be removed after reporting is done.

With a default config on a 64-bit system, the array other_infos consumes
~37KiB. For most systems today this is not a problem. On smaller memory
constrained systems, the config value CONFIG_KCSAN_NUM_WATCHPOINTS can
be reduced appropriately.

Overall, this change is a simplification of the prepare_report() code,
and makes some of the checks (such as checking if at least one access is
a write) redundant.

Tested:
$ tools/testing/selftests/rcutorture/bin/kvm.sh \
	--cpus 12 --duration 10 --kconfig "CONFIG_DEBUG_INFO=y \
	CONFIG_KCSAN=y CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n \
	CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n \
	CONFIG_KCSAN_REPORT_ONCE_IN_MS=100000 CONFIG_KCSAN_VERBOSE=y \
	CONFIG_KCSAN_INTERRUPT_WATCHER=y CONFIG_PROVE_LOCKING=y" \
	--configs TREE03
=> No longer hangs and runs to completion as expected.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-13 17:18:11 -07:00
Marco Elver
135c0872d8 kcsan: Introduce report access_info and other_info
Improve readability by introducing access_info and other_info structs,
and in preparation of the following commit in this series replaces the
single instance of other_info with an array of size 1.

No functional change intended.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-04-13 17:18:10 -07:00
Ingo Molnar
3b02a051d2 Linux 5.7-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl6TbaUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhgkH/iWpiKvosA20HJjC
 rBqYeJPxQsgZTuBieWJ+MeVxbpcF7RlM4c+glyvg3QJhHwIEG58dl6LBrQbAyBAR
 aFHNojr1iAYOruVCGnU3pA008YZiwUIDv/ZQ4DF8fmIU2vI2mJ6qHBv3XDl4G2uR
 Nwz8Eu9AgIwZM5coomVOSmoWyFy7Vxmb7W+3t5VmKsvOWx4ib9kyQtOIkvQDEl7j
 XCbWfI0xDQr6LFOm4jnCi5R/LhJ2LIqqIvHHrunbpszM8IwK797jCXz4im+dmd5Y
 +km46N7a8pDqri36xXz1gdBAU3eG7Pt1NyvfjwRVTdX4GquQ2MT0GoojxbLxUP3y
 3pEsQuE=
 =whbL
 -----END PGP SIGNATURE-----

Merge tag 'v5.7-rc1' into locking/kcsan, to resolve conflicts and refresh

Resolve these conflicts:

	arch/x86/Kconfig
	arch/x86/kernel/Makefile

Do a minor "evil merge" to move the KCSAN entry up a bit by a few lines
in the Kconfig to reduce the probability of future conflicts.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-04-13 09:44:39 +02:00
Zhiqiang Liu
3075afdf15 signal: use kill_proc_info instead of kill_pid_info in kill_something_info
signal.c provides kill_proc_info, we can use it instead of kill_pid_info
in kill_something_info func gracefully.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/80236965-f0b5-c888-95ff-855bdec75bb3@huawei.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-04-12 22:46:34 +02:00
Zhiqiang Liu
eaec2b0bd3 signal: check sig before setting info in kill_pid_usb_asyncio
In kill_pid_usb_asyncio, if signal is not valid, we do not need to
set info struct.

Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/f525fd08-1cf7-fb09-d20c-4359145eb940@huawei.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-04-12 22:46:34 +02:00
Linus Torvalds
0785249f8b Time(keeping) updates:
- Fix the time_for_children symlink in /proc/$PID/ so it properly reflects
    that it part of the 'time' namespace
 
  - Add the missing userns limit for the allowed number of time namespaces,
    which was half defined but the actual array member was not added.  This
    went unnoticed as the array has an exessive empty member at the end but
    introduced a user visible regression as the output was corrupted.
 
  - Prevent further silent ucount corruption by adding a BUILD_BUG_ON() to
    catch half updated data.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TFe4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYob4PD/47Qwz2z2mEeO037VbbI2gY4yl/raFo
 5KPWmnwonKrtVaYAldLutA3iaG7bBbUX5fRvbSRNTS6CJIHwwfLSx7/CeWMmIXEX
 0zsBsn5QXjG89lJZXM+ot74yzjvkeoad2g0jEHv92v0WDSXFiAWhkBUwknfNFbpa
 csEjkdpyn2zTVBGBzKVHWHXddkY0o0Q0JOy0EiH09rHGpQktPoLJdYp73VCygoJd
 NRAXhTmBQq85RMcSB3eVTbSPpIuBUzZke9zoio7YZwEjl6bkvSqetPmTdIr57u4s
 ex3PX++64EXD7r8ZW36fPGDqu6v0CH2ILK7QVhwyHAYJo2LQKVd+v25muaFrzfpn
 dSG1SqabWqdIHUoW/76ORyecAFLTzGwDu07UH+6VJbXeLfmuhe/LI3hdDQFph9NQ
 BOBKhaHm8aXmAmvrkxbbAikSkJYVHrAIp5abI4PSYoPaqK1DWnSPaT1cqtaIUgYL
 Mk15z19V9np4lMCH2cucAlap8U9EvQEIfCRRdl+crDu17ZzGID1pwhY2DA8adqcT
 SUfwzzUaykd5TZtDeIe+6G9fsgf/wbSTSSbrNGKlLXDbxx+iNVXErkmx0JXLEHV4
 47cmBwQZ255DzjMfuS4HzCck2MaaP8mDWgcbszgkP+GFnkf9EAP5XNp9st937mbG
 rzP+NkjNCldN9w==
 =wOiC
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull time(keeping) updates from Thomas Gleixner:

 - Fix the time_for_children symlink in /proc/$PID/ so it properly
   reflects that it part of the 'time' namespace

 - Add the missing userns limit for the allowed number of time
   namespaces, which was half defined but the actual array member was
   not added. This went unnoticed as the array has an exessive empty
   member at the end but introduced a user visible regression as the
   output was corrupted.

 - Prevent further silent ucount corruption by adding a BUILD_BUG_ON()
   to catch half updated data.

* tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ucount: Make sure ucounts in /proc/sys/user don't regress again
  time/namespace: Add max_time_namespaces ucount
  time/namespace: Fix time_for_children symlink
2020-04-12 10:13:14 -07:00
Linus Torvalds
590680d139 Scheduler fixes/updates:
- Deduplicate the average computations in the scheduler core and the fair
    class code.
 
  - Fix a raise between runtime distribution and assignement which can cause
    exceeding the quota by up to 70%.
 
  - Prevent negative results in the imbalanace calculation
 
  - Remove a stale warning in the workqueue code which can be triggered
    since the call site was moved out of preempt disabled code. It's a false
    positive.
 
  - Deduplicate the print macros for procfs
 
  - Add the ucmap values to the SCHED_DEBUG procfs output for completness
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TFEoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRoPD/9cTERNZb/aT/SVD3TntBs5bMlnRR2U
 KsyoN7w2bmYGY6XbKdjWX+KorPyJ9YRcUpOfmH+6JmMioM5RrgCtAQ4cXP4q89k3
 6+hdHu+WAhp2AyKr5LAYZK+NYdq0fy/9xx7fdXNri8/QakQCy1vu/u6iiKRMmrEf
 R7zB7v4ddTOTKamUuGBNnmM4GwfrD7td9NksfjDqV4yJ2ZkaQ+apWAPzJK8ixfEa
 TGjVnnUZA82tZRZ+O4RDN2AVgG0CuXYRYPWRfqi3XgQ3d0+ju5mM7q+6TQeEUuAq
 19IoscLhmYGb9yYCY5p0j1AUkyr6FcSFr1SJt/8Jxpyw4JsKw+ANFBA3kkzEBZ+h
 0VcWjUw4S6A6+0HdrqUQYjr5tl7RCY+hOE+QQ/Xvrm4PpWi0L+1tB+kgR7VKNPO8
 Xqfp/CK7Rcgh2/nUHT4uxdwh4ZNsLo39QGTuahyPXzeRVJTEGUjbzktnEUVc9wmd
 OfjpC0Q2DBdXx+vnzH82flv/rtUUYor/Owhpq92E6d50Z7CjTa6xRNbmQUiVKls8
 jqehpRBUg5cB3TI0BKeb5+kzgeGDGpm7MfZbvSwvoL0stdJdzNwwc4nhOLjPkJFK
 R5m7cg82Kx+r/00Vb7k6WgYZ10WTS2Il1OKbgwHl5uTlaKuY7TPEIRVuJQV/XUSG
 88n4jIwpRbLMNw==
 =BVqY
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes/updates from Thomas Gleixner:

 - Deduplicate the average computations in the scheduler core and the
   fair class code.

 - Fix a raise between runtime distribution and assignement which can
   cause exceeding the quota by up to 70%.

 - Prevent negative results in the imbalanace calculation

 - Remove a stale warning in the workqueue code which can be triggered
   since the call site was moved out of preempt disabled code. It's a
   false positive.

 - Deduplicate the print macros for procfs

 - Add the ucmap values to the SCHED_DEBUG procfs output for completness

* tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Add task uclamp values to SCHED_DEBUG procfs
  sched/debug: Factor out printing formats into common macros
  sched/debug: Remove redundant macro define
  sched/core: Remove unused rq::last_load_update_tick
  workqueue: Remove the warning in wq_worker_sleeping()
  sched/fair: Fix negative imbalance in imbalance calculation
  sched/fair: Fix race between runtime distribution and assignment
  sched/fair: Align rq->avg_idle and rq->avg_scan_cost
2020-04-12 10:09:19 -07:00
Linus Torvalds
20e2aa8126 Thre fixes/updates for perf:
- Fix the perf event cgroup tracking which tries to track the cgroup even
    for disabled events.
 
  - Add Ice Lake server support for uncore events
 
  - Disable pagefaults when retrieving the physical address in the sampling
    code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TEbMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYvgD/9Ufht3qrbE29oTS+q2x6PZRc01hvMW
 c3wh/7Anh5yBlTioUEl1tbNfSD1k9hKqMQJF6iPZQ4tzft6JAunstICngqJrJqSY
 si22mgY6QGKrA2+4UuxT7FZD5nrEhy1TrGNIOp/86Y9I59PAlrypWMxq71VUPjgB
 Yy9r6eSJqNX05r9tZy1WMloJaBBieaVvEefK9ZrO4s1XM/RU5pAl+/B1XauK8XN9
 e4bzu7d6r+w23pFuEqk3r4KWozafxczJAqV+w6/Me1lFgNj6m2GKq429bL/Bgj8d
 Re8N4donynPe9yvuDWS4eHD1z0nhOMQhOv85seBwBcEet0PnEPtyw/eUu2zZTE2J
 h1R7l8I3RIs9EmXtdoOuzLFbc8a9sw/lGQJYdP70TMEtLKPUcGkfNhwP196CRbZK
 TBVub3jP0STmo8u7PrkcMb53ZBZ17P317ous52f0xgkFB6YVqv29cB+NEyabjKMI
 fP9ZAaoJ9Rn7T9nji/8KsUcjpCUBhP//YWnf/Vax5PW+hkJhfrLOpesKpX1uAwzd
 fc4QzEXYr/a4YI30IvqSJBsJJZfvFl9ikMaXrv5TbR8nj/eszJInho5+olQj54A7
 SLzCOpVHc3jJ13d/C5Xjo0LhwSqjLv3JtvlFG34XHl+YlZUw0ua9u4Ld5vDIDYgw
 mwHHFp3P/yFAfQ==
 =5kqD
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "Three fixes/updates for perf:

   - Fix the perf event cgroup tracking which tries to track the cgroup
     even for disabled events.

   - Add Ice Lake server support for uncore events

   - Disable pagefaults when retrieving the physical address in the
     sampling code"

* tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Disable page faults when getting phys address
  perf/x86/intel/uncore: Add Ice Lake server uncore support
  perf/cgroup: Correct indirection in perf_less_group_idx()
  perf/core: Fix event cgroup tracking
2020-04-12 10:05:24 -07:00
Linus Torvalds
652fa53caa Three small fixes/updates for the locking core code:
- Plug a task struct reference leak in the percpu rswem implementation.
 
  - Document the refcount interaction with PID_MAX_LIMIT
 
  - Improve the 'invalid wait context' data dump in lockdep so it contains
    all information which is required to decode the problem
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6TEJoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeY6D/42AKqIvpAPiPA+Aod9o8o7qDCoPJQN
 kwCx939+Mwrwm8YS3WJGJvR/jgoGeq33gerY93D0+jkMyhmvYeWbUIEdCenyy0+C
 tkA7pN1j61D7jMqmL8PgrObhQMM8tTdNNsiNxg5qZA8wI2oFBMbAStx5mCficCyJ
 8t3hh/2aRVlNN1MBjo25+jEWEXUIchAzmxnmR+5t/pRYWythOhLuWW05D8b1JoN3
 9zdHj9ZScONlrK+VmpXZl41SokT385arMXcmUuS+zPx6zl4P6yTgeHMLsm4Kt3+j
 1q0hm4fUEEW2IrF1Uc9pxtJh/39UVwvX8Wfj/wJLm3b1akmO+3tQ2cpiee4NHk2H
 WogvBNGG4QeGMrRkg0seiqUE6SOaNPXdGRszPc6EkzwpOywDjHVYV5qbII1VOpbK
 z/TJrtv8g8ejgAvyG0dEtEDydm2SqPAIAZFXj9JQ0I/gUZKTW5rvGQWdMXW6HPif
 wdJJ6xb4DC3f5DrBrXgpkMSdwfea2boIoLDMf9tCcdlRs6vmvY3iRLHy2xBLcRFt
 9yVDUxO40eJUxztbFKuZsQvgJyBxZubyK+Yo9CCvWQ95GfIr2JurIJEcD2fxVyRU
 57/Avt3zt7iViZ9LHJnL4JnWRxftWOdWusF2dsufmnuveCVhtA5B5bl7nGntpIT2
 mfJEDrJ7zzhgyA==
 =Agrh
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Three small fixes/updates for the locking core code:

   - Plug a task struct reference leak in the percpu rswem
     implementation.

   - Document the refcount interaction with PID_MAX_LIMIT

   - Improve the 'invalid wait context' data dump in lockdep so it
     contains all information which is required to decode the problem"

* tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/lockdep: Improve 'invalid wait context' splat
  locking/refcount: Document interaction with PID_MAX_LIMIT
  locking/percpu-rwsem: Fix a task_struct refcount
2020-04-12 09:47:10 -07:00
Linus Torvalds
75e7188397 dma-mapping fixes for 5.7
- fix an integer truncation in dma_direct_get_required_mask
    (Kishon Vijay Abraham)
  - fix the display of dma mapping types (Grygorii Strashko)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl6Rfy8LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPT3w/+MKNVGmyi6nSsruskr1CkLCah5GkHrDA4v+vwwkFZ
 zIeB7df/EOijzfYHlhEkiQWBcPvl9wmVBhFsuaY9XMQ/fJv06CpR5J2xtb+uZN9L
 HmfwCJPZwrHDqsonIiOtIWjyIOARLGC0CfBP+DCPJ/GDG9wys7eyfESh85PQEDvO
 QA4sPjJHuKeTTH2Q6QLM/aiqgr80Q4xrQLCVcgDAqAvqnCYi/E4HXty4pLNGpvJu
 5O2xzmbQT2l3gRXjxt5Ct4pol+nFmiWJ0sBDNVNUqge7vebBiu3e0VjEaeXUPAMq
 hQSrof2x+EwHITV10L82/poIKNQVVXW3aYgItMNXfOSBNqaNx4wVLRsrcMZWeMeJ
 0OHmCy6Sac2dK4QtY/NtZFLL1zW4ajsj8X975RjqBMcL6QAdCMxNSWGEY/j/fMQy
 FV1spy8FBfvXH+0O2POkYNBx/eICnMbGQ2ed5NZ8ONAl8wFLPM+2qIwmgb51a3TG
 8jR613Zqyw0aMnky0MUwvj5TlwwUyyXbGKpizl0vkFw7/NYXroNMaYWJSHF/ddFp
 vUChNqYhgCFY8FOhanFadyPxqU1InEoOkUS8/ij7NAR6cCrxsLuAohMYSNQJes3R
 IvmU36FtjO53rBdHaN4Eizk4EHVaZDxoll05UPU/B5IvzXqR6zy0PphoOxfzvmmq
 Qfk=
 =Af93
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.7-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix an integer truncation in dma_direct_get_required_mask
   (Kishon Vijay Abraham)

 - fix the display of dma mapping types (Grygorii Strashko)

* tag 'dma-mapping-5.7-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-debug: fix displaying of dma allocation type
  dma-direct: fix data truncation in dma_direct_get_required_mask()
2020-04-11 11:34:36 -07:00
Linus Torvalds
5b8b9d0c6d Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
   gup, hugetlb, pagemap, memremap)

 - Various other things (hfs, ocfs2, kmod, misc, seqfile)

* akpm: (34 commits)
  ipc/util.c: sysvipc_find_ipc() should increase position index
  kernel/gcov/fs.c: gcov_seq_next() should increase position index
  fs/seq_file.c: seq_read(): add info message about buggy .next functions
  drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
  change email address for Pali Rohár
  selftests: kmod: test disabling module autoloading
  selftests: kmod: fix handling test numbers above 9
  docs: admin-guide: document the kernel.modprobe sysctl
  fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
  kmod: make request_module() return an error when autoloading is disabled
  mm/memremap: set caching mode for PCI P2PDMA memory to WC
  mm/memory_hotplug: add pgprot_t to mhp_params
  powerpc/mm: thread pgprot_t through create_section_mapping()
  x86/mm: introduce __set_memory_prot()
  x86/mm: thread pgprot_t through init_memory_mapping()
  mm/memory_hotplug: rename mhp_restrictions to mhp_params
  mm/memory_hotplug: drop the flags field from struct mhp_restrictions
  mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
  mm/vma: introduce VM_ACCESS_FLAGS
  mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
  ...
2020-04-10 17:57:48 -07:00
Vasily Averin
f4d74ef622 kernel/gcov/fs.c: gcov_seq_next() should increase position index
If seq_file .next function does not change position index, read after
some lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Waiman Long <longman@redhat.com>
Link: http://lkml.kernel.org/r/f65c6ee7-bd00-f910-2f8a-37cc67e4ff88@virtuozzo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10 15:36:22 -07:00
Eric Biggers
d7d27cfc5c kmod: make request_module() return an error when autoloading is disabled
Patch series "module autoloading fixes and cleanups", v5.

This series fixes a bug where request_module() was reporting success to
kernel code when module autoloading had been completely disabled via
'echo > /proc/sys/kernel/modprobe'.

It also addresses the issues raised on the original thread
(https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u)
bydocumenting the modprobe sysctl, adding a self-test for the empty path
case, and downgrading a user-reachable WARN_ONCE().

This patch (of 4):

It's long been possible to disable kernel module autoloading completely
(while still allowing manual module insertion) by setting
/proc/sys/kernel/modprobe to the empty string.

This can be preferable to setting it to a nonexistent file since it
avoids the overhead of an attempted execve(), avoids potential
deadlocks, and avoids the call to security_kernel_module_request() and
thus on SELinux-based systems eliminates the need to write SELinux rules
to dontaudit module_request.

However, when module autoloading is disabled in this way,
request_module() returns 0.  This is broken because callers expect 0 to
mean that the module was successfully loaded.

Apparently this was never noticed because this method of disabling
module autoloading isn't used much, and also most callers don't use the
return value of request_module() since it's always necessary to check
whether the module registered its functionality or not anyway.

But improperly returning 0 can indeed confuse a few callers, for example
get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit:

	if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
		fs = __get_fs_type(name, len);
		WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
	}

This is easily reproduced with:

	echo > /proc/sys/kernel/modprobe
	mount -t NONEXISTENT none /

It causes:

	request_module fs-NONEXISTENT succeeded, but still no fs?
	WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0
	[...]

This should actually use pr_warn_once() rather than WARN_ONCE(), since
it's also user-reachable if userspace immediately unloads the module.
Regardless, request_module() should correctly return an error when it
fails.  So let's make it return -ENOENT, which matches the error when
the modprobe binary doesn't exist.

I've also sent patches to document and test this case.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Ben Hutchings <benh@debian.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org
Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10 15:36:22 -07:00
Sergey Senozhatsky
ab6f762f0f printk: queue wake_up_klogd irq_work only if per-CPU areas are ready
printk_deferred(), similarly to printk_safe/printk_nmi, does not
immediately attempt to print a new message on the consoles, avoiding
calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
which potentially can deadlock the system.

Those printk() flavors, instead, rely on per-CPU flush irq_work to print
messages from safer contexts.  For same reasons (recursive scheduler or
timekeeping calls) printk() uses per-CPU irq_work in order to wake up
user space syslog/kmsg readers.

However, only printk_safe/printk_nmi do make sure that per-CPU areas
have been initialised and that it's safe to modify per-CPU irq_work.
This means that, for instance, should printk_deferred() be invoked "too
early", that is before per-CPU areas are initialised, printk_deferred()
will perform illegal per-CPU access.

Lech Perczak [0] reports that after commit 1b710b1b10 ("char/random:
silence a lockdep splat with printk()") user-space syslog/kmsg readers
are not able to read new kernel messages.

The reason is printk_deferred() being called too early (as was pointed
out by Petr and John).

Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
areas are initialized.

Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
Reported-by: Lech Perczak <l.perczak@camlintechnologies.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Jann Horn <jannh@google.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10 13:18:57 -07:00
Linus Torvalds
87ad46e601 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc fix from Eric Biederman:
 "A brown paper bag slipped through my proc changes, and syzcaller
  caught it when the code ended up in your tree.

  I have opted to fix it the simplest cleanest way I know how, so there
  is no reasonable chance for the bug to repeat"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Use a dedicated lock in struct pid
2020-04-10 12:59:56 -07:00
Linus Torvalds
bbec2a2dc3 More power management updates for 5.7-rc1
Rework compat ioctl handling in the user space hibernation
 interface (Christoph Hellwig) and fix a typo in a function
 name in the cpuidle haltpoll driver (Yihao Wu).
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl6QPXkSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRx918P+NU9lM++pMi2Ng4wHmtRtHW9mH4M6L54
 /LTjDNZ4IE/v6JlK6hzaFUAhrFViI23gmUat8a7TT/o1NUsPS0QHxatFK+nGbjEk
 blB/rBHWpC5vGo8SZbqmvI0hbRn0Q6Ah5I+iV+KAk9Z76mDEMNHZKpP2CfiqSVQE
 QYsUIJOSUo/CMT1SZRE/xDrvoU418Y1Ed6a6Kn9Ki5uXvqDTPHoAyETZ9M6tLphP
 kGBpbnbkHx3FyYZ3EyjVEd8O6cDsxS2gIWu6YUCB31N4G7v3bJ6rfgperTCN999J
 8eQY9rNVlaAWIP9t0ObC3xOpoaNUcZC+V+yaHev3LU/6LP4Sued8cNBxQn26/RWN
 vyM5M6K0OphjPll9QfVfZFRUcuoBMyAtgVYw8mwT64GVJt3ukbd2QYDQHORXBQEC
 ziTtfLQEuAiUw/DRoTGo2XYDTlnd2nu9mFoTRU6juOawOwgwPbQldOtSJiMtCDoR
 cyaQH/t528w10jGCa+mIJZToTIet+gN6ui83M3kQdxMTa5ulgPClU8Kujn1wPgKX
 6jFrf+NJ6SNm6A/EhPk9t/soe7hhzyGwqx351aMQpZGX6UH4hQQPXgC0tyZ71Uj5
 UJ7Ys5RHcXg4kub/FJsfhv/3wdSPiekwe+U89UCQY5lxXC2x3iVjp50B4auCjW23
 tQVHpAvegWg=
 =h4ud
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.7-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "Rework compat ioctl handling in the user space hibernation interface
  (Christoph Hellwig) and fix a typo in a function name in the cpuidle
  haltpoll driver (Yihao Wu)"

* tag 'pm-5.7-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle-haltpoll: Fix small typo
  PM / sleep: handle the compat case in snapshot_set_swap_area()
  PM / sleep: move SNAPSHOT_SET_SWAP_AREA handling into a helper
2020-04-10 09:50:00 -07:00
David S. Miller
40fc7ad2c8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-04-10

The following pull-request contains BPF updates for your *net* tree.

We've added 13 non-merge commits during the last 7 day(s) which contain
a total of 13 files changed, 137 insertions(+), 43 deletions(-).

The main changes are:

1) JIT code emission fixes for riscv and arm32, from Luke Nelson and Xi Wang.

2) Disable vmlinux BTF info if GCC_PLUGIN_RANDSTRUCT is used, from Slava Bacherikov.

3) Fix oob write in AF_XDP when meta data is used, from Li RongQing.

4) Fix bpf_get_link_xdp_id() handling on single prog when flags are specified,
   from Andrey Ignatov.

5) Fix sk_assign() BPF helper for request sockets that can have sk_reuseport
   field uninitialized, from Joe Stringer.

6) Fix mprotect() test case for the BPF LSM, from KP Singh.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09 17:39:22 -07:00
Linus Torvalds
c0cc271173 Modules updates for v5.7
Summary of modules changes for the 5.7 merge window:
 
 - Trivial zero-length array to flexible-array cleanup
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEVrp26glSWYuDNrCUwEV+OM47wXIFAl6O4MsQHGpleXVAa2Vy
 bmVsLm9yZwAKCRDARX44zjvBci4aEACsoeVA9rJGsYR7Z08xq9VF5/mPI9oFsXHQ
 HBnSr8QzOqpKBpAoE0Pp5WqhgbC0+5HrhcbMqE8yGHs+WJ7EDvKB8l8AF6OerRVq
 kmMlxYJwCJ7rJnnn6K8By4Q/13ZeCSgJfz8v8KJsqLADfm81L5/fGKvt4BThdyap
 fCuwv4fo0zq7JeEekCyvjaYfcqLHT6OpbOrkR9XQW6U2XTPC5nPFvmKwJvhTED/n
 18HVGRGxrfsPLcTkd/njPEEZetngKICTK6iRJk0ePH2cdxiNAPEInI/BZFOXge8H
 wo9l6UEGSlX1DicdokCc27SQLzd4R78xykT+Z1CrFZo+6INpY5F/+J2LNnknqP25
 BGqRpOu6HJZcpOUNTGaBk2c0BmXzKQzqCE3bq2ClI8ifVjwAqQrODbtie9eIqr/w
 5Ocg1np0z9F2NFjIiJjD288zASmBpnfgpwgkOsiQAGW8j6Xd40mzqH/Vu6/iT856
 +XC3FJvossm2XlB5D+koyWDYPtTQf1LbGt3DhCZ/5Xd7dFIRt51g8jxBJEWAyTJy
 EVxAohtEM9UTyqn9VymwUUrLPGV6JCG1dbfz+02KuqRklM59ZJQdKypv0vxhqdcv
 HS9N77xALC6AUdSupOIZiw+jSHNE2WgkDkdtJ6fS2JO2veqJcyLBEflMikT3+1l7
 PXji6Pt+9Q==
 =HWeX
 -----END PGP SIGNATURE-----

Merge tag 'modules-for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Only a small cleanup this time around: a trivial conversion of
  zero-length arrays to flexible arrays"

* tag 'modules-for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  kernel: module: Replace zero-length array with flexible-array member
2020-04-09 12:52:34 -07:00
Tejun Heo
d8ef4b38cb Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"
This reverts commit 9a9e97b2f1 ("cgroup: Add memory barriers to plug
cgroup_rstat_updated() race window").

The commit was added in anticipation of memcg rstat conversion which needed
synchronous accounting for the event counters (e.g. oom kill count). However,
the conversion didn't get merged due to percpu memory overhead concern which
couldn't be addressed at the time.

Unfortunately, the patch's addition of smp_mb() to cgroup_rstat_updated()
meant that every scheduling event now had to go through an additional full
barrier and Mel Gorman noticed it as 1% regression in netperf UDP_STREAM test.

There's no need to have this barrier in tree now and even if we need
synchronous accounting in the future, the right thing to do is separating that
out to a separate function so that hot paths which don't care about
synchronous behavior don't have to pay the overhead of the full barrier. Let's
revert.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200409154413.GK3818@techsingularity.net
Cc: v4.18+
2020-04-09 14:55:46 -04:00
Eric W. Biederman
63f818f46a proc: Use a dedicated lock in struct pid
syzbot wrote:
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.6.0-syzkaller #0 Not tainted
> --------------------------------------------------------
> swapper/1/0 just changed the state of lock:
> ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
> but this lock took another, SOFTIRQ-unsafe lock in the past:
>  (&pid->wait_pidfd){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
>  Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&pid->wait_pidfd);
>                                local_irq_disable();
>                                lock(tasklist_lock);
>                                lock(&pid->wait_pidfd);
>   <Interrupt>
>     lock(tasklist_lock);
>
>  *** DEADLOCK ***
>
> 4 locks held by swapper/1/0:

The problem is that because wait_pidfd.lock is taken under the tasklist
lock.  It must always be taken with irqs disabled as tasklist_lock can be
taken from interrupt context and if wait_pidfd.lock was already taken this
would create a lock order inversion.

Oleg suggested just disabling irqs where I have added extra calls to
wait_pidfd.lock.  That should be safe and I think the code will eventually
do that.  It was rightly pointed out by Christian that sharing the
wait_pidfd.lock was a premature optimization.

It is also true that my pre-merge window testing was insufficient.  So
remove the premature optimization and give struct pid a dedicated lock of
it's own for struct pid things.  I have verified that lockdep sees all 3
paths where we take the new pid->lock and lockdep does not complain.

It is my current day dream that one day pid->lock can be used to guard the
task lists as well and then the tasklist_lock won't need to be held to
deliver signals.  That will require taking pid->lock with irqs disabled.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
Fixes: 7bc3e6e55a ("proc: Use a list of inodes to flush from proc")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-09 12:15:35 -05:00
Grygorii Strashko
9bb50ed747 dma-debug: fix displaying of dma allocation type
The commit 2e05ea5cdc ("dma-mapping: implement dma_map_single_attrs using
dma_map_page_attrs") removed "dma_debug_page" enum, but missed to update
type2name string table. This causes incorrect displaying of dma allocation
type.
Fix it by removing "page" string from type2name string table and switch to
use named initializers.

Before (dma_alloc_coherent()):
k3-ringacc 4b800000.ringacc: scather-gather idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable
k3-ringacc 4b800000.ringacc: scather-gather idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable

After:
k3-ringacc 4b800000.ringacc: coherent idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable
k3-ringacc 4b800000.ringacc: coherent idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable

Fixes: 2e05ea5cdc ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-08 21:46:57 +02:00
Kishon Vijay Abraham I
cdcda0d1f8 dma-direct: fix data truncation in dma_direct_get_required_mask()
The upper 32-bit physical address gets truncated inadvertently
when dma_direct_get_required_mask() invokes phys_to_dma_direct().
This results in dma_addressing_limited() return incorrect value
when used in platforms with LPAE enabled.
Fix it here by explicitly type casting 'max_pfn' to phys_addr_t
in order to prevent overflow of intermediate value while evaluating
'(max_pfn - 1) << PAGE_SHIFT'.

Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-08 20:52:24 +02:00
Peter Zijlstra
9a019db0b6 locking/lockdep: Improve 'invalid wait context' splat
The 'invalid wait context' splat doesn't print all the information
required to reconstruct / validate the error, specifically the
irq-context state is missing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-04-08 12:05:07 +02:00
Qian Cai
d22cc7f67d locking/percpu-rwsem: Fix a task_struct refcount
The following commit:

  7f26482a87 ("locking/percpu-rwsem: Remove the embedded rwsem")

introduced task_struct memory leaks due to messing up the task_struct
refcount.

At the beginning of percpu_rwsem_wake_function(), it calls get_task_struct(),
but if the trylock failed, it will remain in the waitqueue. However, it
will run percpu_rwsem_wake_function() again with get_task_struct() to
increase the refcount but then only call put_task_struct() once the trylock
succeeded.

Fix it by adjusting percpu_rwsem_wake_function() a bit to guard against
when percpu_rwsem_wait() observing !private, terminating the wait and
doing a quick exit() while percpu_rwsem_wake_function() then doing
wake_up_process(p) as a use-after-free.

Fixes: 7f26482a87 ("locking/percpu-rwsem: Remove the embedded rwsem")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200330213002.2374-1-cai@lca.pw
2020-04-08 12:05:06 +02:00
Valentin Schneider
96e74ebf8d sched/debug: Add task uclamp values to SCHED_DEBUG procfs
Requested and effective uclamp values can be a bit tricky to decipher when
playing with cgroup hierarchies. Add them to a task's procfs when
SCHED_DEBUG is enabled.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200226124543.31986-4-valentin.schneider@arm.com
2020-04-08 11:35:27 +02:00
Valentin Schneider
9e3bf9469c sched/debug: Factor out printing formats into common macros
The printing macros in debug.c keep redefining the same output
format. Collect each output format in a single definition, and reuse that
definition in the other macros. While at it, add a layer of parentheses and
replace printf's  with the newly introduced macros.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200226124543.31986-3-valentin.schneider@arm.com
2020-04-08 11:35:26 +02:00
Valentin Schneider
c745a6212c sched/debug: Remove redundant macro define
Most printing macros for procfs are defined globally in debug.c, and they
are re-defined (to the exact same thing) within proc_sched_show_task().

Get rid of the duplicate defines.

Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200226124543.31986-2-valentin.schneider@arm.com
2020-04-08 11:35:24 +02:00
Vincent Donnefort
275b2f6723 sched/core: Remove unused rq::last_load_update_tick
The following commit:

  5e83eafbfd ("sched/fair: Remove the rq->cpu_load[] update code")

eliminated the last use case for rq->last_load_update_tick, so remove
the field as well.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1584710495-308969-1-git-send-email-vincent.donnefort@arm.com
2020-04-08 11:35:23 +02:00
Sebastian Andrzej Siewior
62849a9612 workqueue: Remove the warning in wq_worker_sleeping()
The kernel test robot triggered a warning with the following race:
   task-ctx A                            interrupt-ctx B
 worker
  -> process_one_work()
    -> work_item()
      -> schedule();
         -> sched_submit_work()
           -> wq_worker_sleeping()
             -> ->sleeping = 1
               atomic_dec_and_test(nr_running)
         __schedule();                *interrupt*
                                       async_page_fault()
                                       -> local_irq_enable();
                                       -> schedule();
                                          -> sched_submit_work()
                                            -> wq_worker_sleeping()
                                               -> if (WARN_ON(->sleeping)) return
                                          -> __schedule()
                                            ->  sched_update_worker()
                                              -> wq_worker_running()
                                                 -> atomic_inc(nr_running);
                                                 -> ->sleeping = 0;

      ->  sched_update_worker()
        -> wq_worker_running()
          if (!->sleeping) return

In this context the warning is pointless everything is fine.
An interrupt before wq_worker_sleeping() will perform the ->sleeping
assignment (0 -> 1 > 0) twice.
An interrupt after wq_worker_sleeping() will trigger the warning and
nr_running will be decremented (by A) and incremented once (only by B, A
will skip it). This is the case until the ->sleeping is zeroed again in
wq_worker_running().

Remove the WARN statement because this condition may happen. Document
that preemption around wq_worker_sleeping() needs to be disabled to
protect ->sleeping and not just as an optimisation.

Fixes: 6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
2020-04-08 11:35:20 +02:00
Aubrey Li
111688ca1c sched/fair: Fix negative imbalance in imbalance calculation
A negative imbalance value was observed after imbalance calculation,
this happens when the local sched group type is group_fully_busy,
and the average load of local group is greater than the selected
busiest group. Fix this problem by comparing the average load of the
local and busiest group before imbalance calculation formula.

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/1585201349-70192-1-git-send-email-aubrey.li@intel.com
2020-04-08 11:35:20 +02:00
Huaixin Chang
26a8b12747 sched/fair: Fix race between runtime distribution and assignment
Currently, there is a potential race between distribute_cfs_runtime()
and assign_cfs_rq_runtime(). Race happens when cfs_b->runtime is read,
distributes without holding lock and finds out there is not enough
runtime to charge against after distribution. Because
assign_cfs_rq_runtime() might be called during distribution, and use
cfs_b->runtime at the same time.

Fibtest is the tool to test this race. Assume all gcfs_rq is throttled
and cfs period timer runs, slow threads might run and sleep, returning
unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local
pool. If all this happens sufficiently quickly, cfs_b->runtime will drop
a lot. If runtime distributed is large too, over-use of runtime happens.

A runtime over-using by about 70 percent of quota is seen when we
test fibtest on a 96-core machine. We run fibtest with 1 fast thread and
95 slow threads in test group, configure 10ms quota for this group and
see the CPU usage of fibtest is 17.0%, which is far more than the
expected 10%.

On a smaller machine with 32 cores, we also run fibtest with 96
threads. CPU usage is more than 12%, which is also more than expected
10%. This shows that on similar workloads, this race do affect CPU
bandwidth control.

Solve this by holding lock inside distribute_cfs_runtime().

Fixes: c06f04c704 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
2020-04-08 11:35:19 +02:00
Valentin Schneider
d76343c6b2 sched/fair: Align rq->avg_idle and rq->avg_scan_cost
sched/core.c uses update_avg() for rq->avg_idle and sched/fair.c uses an
open-coded version (with the exact same decay factor) for
rq->avg_scan_cost. On top of that, select_idle_cpu() expects to be able to
compare these two fields.

The only difference between the two is that rq->avg_scan_cost is computed
using a pure division rather than a shift. Turns out it actually matters,
first of all because the shifted value can be negative, and the standard
has this to say about it:

  """
  The result of E1 >> E2 is E1 right-shifted E2 bit positions. [...] If E1
  has a signed type and a negative value, the resulting value is
  implementation-defined.
  """

Not only this, but (arithmetic) right shifting a negative value (using 2's
complement) is *not* equivalent to dividing it by the corresponding power
of 2. Let's look at a few examples:

  -4      -> 0xF..FC
  -4 >> 3 -> 0xF..FF == -1 != -4 / 8

  -8      -> 0xF..F8
  -8 >> 3 -> 0xF..FF == -1 == -8 / 8

  -9      -> 0xF..F7
  -9 >> 3 -> 0xF..FE == -2 != -9 / 8

Make update_avg() use a division, and export it to the private scheduler
header to reuse it where relevant. Note that this still lets compilers use
a shift here, but should prevent any unwanted surprise. The disassembly of
select_idle_cpu() remains unchanged on arm64, and ttwu_do_wakeup() gains 2
instructions; the diff sort of looks like this:

  - sub x1, x1, x0
  + subs x1, x1, x0 // set condition codes
  + add x0, x1, #0x7
  + csel x0, x0, x1, mi // x0 = x1 < 0 ? x0 : x1
    add x0, x3, x0, asr #3

which does the right thing (i.e. gives us the expected result while still
using an arithmetic shift)

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200330090127.16294-1-valentin.schneider@arm.com
2020-04-08 11:35:18 +02:00
Jiri Olsa
d3296fb372 perf/core: Disable page faults when getting phys address
We hit following warning when running tests on kernel
compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y:

 WARNING: CPU: 19 PID: 4472 at mm/gup.c:2381 __get_user_pages_fast+0x1a4/0x200
 CPU: 19 PID: 4472 Comm: dummy Not tainted 5.6.0-rc6+ #3
 RIP: 0010:__get_user_pages_fast+0x1a4/0x200
 ...
 Call Trace:
  perf_prepare_sample+0xff1/0x1d90
  perf_event_output_forward+0xe8/0x210
  __perf_event_overflow+0x11a/0x310
  __intel_pmu_pebs_event+0x657/0x850
  intel_pmu_drain_pebs_nhm+0x7de/0x11d0
  handle_pmi_common+0x1b2/0x650
  intel_pmu_handle_irq+0x17b/0x370
  perf_event_nmi_handler+0x40/0x60
  nmi_handle+0x192/0x590
  default_do_nmi+0x6d/0x150
  do_nmi+0x2f9/0x3c0
  nmi+0x8e/0xd7

While __get_user_pages_fast() is IRQ-safe, it calls access_ok(),
which warns on:

  WARN_ON_ONCE(!in_task() && !pagefault_disabled())

Peter suggested disabling page faults around __get_user_pages_fast(),
which gets rid of the warning in access_ok() call.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200407141427.3184722-1-jolsa@kernel.org
2020-04-08 11:33:46 +02:00
Ian Rogers
24fb6b8e7c perf/cgroup: Correct indirection in perf_less_group_idx()
The void* in perf_less_group_idx() is to a member in the array which points
at a perf_event*, as such it is a perf_event**.

Reported-By: John Sperbeck <jsperbeck@google.com>
Fixes: 6eef8a7116 ("perf/core: Use min_heap in visit_groups_merge()")
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200321164331.107337-1-irogers@google.com
2020-04-08 11:33:45 +02:00
Peter Zijlstra
33238c5045 perf/core: Fix event cgroup tracking
Song reports that installing cgroup events is broken since:

  db0503e4f6 ("perf/core: Optimize perf_install_in_event()")

The problem being that cgroup events try to track cpuctx->cgrp even
for disabled events, which is pointless and actively harmful since the
above commit. Rework the code to have explicit enable/disable hooks
for cgroup events, such that we can limit cgroup tracking to active
events.

More specifically, since the above commit disabled events are no
longer added to their context from the 'right' CPU, and we can't
access things like the current cgroup for a remote CPU.

Cc: <stable@vger.kernel.org> # v5.5+
Fixes: db0503e4f6 ("perf/core: Optimize perf_install_in_event()")
Reported-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200318193337.GB20760@hirez.programming.kicks-ass.net
2020-04-08 11:33:44 +02:00
Jan Kara
0f538e3e71 ucount: Make sure ucounts in /proc/sys/user don't regress again
Commit 769071ac9f "ns: Introduce Time Namespace" broke reporting of
inotify ucounts (max_inotify_instances, max_inotify_watches) in
/proc/sys/user because it has added UCOUNT_TIME_NAMESPACES into enum
ucount_type but didn't properly update reporting in
kernel/ucount.c:setup_userns_sysctls(). This problem got fixed in commit
eeec26d5da "time/namespace: Add max_time_namespaces ucount".

Add BUILD_BUG_ON to catch a similar problem in the future.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lkml.kernel.org/r/20200407154643.10102-1-jack@suse.cz
2020-04-07 21:51:27 +02:00
Gustavo A. R. Silva
6524d79413 kernel/gcov/fs.c: replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by this
change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied.  As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200302224851.GA26467@embeddedor
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Gustavo A. R. Silva
7ff87182d1 gcov: gcc_3_4: replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by this
change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied.  As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200302224501.GA14175@embeddedor
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Gustavo A. R. Silva
fba4168ede gcov: gcc_4_7: replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning in
case the flexible array does not occur last in the structure, which will
help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by this
change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied.  As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Link: http://lkml.kernel.org/r/20200213152241.GA877@embeddedor
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Qiujun Huang
06d4f8152a kernel/kmod.c: fix a typo "assuems" -> "assumes"
There is a typo in comment.  Fix it.  s/assuems/assumes/

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Link: http://lkml.kernel.org/r/1585891029-6450-1-git-send-email-hqjagain@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Will Deacon
0bd476e6c6 kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()
kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to
modules despite having no in-tree users and being wide open to abuse by
out-of-tree modules that can use them as a method to invoke arbitrary
non-exported kernel functions.

Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol().

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: http://lkml.kernel.org/r/20200221114404.14641-4-will@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:44 -07:00
Masahiro Yamada
889b3c1245 compiler: remove CONFIG_OPTIMIZE_INLINING entirely
Commit ac7c3e4ff4 ("compiler: enable CONFIG_OPTIMIZE_INLINING
forcibly") made this always-on option. We released v5.4 and v5.5
including that commit.

Remove the CONFIG option and clean up the code now.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Miller <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Nathan Chancellor
63174f61df kernel/extable.c: use address-of operator on section symbols
Clang warns:

../kernel/extable.c:37:52: warning: array comparison always evaluates to
a constant [-Wtautological-compare]
        if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) {
                                                          ^
1 warning generated.

These are not true arrays, they are linker defined symbols, which are just
addresses.  Using the address of operator silences the warning and does
not change the resulting assembly with either clang/ld.lld or gcc/ld
(tested with diff + objdump -Dr).

Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/892
Link: http://lkml.kernel.org/r/20200219202036.45702-1-natechancellor@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Alexey Dobriyan
d919b33daf proc: faster open/read/close with "permanent" files
Now that "struct proc_ops" exist we can start putting there stuff which
could not fly with VFS "struct file_operations"...

Most of fs/proc/inode.c file is dedicated to make open/read/.../close
reliable in the event of disappearing /proc entries which usually happens
if module is getting removed.  Files like /proc/cpuinfo which never
disappear simply do not need such protection.

Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
"permanent" files.

Enable "permanent" flag for

	/proc/cpuinfo
	/proc/kmsg
	/proc/modules
	/proc/slabinfo
	/proc/stat
	/proc/sysvipc/*
	/proc/swaps

More will come once I figure out foolproof way to prevent out module
authors from marking their stuff "permanent" for performance reasons
when it is not.

This should help with scalability: benchmark is "read /proc/cpuinfo R times
by N threads scattered over the system".

	N	R	t, s (before)	t, s (after)
	-----------------------------------------------------
	64	4096	1.582458	1.530502	-3.2%
	256	4096	6.371926	6.125168	-3.9%
	1024	4096	25.64888	24.47528	-4.6%

Benchmark source:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
int N;
const char *filename;
int R;

int xxx = 0;

int glue(int n)
{
	cpu_set_t m;
	CPU_ZERO(&m);
	CPU_SET(n, &m);
	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
}

void f(int n)
{
	glue(n % NR_CPUS);

	while (*(volatile int *)&xxx == 0) {
	}

	for (int i = 0; i < R; i++) {
		int fd = open(filename, O_RDONLY);
		char buf[4096];
		ssize_t rv = read(fd, buf, sizeof(buf));
		asm volatile ("" :: "g" (rv));
		close(fd);
	}
}

int main(int argc, char *argv[])
{
	if (argc < 4) {
		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
";
		return 1;
	}

	N = atoi(argv[1]);
	filename = argv[2];
	R = atoi(argv[3]);

	for (int i = 0; i < NR_CPUS; i++) {
		if (glue(i) == 0)
			break;
	}

	std::vector<std::thread> T;
	T.reserve(N);
	for (int i = 0; i < N; i++) {
		T.emplace_back(f, i);
	}

	auto t0 = std::chrono::system_clock::now();
	{
		*(volatile int *)&xxx = 1;
		for (auto& t: T) {
			t.join();
		}
	}
	auto t1 = std::chrono::system_clock::now();
	std::chrono::duration<double> dt = t1 - t0;
	std::cout << dt.count() << '
';

	return 0;
}

P.S.:
Explicit randomization marker is added because adding non-function pointer
will silently disable structure layout randomization.

[akpm@linux-foundation.org: coding style fixes]
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:42 -07:00
Anshuman Khandual
03911132aa mm/vma: replace all remaining open encodings with is_vm_hugetlb_page()
This replaces all remaining open encodings with is_vm_hugetlb_page().

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:37 -07:00
Anshuman Khandual
3122e80efc mm/vma: make vma_is_accessible() available for general use
Lets move vma_is_accessible() helper to include/linux/mm.h which makes it
available for general use.  While here, this replaces all remaining open
encodings for VMA access check with vma_is_accessible().

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:37 -07:00
Li Xinhai
e39a4b332d mm: set vm_next and vm_prev to NULL in vm_area_dup()
Set ->vm_next and ->vm_prev to NULL to prevent potential misuse from the
new duplicated vma.

Currently, only in fork path there are misuse for handling anon_vma.  No
other bugs been revealed with this patch applied.

Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:37 -07:00
Li Xinhai
93949bb21b mm: don't prepare anon_vma if vma has VM_WIPEONFORK
Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path".

This patchset fixes the misuse of parenet anon_vma, which mainly caused by
child vma's vm_next and vm_prev are left same as its parent after
duplicate vma.  Finally, code reached parent vma's neighbor by referring
pointer of child vma and executed wrong logic.

The first two patches fix relevant issues, and the third patch sets
vm_next and vm_prev to NULL when duplicate vma to prevent potential misuse
in future.

Effects of the first bug is that causes rmap code to check both parent and
child's page table, although a page couldn't be mapped by both parent and
child, because child vma has WIPEONFORK so all pages mapped by child are
'new' and not relevant to parent.

Effects of the second bug is that the relationship of anon_vma of parent
and child are totallyconvoluted.  It would cause 'son', 'grandson', ...,
etc, to share 'parent' anon_vma, which disobey the design rule of reusing
anon_vma (the rule to be followed is that reusing should among vma of same
process, and vma should not gone through fork).

So, both issues should cause unnecessary rmap walking and have unexpected
complexity.

These two issues would not be directly visible, I used debugging code to
check the anon_vma pointers of parent and child when inspecting the
suspicious implementation of issue #2, then find the problem.

This patch (of 3):

In dup_mmap(), anon_vma_prepare() is called for vma has VM_WIPEONFORK, and
parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and
->vm_prev as its parent vma.  That allows anon_vma used by parent been
mistakenly shared by child (find_mergeable_anon_vma() will do this reuse
work).

Besides this issue, call anon_vma_prepare() should be avoided because we
don't copy page for this vma.  Preparing anon_vma will be handled during
fault.

Fixes: d2cd9ede6e ("mm,fork: introduce MADV_WIPEONFORK")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Link: http://lkml.kernel.org/r/1581150928-3214-2-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:37 -07:00
Dmitry Safonov
eeec26d5da time/namespace: Add max_time_namespaces ucount
Michael noticed that userns limit for number of time namespaces is missing.

Furthermore, time namespace introduced UCOUNT_TIME_NAMESPACES, but didn't
introduce an array member in user_table[]. It would make array's
initialisation OOB write, but by luck the user_table array has an excessive
empty member (all accesses to the array are limited with UCOUNT_COUNTS - so
it silently reuses the last free member.

Fixes user-visible regression: max_inotify_instances by reason of the
missing UCOUNT_ENTRY() has limited max number of namespaces instead of the
number of inotify instances.

Fixes: 769071ac9f ("ns: Introduce Time Namespace")
Reported-by: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: stable@kernel.org
Link: https://lkml.kernel.org/r/20200406171342.128733-1-dima@arista.com
2020-04-07 12:37:21 +02:00
Michael Kerrisk (man-pages)
b801f1e22c time/namespace: Fix time_for_children symlink
Looking at the contents of the /proc/PID/ns/time_for_children symlink shows
an anomaly:

$ ls -l /proc/self/ns/* |awk '{print $9, $10, $11}'
...
/proc/self/ns/pid -> pid:[4026531836]
/proc/self/ns/pid_for_children -> pid:[4026531836]
/proc/self/ns/time -> time:[4026531834]
/proc/self/ns/time_for_children -> time_for_children:[4026531834]
/proc/self/ns/user -> user:[4026531837]
...

The reference for 'time_for_children' should be a 'time' namespace, just as
the reference for 'pid_for_children' is a 'pid' namespace.  In other words,
the above time_for_children link should read:

/proc/self/ns/time_for_children -> time:[4026531834]

Fixes: 769071ac9f ("ns: Introduce Time Namespace")
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dmitry Safonov <dima@arista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/a2418c48-ed80-3afe-116e-6611cb799557@gmail.com
2020-04-07 12:37:21 +02:00
Qiujun Huang
0ac16296ff bpf: Fix a typo "inacitve" -> "inactive"
There is a typo in struct bpf_lru_list's next_inactive_rotation
description, thus fix s/inacitve/inactive/.

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1585901254-30377-1-git-send-email-hqjagain@gmail.com
2020-04-06 21:54:10 +02:00
Christoph Hellwig
0f5c4c6e0e PM / sleep: handle the compat case in snapshot_set_swap_area()
Use in_compat_syscall to copy directly from the 32-bit ABI structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-04-06 21:42:36 +02:00
Christoph Hellwig
88a77559cc PM / sleep: move SNAPSHOT_SET_SWAP_AREA handling into a helper
Move the handling of the SNAPSHOT_SET_SWAP_AREA ioctl from the main
ioctl helper into a helper function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-04-06 21:42:36 +02:00
Linus Torvalds
ef05db16bb Additional power management updates for 5.7-rc1
- Fix corner-case suspend-to-idle wakeup issue on systems where
    the ACPI SCI is shared with another wakeup source (Hans de Goede).
 
  - Add document describing system-wide suspend and resume code flows
    to the admin guide (Rafael Wysocki).
 
  - Add kernel command line option to set pm_debug_messages (Chen Yu).
 
  - Choose schedutil as the preferred scaling governor by default on
    ARM big.LITTLE systems and on x86 systems using the intel_pstate
    driver in the passive mode (Linus Walleij, Rafael Wysocki).
 
  - Drop racy and redundant checks from the PM core's device_prepare()
    routine (Rafael Wysocki).
 
  - Make resume from hibernation take the hibernation_restore() return
    value into account (Dexuan Cui).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl6LN9kSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxqj0P/3Uh16YIKOUeXosqtptJJiWgZoWu82aR
 H2uzc9hCV6k/tnSG/Y3ZGgI4dUZGcJ94By80XoWceZ97CC1YuBHpF/dHSupTCcvt
 tBEP9H2qbSLi9d2PhlEUJ0B6PBpWEjCzOb519Tzj4sd67FGhsKeihXl+WXdgaaKZ
 hpdj+tBkQ6Dxb7CXVw/mUpWlg/nmF+yRqpK5By+V0WMOlQtg4Ecb3lMqaJPr7b4F
 W/I5m4PE9QG+VetSSXoe7CKfLo2fsDtyHH6ioTJ9xklcOflIblZ3MYf3sKgOuuny
 m/6Zm71HmQwWRkjA/V+C/eTr0VX2TuMUALfA8i+uZBo//I58TTJq9oqosu8eao2P
 5MR79Vwa+eqtYRgpHCTM4JO1iyKiE+FTfVAPOS4hHOKjqY9BUj0YtWHxk8abdzQd
 owv9DjEtGKcipAnCKtWDIS9ikZ5TOsYiGgWyJQnoonpFY+0s3DwtRdi+gBLJliSj
 5j7sRv9wJrJRFZ7tRMHHAqoQkSYqf4YdBuRBG7F/M44anveORBJNaTbPgm30qbUP
 FXD9hTJn/4Q2nmBSzqvEv0+TJrWoCyKduzr5uNsR0eyg48w9mrlUDDkW7dOMb3DU
 WV6QGNomMuKIY4DHKCo3/k21NmzW2JN7zhUrErCnl0bXjV2AM71Cb2xjUD1eNOqV
 9LxPDvAmZoGt
 =QFDr
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "Additional power management updates.

  These fix a corner-case suspend-to-idle wakeup issue on systems where
  the ACPI SCI is shared with another wakeup source, add a kernel
  command line option to set pm_debug_messages via the kernel command
  line, add a document desctibing system-wide suspend and resume code
  flows, modify cpufreq Kconfig to choose schedutil as the preferred
  governor by default in a couple of cases and do some assorted
  cleanups.

  Specifics:

   - Fix corner-case suspend-to-idle wakeup issue on systems where the
     ACPI SCI is shared with another wakeup source (Hans de Goede).

   - Add document describing system-wide suspend and resume code flows
     to the admin guide (Rafael Wysocki).

   - Add kernel command line option to set pm_debug_messages (Chen Yu).

   - Choose schedutil as the preferred scaling governor by default on
     ARM big.LITTLE systems and on x86 systems using the intel_pstate
     driver in the passive mode (Linus Walleij, Rafael Wysocki).

   - Drop racy and redundant checks from the PM core's device_prepare()
     routine (Rafael Wysocki).

   - Make resume from hibernation take the hibernation_restore() return
     value into account (Dexuan Cui)"

* tag 'pm-5.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  platform/x86: intel_int0002_vgpio: Use acpi_register_wakeup_handler()
  ACPI: PM: Add acpi_[un]register_wakeup_handler()
  Documentation: PM: sleep: Document system-wide suspend code flows
  cpufreq: Select schedutil when using big.LITTLE
  PM: sleep: Add pm_debug_messages kernel command line option
  PM: sleep: core: Drop racy and redundant checks from device_prepare()
  PM: hibernate: Propagate the return value of hibernation_restore()
  cpufreq: intel_pstate: Select schedutil as the default governor
2020-04-06 10:14:39 -07:00
Linus Torvalds
b6ff10700d \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl6LFJ0ACgkQnJ2qBz9k
 QNkSUQgAzwaescnHeVTF7/Zg9Uj2xrfrTJZ1E+Mn9qnd/0/z/asVV+RKfY7Gnu7h
 g19inDI4ZESFz2gWz4jwJD1c2/yMZb8vnae4ye3dtCv2yjG/0JxCeue6vjwsWqmO
 4jbSgk8YNQqzwEFVMzNp43ZJr3CFooLCIsJcL8q4yYk8Kt4pDUPmQ1vBvAc6k9vK
 BKMBvp926tbomP27nq0n0CjvHy7ipDGMl4H6i4vBxHRfbDPih2x9VEklK3JatC1n
 4AKS6IYJrkZVdOjli+DrResbcWxyT4db5tPio5MU0RDnVhNZT2cHyNVXf5EpRJqP
 72pa7gfPu1Rx1+tU8bDR/daSveou2A==
 =fkCV
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "This implements the fanotify FAN_DIR_MODIFY event.

  This event reports the name in a directory under which a change
  happened and together with the directory filehandle and fstatat()
  allows reliable and efficient implementation of directory
  synchronization"

* tag 'fsnotify_for_v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: Fix the checks in fanotify_fsid_equal
  fanotify: report name info for FAN_DIR_MODIFY event
  fanotify: record name info for FAN_DIR_MODIFY event
  fanotify: Drop fanotify_event_has_fid()
  fanotify: prepare to report both parent and child fid's
  fanotify: send FAN_DIR_MODIFY event flavor with dir inode and name
  fanotify: divorce fanotify_path_event and fanotify_fid_event
  fanotify: Store fanotify handles differently
  fanotify: Simplify create_fd()
  fanotify: fix merging marks masks with FAN_ONDIR
  fanotify: merge duplicate events on parent and child
  fsnotify: replace inode pointer with an object id
  fsnotify: simplify arguments passing to fsnotify_parent()
  fsnotify: use helpers to access data by data_type
  fsnotify: funnel all dirent events through fsnotify_name()
  fsnotify: factor helpers fsnotify_dentry() and fsnotify_file()
  fsnotify: tidy up FS_ and FAN_ constants
2020-04-06 08:58:42 -07:00
Paul E. McKenney
bf37da98c5 rcu: Don't acquire lock in NMI handler in rcu_nmi_enter_common()
The rcu_nmi_enter_common() function can be invoked both in interrupt
and NMI handlers.  If it is invoked from process context (as opposed
to userspace or idle context) on a nohz_full CPU, it might acquire the
CPU's leaf rcu_node structure's ->lock.  Because this lock is held only
with interrupts disabled, this is safe from an interrupt handler, but
doing so from an NMI handler can result in self-deadlock.

This commit therefore adds "irq" to the "if" condition so as to only
acquire the ->lock from irq handlers or process context, never from
an NMI handler.

Fixes: 5b14557b07 ("rcu: Avoid tick_dep_set_cpu() misordering")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org> # 5.5.x
2020-04-05 14:22:15 -07:00
Linus Torvalds
c48b07226b perf updates all over the place:
core:
 
    - Support for cgroup tracking in samples to allow cgroup based
      analysis
 
  tools:
 
    - Support for cgroup analysis
 
    - Commandline option and hotkey for perf top to change the sort order
 
    - A set of fixes all over the place
 
    - Various build system related improvements
 
    - Updates of the X86 pmu event JSON data
 
    - Documentation updates
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6J2iATHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXMEEACy6WaNabX7EzkLUnX0WCgxnZVSryNR
 4EnxLJSg5lChEe4q2mOS3mRMpzlXHQieWcFxlwda7kjIbFlgQvjJQUiYlAvxRRO/
 giY/GwCtTi/Flcb+7wKxTgMYmtAUOZDWeQbBZUlFLi9vyeCHVkjget9EyVsgbe/W
 EmRsrPuKOVMUTeEwm3zpIE051DObpiWLNge++My70q5W/yNsS94PbNydgKO7osqk
 pX37YVyBFpI2IQxMGzaE3yK7OxXRjYljZaz1tONFDMEYOX9gmxpDsCCflsP1ZOzL
 4/P4faRvAOElwxtYBelKmRl8eboqhRpTEK0Et0TI0LYbUZrE2nisDi0LTKPWQb0k
 Om2Qi6AfZs67PVzx9htlx6rfee72+sUluz5BDKOGH0pNJ8CFy70ns8InLsZqbBZ7
 SgFVNjx6bHxB58VuVE9WEzr/KVs6zI/SuJlH7WG7FLXbm5j0cETfCjg40JlDpSNh
 CZs+Epky1zyytrVJ9Gc7KnRlw8VB2eWEQ2cQ0sqj5w7WxhfrnsCmQf1zk4sofhOF
 iz3Dvb9Llz/pYWGZbEiQAuI+8bo0psJptidzpmpbIXs/woKDhW49w1FxqowJQMWe
 +lGWcauSfo3gjgEnTkOWAx3yiH4i9rvRChX8Jh0Z07d6Kwf19YYfxcuhlkx0Wutj
 eKaaErWDtWAxpA==
 =2UUD
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull more perf updates from Thomas Gleixner:
 "Perf updates all over the place:

  core:

   - Support for cgroup tracking in samples to allow cgroup based
     analysis

  tools:

   - Support for cgroup analysis

   - Commandline option and hotkey for perf top to change the sort order

   - A set of fixes all over the place

   - Various build system related improvements

   - Updates of the X86 pmu event JSON data

   - Documentation updates"

* tag 'perf-urgent-2020-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
  perf python: Fix clang detection to strip out options passed in $CC
  perf tools: Support Python 3.8+ in Makefile
  perf script: Fix invalid read of directory entry after closedir()
  perf script report: Fix SEGFAULT when using DWARF mode
  perf script: add -S/--symbols documentation
  perf pmu-events x86: Use CPU_CLK_UNHALTED.THREAD in Kernel_Utilization metric
  perf events parser: Add missing Intel CPU events to parser
  perf script: Allow --symbol to accept hexadecimal addresses
  perf report/top TUI: Fix title line formatting
  perf top: Support hotkey to change sort order
  perf top: Support --group-sort-idx to change the sort order
  perf symbols: Fix arm64 gap between kernel start and module end
  perf build-test: Honour JOBS to override detection of number of cores
  perf script: Add --show-cgroup-events option
  perf top: Add --all-cgroups option
  perf record: Add --all-cgroups option
  perf record: Support synthesizing cgroup events
  perf report: Add 'cgroup' sort key
  perf cgroup: Maintain cgroup hierarchy
  perf tools: Basic support for CGROUP event
  ...
2020-04-05 12:26:24 -07:00
Linus Torvalds
d5ca32738f Two timer subsystem fixes:
- Prevent a use after free in the new lockdep state tracking for hrtimers
 
     - Add missing parenthesis in the VF pit timer driver
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6J2pgTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWM/D/4qlP3NvsRf/4dIqcmfOtaGmMC1yqug
 4YbA6jQ2iecmgGprqN7JIKCHRgRyP2d72Ue6dmDKA8eOCcLmUsWy6dz5A+Ufi5wT
 S9S9dWctZSkmXSdEWkkBMHaefNFUNOTc16q4c4BFXomZzE4QZs+KjoVjJZBDtIqw
 A/9rmZKcBKxMpbuorE7zs6cRzsfvmiothXI+R78WMRbI+Yy3JAIuf3+uR1h7tXSi
 M8BNTTGn9U+Rnos/MFK5p136mwd5DHbCrX2G5KoYaox2CFGQ3+SvFGW9DWR38OTz
 IDP/RmH02s2AI0MNsQxrFFCQIpCentUEHWV5x5gjsw6DrHI23Xc98xbNdz3c9S+n
 WZMn63jvGr2XuH8XWb9tS72Zdp9VyKzubQ04xOEswvZg2KQuSntbUjq8RIEwSTMb
 xC82sJVXf20RX4iPsHcPKqPAWTgKRjNBuZbzxjRWjS/Ijtbdt8/GP/q9nZ3EKRvb
 6k+bRWS6fbdRflhptCp9YmczE3/WX6SH02d0+m46x88SPzbxkg7sCMKjeJddiqXW
 XO2fRedYlbQXmGdUbvFrssRxLuGin4rYMAtZbO43t7uIf8KizPRE8EdUIaRpcYsS
 QftCReyHa1lu4+yCdknBIJ6eadzeeKaJed8FJLTp1KtPOQu68WryN1qG9Isa5o4R
 xcTq0+KI56XoKA==
 =y3EI
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2020-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Thomas Gleixner:
 "Two timer subsystem fixes:

   - Prevent a use after free in the new lockdep state tracking for
     hrtimers

   - Add missing parenthesis in the VF pit timer driver"

* tag 'timers-urgent-2020-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers/timer-vf-pit: Add missing parenthesis
  hrtimer: Don't dereference the hrtimer pointer after the callback
2020-04-05 12:06:51 -07:00
Linus Torvalds
aa1a8ce533 New tracing features:
- The ring buffer is no longer disabled when reading the trace file.
    The trace_pipe file was made to be used for live tracing and reading
    as it acted like the normal producer/consumer. As the trace file
    would not consume the data, the easy way of handling it was to just
    disable writes to the ring buffer. This came to a surprise to the
    BPF folks who complained about lost events due to reading.
    This is no longer an issue. If someone wants to keep the old disabling
    there's a new option "pause-on-trace" that can be set.
 
  - New set_ftrace_notrace_pid file. PIDs in this file will not be traced
    by the function tracer. Similar to set_ftrace_pid, which makes the
    function tracer only trace those tasks with PIDs in the file, the
    set_ftrace_notrace_pid does the reverse.
 
  - New set_event_notrace_pid file. PIDs in this file will cause events
    not to be traced if triggered by a task with a matching PID.
    Similar to the set_event_pid file but will not be traced.
    Note, sched_waking and sched_switch events may still be trace if
    one of the tasks referenced by those events contains a PID that
    is allowed to be traced.
 
 Tracing related features:
 
  - New bootconfig option, that is attached to the initrd file.
    If bootconfig is on the command line, then the initrd file
    is searched looking for a bootconfig appended at the end.
 
  - New GPU tracepoint infrastructure to help the gfx drivers to get
    off debugfs (acked by Greg Kroah-Hartman)
 
 Other minor updates and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXokgWRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgrHAP0UkKs/52JY4oWa3OIh/OqK+vnCrIwz
 zGvDFOYM0fKbwgD9FZWgzlcaYK5m2Cxlhp4VoraZveHMLJUhnEHtdX6X0wk=
 =Rebj
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "New tracing features:

   - The ring buffer is no longer disabled when reading the trace file.

     The trace_pipe file was made to be used for live tracing and
     reading as it acted like the normal producer/consumer. As the trace
     file would not consume the data, the easy way of handling it was to
     just disable writes to the ring buffer.

     This came to a surprise to the BPF folks who complained about lost
     events due to reading. This is no longer an issue. If someone wants
     to keep the old disabling there's a new option "pause-on-trace"
     that can be set.

   - New set_ftrace_notrace_pid file. PIDs in this file will not be
     traced by the function tracer.

     Similar to set_ftrace_pid, which makes the function tracer only
     trace those tasks with PIDs in the file, the set_ftrace_notrace_pid
     does the reverse.

   - New set_event_notrace_pid file. PIDs in this file will cause events
     not to be traced if triggered by a task with a matching PID.

     Similar to the set_event_pid file but will not be traced. Note,
     sched_waking and sched_switch events may still be traced if one of
     the tasks referenced by those events contains a PID that is allowed
     to be traced.

  Tracing related features:

   - New bootconfig option, that is attached to the initrd file.

     If bootconfig is on the command line, then the initrd file is
     searched looking for a bootconfig appended at the end.

   - New GPU tracepoint infrastructure to help the gfx drivers to get
     off debugfs (acked by Greg Kroah-Hartman)

  And other minor updates and fixes"

* tag 'trace-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
  tracing: Do not allocate buffer in trace_find_next_entry() in atomic
  tracing: Add documentation on set_ftrace_notrace_pid and set_event_notrace_pid
  selftests/ftrace: Add test to test new set_event_notrace_pid file
  selftests/ftrace: Add test to test new set_ftrace_notrace_pid file
  tracing: Create set_event_notrace_pid to not trace tasks
  ftrace: Create set_ftrace_notrace_pid to not trace tasks
  ftrace: Make function trace pid filtering a bit more exact
  ftrace/kprobe: Show the maxactive number on kprobe_events
  tracing: Have the document reflect that the trace file keeps tracing enabled
  ring-buffer/tracing: Have iterator acknowledge dropped events
  tracing: Do not disable tracing when reading the trace file
  ring-buffer: Do not disable recording when there is an iterator
  ring-buffer: Make resize disable per cpu buffer instead of total buffer
  ring-buffer: Optimize rb_iter_head_event()
  ring-buffer: Do not die if rb_iter_peek() fails more than thrice
  ring-buffer: Have rb_iter_head_event() handle concurrent writer
  ring-buffer: Add page_stamp to iterator for synchronization
  ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance()
  ring-buffer: Have ring_buffer_empty() not depend on tracing stopped
  tracing: Save off entry when peeking at next entry
  ...
2020-04-05 10:36:18 -07:00
Linus Torvalds
6f43bae382 dma-mapping updates for 5.7
- fix an integer overflow in the coherent pool (Kevin Grandemange)
  - provide support for in-place uncached remapping and use that
    for openrisc
  - fix the arm coherent allocator to take the bus limit into account
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl6IL38LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYN8Pg/9FerEL9GoU7lSeXvvklAED5Ro3JVyaZdA4eQaYzG6
 PnUYs+lFAsKGoE8qWbvue+pJs1JZtZ+oSp9uVrLGbPaMS+71iy/hCS+Cv9Ym0Y0u
 lciBlFEkN9qM1srBMn4vdBo4gIyailcylznQxzw/ZjU90vU4uUJj37YrfGxQnOd8
 pHrD24wutwECksEp6nLx3/Yt2xW92j9/oH+FnEK5mfaA0ATAQMz51L9veyU6liV4
 1A8jbi0diskAIqn4uyO5SuVg7s7C90HOe3JUBtk+oZvUXFlr9WvrDjCqBp6rcSiH
 XS+Z2RBomMSrjEHOfETcFu8JSyjY1eKu/a1rvEbmUc6bE/gKrEGgPiQwSLa+1Aty
 qy3I24uSF7xcs5yngnjzIQ/BizKFk/wzja15c4sfUNKiXLI6FqwwHL34Dg+Nv7UG
 A/eCXePzOGPVANcIU0Zh68epEfCJRqJtqy2BDrWisqRfhxd3rRgl9gNeS1JwR0El
 9T5c+CKfXn1IVA3YhMABUYh1JJ9bXrlZIOd3PEPwvwYRBnIYxP6JK2R+4BYjsMHy
 Y90QyAUUsJKMWYq4p4EpSCUGSlnzl9I2QH3ItUHvo+T9NcT6Vo4J6tCTQZu5tUGM
 SPiV49Gxz3u2+5VBmolWixO6JpRBv+92gowWdxRULkFpMaOw8mPInWW5cWPWc2MY
 u/Y=
 =DrK0
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.7' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping updates from Christoph Hellwig:

 - fix an integer overflow in the coherent pool (Kevin Grandemange)

 - provide support for in-place uncached remapping and use that for
   openrisc

 - fix the arm coherent allocator to take the bus limit into account

* tag 'dma-mapping-5.7' of git://git.infradead.org/users/hch/dma-mapping:
  ARM/dma-mapping: merge __dma_supported into arm_dma_supported
  ARM/dma-mapping: take the bus limit into account in __dma_alloc
  ARM/dma-mapping: remove get_coherent_dma_mask
  openrisc: use the generic in-place uncached DMA allocator
  dma-direct: provide a arch_dma_clear_uncached hook
  dma-direct: make uncached_kernel_address more general
  dma-direct: consolidate the error handling in dma_direct_alloc_pages
  dma-direct: remove the cached_kernel_address hook
  dma-coherent: fix integer overflow in the reserved-memory dma allocation
2020-04-04 10:12:47 -07:00
Linus Torvalds
ad0bf4eb91 s390 updates for the 5.7 merge window
- Update maintainers. Niklas Schnelle takes over zpci and Vineeth Vijayan
   common io code.
 
 - Extend cpuinfo to include topology information.
 
 - Add new extended counters for IBM z15 and sampling buffer allocation
   rework in perf code.
 
 - Add control over zeroing out memory during system restart.
 
 - CCA protected key block version 2 support and other fixes/improvements
   in crypto code.
 
 - Convert to new fallthrough; annotations.
 
 - Replace zero-length arrays with flexible-arrays.
 
 - QDIO debugfs and other small improvements.
 
 - Drop 2-level paging support optimization for compat tasks. Varios
   mm cleanups.
 
 - Remove broken and unused hibernate / power management support.
 
 - Remove fake numa support which does not bring any benefits.
 
 - Exclude offline CPUs from CPU topology masks to be more consistent
   with other architectures.
 
 - Prevent last branching instruction address leaking to userspace.
 
 - Other small various fixes and improvements all over the code.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAl6Ig2YACgkQjYWKoQLX
 FBj2gggAibnHOl9d0ngX1mVT4nz51R3V8z5sEQjNMr2uHBmaTqs7pi/00gaFMxoC
 NngVEXvL443jSogQivthGgXPpRCV9xdKE3sp38j7fF4LgHoeuDtGd1oaX4W9Rqk0
 7Yii35EaO2e2WHdOKaAbu+ZvDRunFjERyntc51MYaIUivFosogSo07vC73vFIArF
 VGStS09fJ4Ny76ott896T7Ulx1Iek/MkF1vponEMLGNUIcLIQbbxZxOwgz0pHuEF
 SlyyJBnhOIaAJGOYlKREQDt1cew+hsxluPU+a01bwdsmdZv9LH1BGwLayDqTH58i
 QWvtEpzJFmDvo9jGM1v81ebaGnyCKg==
 =hiGF
 -----END PGP SIGNATURE-----

Merge tag 's390-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Update maintainers. Niklas Schnelle takes over zpci and Vineeth
   Vijayan common io code.

 - Extend cpuinfo to include topology information.

 - Add new extended counters for IBM z15 and sampling buffer allocation
   rework in perf code.

 - Add control over zeroing out memory during system restart.

 - CCA protected key block version 2 support and other
   fixes/improvements in crypto code.

 - Convert to new fallthrough; annotations.

 - Replace zero-length arrays with flexible-arrays.

 - QDIO debugfs and other small improvements.

 - Drop 2-level paging support optimization for compat tasks. Varios mm
   cleanups.

 - Remove broken and unused hibernate / power management support.

 - Remove fake numa support which does not bring any benefits.

 - Exclude offline CPUs from CPU topology masks to be more consistent
   with other architectures.

 - Prevent last branching instruction address leaking to userspace.

 - Other small various fixes and improvements all over the code.

* tag 's390-5.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (57 commits)
  s390/mm: cleanup init_new_context() callback
  s390/mm: cleanup virtual memory constants usage
  s390/mm: remove page table downgrade support
  s390/qdio: set qdio_irq->cdev at allocation time
  s390/qdio: remove unused function declarations
  s390/ccwgroup: remove pm support
  s390/ap: remove power management code from ap bus and drivers
  s390/zcrypt: use kvmalloc instead of kmalloc for 256k alloc
  s390/mm: cleanup arch_get_unmapped_area() and friends
  s390/ism: remove pm support
  s390/cio: use fallthrough;
  s390/vfio: use fallthrough;
  s390/zcrypt: use fallthrough;
  s390: use fallthrough;
  s390/cpum_sf: Fix wrong page count in error message
  s390/diag: fix display of diagnose call statistics
  s390/ap: Remove ap device suspend and resume callbacks
  s390/pci: Improve handling of unset UID
  s390/pci: Fix zpci_alloc_domain() over allocation
  s390/qdio: pass ISC as parameter to chsc_sadc()
  ...
2020-04-04 09:45:50 -07:00
Ingo Molnar
7dc41b9b99 perf/urgent fixes and improvements:
perf python:
 
   Arnaldo Carvalho de Melo:
 
   - Fix clang detection to strip out options passed in $CC.
 
 build:
 
   He Zhe:
 
   - Normalize gcc parameter when generating arch errno table, fixing
     the build by removing options from $(CC).
 
   Sam Lunt:
 
   - Support Python 3.8+ in Makefile.
 
 perf report/top:
 
   Arnaldo Carvalho de Melo:
 
   - Fix title line formatting.
 
 perf script:
 
   Andreas Gerstmayr:
 
   - Fix SEGFAULT when using DWARF mode.
 
   - Fix invalid read of directory entry after closedir(), found with valgrind.
 
   Hagen Paul Pfeifer:
 
   - Introduce --deltatime option.
 
   Stephane Eranian:
 
   - Allow --symbol to accept hexadecimal addresses.
 
   Ian Rogers:
 
   - Add -S/--symbols documentation
 
   Namhyung Kim:
 
   - Add --show-cgroup-events option.
 
 perf python:
 
   Arnaldo Carvalho de Melo:
 
   - Include rwsem.c in the python binding, needed by the cgroups improvements.
 
 build-test:
 
   Arnaldo Carvalho de Melo:
 
   - Honour JOBS to override detection of number of cores
 
 perf top:
 
   Jin Yao:
 
   - Support --group-sort-idx to change the sort order
 
   - perf top: Support hotkey to change sort order
 
 perf pmu-events x86:
 
   Jin Yao:
 
   - Use CPU_CLK_UNHALTED.THREAD in Kernel_Utilization metric
 
 perf symbols arm64:
 
   Kemeng Shi:
 
   - Fix arm64 gap between kernel start and module end
 
 kernel perf subsystem:
 
   Namhyung Kim:
 
   - Add PERF_RECORD_CGROUP event and Add PERF_SAMPLE_CGROUP feature,
     to allow cgroup tracking, saving a link between cgroup path and
     its id number.
 
 perf cgroup:
 
   Namhyung Kim:
 
   - Maintain cgroup hierarchy.
 
 perf report:
 
   Namhyung Kim:
 
   - Add 'cgroup' sort key.
 
 perf record:
 
   Namhyung Kim:
 
   - Support synthesizing cgroup events for pre-existing cgroups.
 
   - Add --all-cgroups option
 
 Documentation:
 
   Tony Jones:
 
   - Update docs regarding kernel/user space unwinding.
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXodMtwAKCRCyPKLppCJ+
 J+leAP0Ws0dGzMSIwcMVc7zvK1IsYOTlZ8lYXJePxD+Po/YPdAEA6Squf4gwZ2wm
 b9R7w50dlCkMJ9LaueCeZZjh/4asFwQ=
 =auOb
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-for-mingo-5.7-20200403' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/urgent fixes and improvements from Arnaldo Carvalho de Melo:

perf python:

  Arnaldo Carvalho de Melo:

  - Fix clang detection to strip out options passed in $CC.

build:

  He Zhe:

  - Normalize gcc parameter when generating arch errno table, fixing
    the build by removing options from $(CC).

  Sam Lunt:

  - Support Python 3.8+ in Makefile.

perf report/top:

  Arnaldo Carvalho de Melo:

  - Fix title line formatting.

perf script:

  Andreas Gerstmayr:

  - Fix SEGFAULT when using DWARF mode.

  - Fix invalid read of directory entry after closedir(), found with valgrind.

  Hagen Paul Pfeifer:

  - Introduce --deltatime option.

  Stephane Eranian:

  - Allow --symbol to accept hexadecimal addresses.

  Ian Rogers:

  - Add -S/--symbols documentation

  Namhyung Kim:

  - Add --show-cgroup-events option.

perf python:

  Arnaldo Carvalho de Melo:

  - Include rwsem.c in the python binding, needed by the cgroups improvements.

build-test:

  Arnaldo Carvalho de Melo:

  - Honour JOBS to override detection of number of cores

perf top:

  Jin Yao:

  - Support --group-sort-idx to change the sort order

  - perf top: Support hotkey to change sort order

perf pmu-events x86:

  Jin Yao:

  - Use CPU_CLK_UNHALTED.THREAD in Kernel_Utilization metric

perf symbols arm64:

  Kemeng Shi:

  - Fix arm64 gap between kernel start and module end

kernel perf subsystem:

  Namhyung Kim:

  - Add PERF_RECORD_CGROUP event and Add PERF_SAMPLE_CGROUP feature,
    to allow cgroup tracking, saving a link between cgroup path and
    its id number.

perf cgroup:

  Namhyung Kim:

  - Maintain cgroup hierarchy.

perf report:

  Namhyung Kim:

  - Add 'cgroup' sort key.

perf record:

  Namhyung Kim:

  - Support synthesizing cgroup events for pre-existing cgroups.

  - Add --all-cgroups option

Documentation:

  Tony Jones:

  - Update docs regarding kernel/user space unwinding.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-04-04 10:35:15 +02:00
Linus Torvalds
0ad5b053d4 Char/Misc driver patches for 5.7-rc1
Here is the big set of char/misc/other driver patches for 5.7-rc1.
 
 Lots of things in here, and it's later than expected due to some reverts
 to resolve some reported issues.  All is now clean with no reported
 problems in linux-next.
 
 Included in here is:
 	- interconnect updates
 	- mei driver updates
 	- uio updates
 	- nvmem driver updates
 	- soundwire updates
 	- binderfs updates
 	- coresight updates
 	- habanalabs updates
 	- mhi new bus type and core
 	- extcon driver updates
 	- some Kconfig cleanups
 	- other small misc driver cleanups and updates
 
 As mentioned, all have been in linux-next for a while, and with the last
 two reverts, all is calm and good.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXodfvA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynzCQCfROhar3E8EhYEqSOP6xq6uhX9uegAnRgGY2rs
 rN4JJpOcTddvZcVlD+vo
 =ocWk
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big set of char/misc/other driver patches for 5.7-rc1.

  Lots of things in here, and it's later than expected due to some
  reverts to resolve some reported issues. All is now clean with no
  reported problems in linux-next.

  Included in here is:
   - interconnect updates
   - mei driver updates
   - uio updates
   - nvmem driver updates
   - soundwire updates
   - binderfs updates
   - coresight updates
   - habanalabs updates
   - mhi new bus type and core
   - extcon driver updates
   - some Kconfig cleanups
   - other small misc driver cleanups and updates

  As mentioned, all have been in linux-next for a while, and with the
  last two reverts, all is calm and good"

* tag 'char-misc-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (174 commits)
  Revert "driver core: platform: Initialize dma_parms for platform devices"
  Revert "amba: Initialize dma_parms for amba devices"
  amba: Initialize dma_parms for amba devices
  driver core: platform: Initialize dma_parms for platform devices
  bus: mhi: core: Drop the references to mhi_dev in mhi_destroy_device()
  bus: mhi: core: Initialize bhie field in mhi_cntrl for RDDM capture
  bus: mhi: core: Add support for reading MHI info from device
  misc: rtsx: set correct pcr_ops for rts522A
  speakup: misc: Use dynamic minor numbers for speakup devices
  mei: me: add cedar fork device ids
  coresight: do not use the BIT() macro in the UAPI header
  Documentation: provide IBM contacts for embargoed hardware
  nvmem: core: remove nvmem_sysfs_get_groups()
  nvmem: core: use is_bin_visible for permissions
  nvmem: core: use device_register and device_unregister
  nvmem: core: add root_only member to nvmem device struct
  extcon: axp288: Add wakeup support
  extcon: Mark extcon_get_edev_name() function as exported symbol
  extcon: palmas: Hide error messages if gpio returns -EPROBE_DEFER
  dt-bindings: extcon: usbc-cros-ec: convert extcon-usbc-cros-ec.txt to yaml format
  ...
2020-04-03 13:22:40 -07:00
Linus Torvalds
ff2ae607c6 SPDX patches for 5.7-rc1.
Here are 3 SPDX patches for 5.7-rc1.
 
 One fixes up the SPDX tag for a single driver, while the other two go
 through the tree and add SPDX tags for all of the .gitignore files as
 needed.
 
 Nothing too complex, but you will get a merge conflict with your current
 tree, that should be trivial to handle (one file modified by two things,
 one file deleted.)
 
 All 3 of these have been in linux-next for a while, with no reported
 issues other than the merge conflict.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXodg5A8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykySQCgy9YDrkz7nWq6v3Gohl6+lW/L+rMAnRM4uTZm
 m5AuCzO3Azt9KBi7NL+L
 =2Lm5
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull SPDX updates from Greg KH:
 "Here are three SPDX patches for 5.7-rc1.

  One fixes up the SPDX tag for a single driver, while the other two go
  through the tree and add SPDX tags for all of the .gitignore files as
  needed.

  Nothing too complex, but you will get a merge conflict with your
  current tree, that should be trivial to handle (one file modified by
  two things, one file deleted.)

  All three of these have been in linux-next for a while, with no
  reported issues other than the merge conflict"

* tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
  ASoC: MT6660: make spdxcheck.py happy
  .gitignore: add SPDX License Identifier
  .gitignore: remove too obvious comments
2020-04-03 13:12:26 -07:00
Linus Torvalds
0adb8bc039 Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Nothing too interesting. Just two trivial patches"

* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Mark up unlocked access to wq->first_flusher
  workqueue: Make workqueue_init*() return void
2020-04-03 12:27:36 -07:00
Linus Torvalds
d883600523 Merge branch 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Christian extended clone3 so that processes can be spawned into
   cgroups directly.

   This is not only neat in terms of semantics but also avoids grabbing
   the global cgroup_threadgroup_rwsem for migration.

 - Daniel added !root xattr support to cgroupfs.

   Userland already uses xattrs on cgroupfs for bookkeeping. This will
   allow delegated cgroups to support such usages.

 - Prateek tried to make cpuset hotplug handling synchronous but that
   led to possible deadlock scenarios. Reverted.

 - Other minor changes including release_agent_path handling cleanup.

* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: Document the cpuset_v2_mode mount option
  Revert "cpuset: Make cpuset hotplug synchronous"
  cgroupfs: Support user xattrs
  kernfs: Add option to enable user xattrs
  kernfs: Add removed_size out param for simple_xattr_set
  kernfs: kvmalloc xattr value instead of kmalloc
  cgroup: Restructure release_agent_path handling
  selftests/cgroup: add tests for cloning into cgroups
  clone3: allow spawning processes into cgroups
  cgroup: add cgroup_may_write() helper
  cgroup: refactor fork helpers
  cgroup: add cgroup_get_from_file() helper
  cgroup: unify attach permission checking
  cpuset: Make cpuset hotplug synchronous
  cgroup.c: Use built-in RCU list checking
  kselftest/cgroup: add cgroup destruction test
  cgroup: Clean up css_set task traversal
2020-04-03 11:30:20 -07:00
Linus Torvalds
f2c3bec3c9 kgdb patches for 5.7-rc1
Pretty quiet this cycle. Just a couple of small fixes from
 myself both of which were reviewed by Doug Anderson to keep
 me honest (thanks).
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl6G+f0ACgkQfOMlXTn3
 iKENkw/7BnSCtxo3ZvcaFGmoMIuxVfaChOO5rpioKh9GpBzNhSSa4GTdQcw8revI
 czdVlz8aihleqJHiwGcNFhglC/HkllMwrZ5Os7hJ3kJaVF2p4rrWQp9Vu9hjB8WA
 z3g7SuScBqEnrjByuxr1nc1z68qibLiQt+WB4evpBbGN00rTk9r2t6sLtzfGl6cY
 RrYeInCnnS6N5/kGTxKijvedriMCvVu0KVy9wvRg3GctLbFDpp/yrU6OesGM8cFM
 CYr5e3Qn2HNloYvxLq44gGBL24GO+BNp6qr+lqcoHE/jCbq1/Yr7yqcNMjoQLZs2
 ClqzXcusVeqdrxzROw2sAf6Xo2CEzboZyI/6xs9eGj+haTR2ebcvQcBG30q/i6lb
 7nA6BavBcX8freDLy41JqemVZpVJIGlhI9ArSk+7lxSjbnsRC1aO97s4MwfIvggz
 SeKT8rPTNaSdXwOzZbbnIWrOP5P1hzZDUsSwjW3fvcYOZGVTprna7mYcsuP4CzVM
 GRvHW70cgMHDAG/+BcVnng9ML+vPf3TNFbGQYaqX5WoPX1JkuTVwYitT/cAR8ARL
 a6qusHlms8eSVDIa4mfmLpv6YUSFUxDhKJcAafd0nJ01+a941v32rJfutnSEm16W
 H2Qm64lihxQs5tCEZQ455q87QBjLj5zJznEadyWmWbm+VKnqqdY=
 =ulBe
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "Pretty quiet this cycle. Just a couple of small fixes from myself both
  of which were reviewed by Doug Anderson to keep me honest (thanks)"

* tag 'kgdb-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Censor attempts to set PROMPT without ENABLE_MEM_READ
  kdb: Eliminate strncpy() warnings by replacing with strscpy()
2020-04-03 11:26:32 -07:00
Waiman Long
0c05b9bdbf docs: cgroup-v1: Document the cpuset_v2_mode mount option
The cpuset in cgroup v1 accepts a special "cpuset_v2_mode" mount
option that make cpuset.cpus and cpuset.mems behave more like those in
cgroup v2.  Document it to make other people more aware of this feature
that can be useful in some circumstances.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-04-03 11:42:56 -04:00
Tejun Heo
2b729fe7f3 Revert "cpuset: Make cpuset hotplug synchronous"
This reverts commit a49e4629b5 ("cpuset: Make cpuset hotplug synchronous") as
it may deadlock with cpu hotplug path.

Link: http://lkml.kernel.org/r/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Qian Cai <cai@lca.pw>
Cc: Prateek Sood <prsood@codeaurora.org>
2020-04-03 11:32:13 -04:00
Steven Rostedt (VMware)
8e99cf91b9 tracing: Do not allocate buffer in trace_find_next_entry() in atomic
When dumping out the trace data in latency format, a check is made to peek
at the next event to compare its timestamp to the current one, and if the
delta is of a greater size, it will add a marker showing so. But to do this,
it needs to save the current event otherwise peeking at the next event will
remove the current event. To save the event, a temp buffer is used, and if
the event is bigger than the temp buffer, the temp buffer is freed and a
bigger buffer is allocated.

This allocation is a problem when called in atomic context. The only way
this gets called via atomic context is via ftrace_dump(). Thus, use a static
buffer of 128 bytes (which covers most events), and if the event is bigger
than that, simply return NULL. The callers of trace_find_next_entry() need
to handle a NULL case, as that's what would happen if the allocation failed.

Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian

Fixes: ff895103a8 ("tracing: Save off entry when peeking at next entry")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-04-03 11:30:50 -04:00
Linus Torvalds
6cad420cc6 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A large amount of MM, plenty more to come.

  Subsystems affected by this patch series:
   - tools
   - kthread
   - kbuild
   - scripts
   - ocfs2
   - vfs
   - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
         sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
         hugetlbfs, hugetlb"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
  include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
  mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
  selftests/vm: fix map_hugetlb length used for testing read and write
  mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
  mm/hugetlb.c: clean code by removing unnecessary initialization
  hugetlb_cgroup: add hugetlb_cgroup reservation docs
  hugetlb_cgroup: add hugetlb_cgroup reservation tests
  hugetlb: support file_region coalescing again
  hugetlb_cgroup: support noreserve mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add reservation accounting for private mappings
  mm/hugetlb_cgroup: fix hugetlb_cgroup migration
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add hugetlb_cgroup reservation counter
  hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  mm/memblock.c: remove redundant assignment to variable max_addr
  mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
  mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
  ...
2020-04-02 13:55:34 -07:00
Linus Torvalds
d987ca1c6b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull exec/proc updates from Eric Biederman:
 "This contains two significant pieces of work: the work to sort out
  proc_flush_task, and the work to solve a deadlock between strace and
  exec.

  Fixing proc_flush_task so that it no longer requires a persistent
  mount makes improvements to proc possible. The removal of the
  persistent mount solves an old regression that that caused the hidepid
  mount option to only work on remount not on mount. The regression was
  found and reported by the Android folks. This further allows Alexey
  Gladkov's work making proc mount options specific to an individual
  mount of proc to move forward.

  The work on exec starts solving a long standing issue with exec that
  it takes mutexes of blocking userspace applications, which makes exec
  extremely deadlock prone. For the moment this adds a second mutex with
  a narrower scope that handles all of the easy cases. Which makes the
  tricky cases easy to spot. With a little luck the code to solve those
  deadlocks will be ready by next merge window"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
  signal: Extend exec_id to 64bits
  pidfd: Use new infrastructure to fix deadlocks in execve
  perf: Use new infrastructure to fix deadlocks in execve
  proc: io_accounting: Use new infrastructure to fix deadlocks in execve
  proc: Use new infrastructure to fix deadlocks in execve
  kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
  kernel: doc: remove outdated comment cred.c
  mm: docs: Fix a comment in process_vm_rw_core
  selftests/ptrace: add test cases for dead-locks
  exec: Fix a deadlock in strace
  exec: Add exec_update_mutex to replace cred_guard_mutex
  exec: Move exec_mmap right after de_thread in flush_old_exec
  exec: Move cleanup of posix timers on exec out of de_thread
  exec: Factor unshare_sighand out of de_thread and call it separately
  exec: Only compute current once in flush_old_exec
  pid: Improve the comment about waiting in zap_pid_ns_processes
  proc: Remove the now unnecessary internal mount of proc
  uml: Create a private mount of proc for mconsole
  uml: Don't consult current to find the proc_mnt in mconsole_proc
  proc: Use a list of inodes to flush from proc
  ...
2020-04-02 11:22:17 -07:00
Sebastian Andrzej Siewior
6923aa0d8c mm/compaction: Disable compact_unevictable_allowed on RT
Since commit 5bbe3547aa ("mm: allow compaction of unevictable pages")
it is allowed to examine mlocked pages and compact them by default.  On
-RT even minor pagefaults are problematic because it may take a few 100us
to resolve them and until then the task is blocked.

Make compact_unevictable_allowed = 0 default and issue a warning on RT if
it is changed.

[bigeasy@linutronix.de: v5]
  Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
  Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/
Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:31 -07:00
Sebastian Andrzej Siewior
964b692daf mm/compaction: really limit compact_unevictable_allowed to 0 and 1
The proc file `compact_unevictable_allowed' should allow 0 and 1 only, the
`extra*' attribues have been set properly but without
proc_dointvec_minmax() as the `proc_handler' the limit will not be
enforced.

Use proc_dointvec_minmax() as the `proc_handler' to enfoce the valid
specified range.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:31 -07:00
Johannes Weiner
8a931f8013 mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says.  The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.

Consider memory in a data center host.  At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing.  Both branches are further subdivided into individual
services, job components etc.

We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other.  Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree.  Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general.  This
results in very inefficient resource distribution.

Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level.  Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads.  It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g.  rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.

It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.

To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.

That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now.  However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size.  Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree.  But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.

E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.

This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.

The floating protection composes naturally with fixed protection.
Consider the following example tree:

		A            A: low = 2G
               / \          A1: low = 1G
              A1 A2         A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.

There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree.  Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior.  Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:28 -07:00
Roman Gushchin
f4b00eab50 mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page()
to better reflect what they are actually doing:

1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge
   the current memcg

2) set or clear the PageKmemcg flag

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:28 -07:00
Chen Yu
db96a75946 PM: sleep: Add pm_debug_messages kernel command line option
Debug messages from the system suspend/hibernation infrastructure
are disabled by default, and can only be enabled after the system
has boot up via /sys/power/pm_debug_messages.

This makes the hibernation resume hard to track as it involves system
boot up across hibernation.  There's no chance for software_resume()
to track the resume process, for example.

Add a kernel command line option to set pm_debug_messages during
boot up.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-04-02 15:29:56 +02:00
Linus Torvalds
72f35423e8 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Fix out-of-sync IVs in self-test for IPsec AEAD algorithms

  Algorithms:
   - Use formally verified implementation of x86/curve25519

  Drivers:
   - Enhance hwrng support in caam

   - Use crypto_engine for skcipher/aead/rsa/hash in caam

   - Add Xilinx AES driver

   - Add uacce driver

   - Register zip engine to uacce in hisilicon

   - Add support for OCTEON TX CPT engine in marvell"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits)
  crypto: af_alg - bool type cosmetics
  crypto: arm[64]/poly1305 - add artifact to .gitignore files
  crypto: caam - limit single JD RNG output to maximum of 16 bytes
  crypto: caam - enable prediction resistance in HRWNG
  bus: fsl-mc: add api to retrieve mc version
  crypto: caam - invalidate entropy register during RNG initialization
  crypto: caam - check if RNG job failed
  crypto: caam - simplify RNG implementation
  crypto: caam - drop global context pointer and init_done
  crypto: caam - use struct hwrng's .init for initialization
  crypto: caam - allocate RNG instantiation descriptor with GFP_DMA
  crypto: ccree - remove duplicated include from cc_aead.c
  crypto: chelsio - remove set but not used variable 'adap'
  crypto: marvell - enable OcteonTX cpt options for build
  crypto: marvell - add the Virtual Function driver for CPT
  crypto: marvell - add support for OCTEON TX CPT engine
  crypto: marvell - create common Kconfig and Makefile for Marvell
  crypto: arm/neon - memzero_explicit aes-cbc key
  crypto: bcm - Use scnprintf() for avoiding potential buffer overflow
  crypto: atmel-i2c - Fix wakeup fail
  ...
2020-04-01 14:47:40 -07:00
Eric W. Biederman
d1e7fd6462 signal: Extend exec_id to 64bits
Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter.  With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent.  This
bypasses the signal sending checks if the parent changes their
credentials during exec.

The severity of this problem can been seen that in my limited testing
of a 32bit exec_id it can take as little as 19s to exec 65536 times.
Which means that it can take as little as 14 days to wrap a 32bit
exec_id.  Adam Zabrocki has succeeded wrapping the self_exe_id in 7
days.  Even my slower timing is in the uptime of a typical server.
Which means self_exec_id is simply a speed bump today, and if exec
gets noticably faster self_exec_id won't even be a speed bump.

Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions.  Which means that is is possible to hit
a window where the read value of exec_id does not match the written
value.  So with very lucky timing after this change this still
remains expoiltable.

I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.

Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
Fixes: 2.3.23pre2
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-04-01 12:04:24 -05:00
Daniel Thompson
ad99b5105c kdb: Censor attempts to set PROMPT without ENABLE_MEM_READ
Currently the PROMPT variable could be abused to provoke the printf()
machinery to read outside the current stack frame. Normally this
doesn't matter becaues md is already a much better tool for reading
from memory.

However the md command can be disabled by not setting KDB_ENABLE_MEM_READ.
Let's also prevent PROMPT from being modified in these circumstances.

Whilst adding a comment to help future code reviewers we also remove
the #ifdef where PROMPT in consumed. There is no problem passing an
unused (0) to snprintf when !CONFIG_SMP.
argument

Reported-by: Wang Xiayang <xywang.sjtu@sjtu.edu.cn>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
2020-04-01 16:59:11 +01:00
Daniel Thompson
d228bee820 kdb: Eliminate strncpy() warnings by replacing with strscpy()
Currently the code to manage the kdb history buffer uses strncpy() to
copy strings to/and from the history and exhibits the classic "but
nobody ever told me that strncpy() doesn't always terminate strings"
bug. Modern gcc compilers recognise this bug and issue a warning.

In reality these calls will only abridge the copied string if kdb_read()
has *already* overflowed the command buffer. Thus the use of counted
copies here is only used to reduce the secondary effects of a bug
elsewhere in the code.

Therefore transitioning these calls into strscpy() (without checking
the return code) is appropriate.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
2020-04-01 16:59:02 +01:00
Sebastian Andrzej Siewior
73d20564e0 hrtimer: Don't dereference the hrtimer pointer after the callback
A hrtimer can be released in its callback, but lockdep_hrtimer_exit()
dereferences the pointer after the callback returns, i.e. a potential use
after free.

Retrieve the context in which the hrtimer expires before the callback is
invoked and use it in lockdep_hrtimer_exit().

Fixes: 40db173965 ("lockdep: Add hrtimer context tracing bits")
Reported-by: syzbot+62c155c276e580cfb606@syzkaller.appspotmail.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200331201849.fkp2siy3vcdqvqlz@linutronix.de
2020-04-01 13:20:14 +02:00
Dexuan Cui
3704a6a445 PM: hibernate: Propagate the return value of hibernation_restore()
If hibernation_restore() fails, the 'error' should not be zero.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-04-01 11:36:11 +02:00
Linus Torvalds
29d9f30d4c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Fix the iwlwifi regression, from Johannes Berg.

   2) Support BSS coloring and 802.11 encapsulation offloading in
      hardware, from John Crispin.

   3) Fix some potential Spectre issues in qtnfmac, from Sergey
      Matyukevich.

   4) Add TTL decrement action to openvswitch, from Matteo Croce.

   5) Allow paralleization through flow_action setup by not taking the
      RTNL mutex, from Vlad Buslov.

   6) A lot of zero-length array to flexible-array conversions, from
      Gustavo A. R. Silva.

   7) Align XDP statistics names across several drivers for consistency,
      from Lorenzo Bianconi.

   8) Add various pieces of infrastructure for offloading conntrack, and
      make use of it in mlx5 driver, from Paul Blakey.

   9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.

  10) Lots of parallelization improvements during configuration changes
      in mlxsw driver, from Ido Schimmel.

  11) Add support to devlink for generic packet traps, which report
      packets dropped during ACL processing. And use them in mlxsw
      driver. From Jiri Pirko.

  12) Support bcmgenet on ACPI, from Jeremy Linton.

  13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
      Starovoitov, and your's truly.

  14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.

  15) Fix sysfs permissions when network devices change namespaces, from
      Christian Brauner.

  16) Add a flags element to ethtool_ops so that drivers can more simply
      indicate which coalescing parameters they actually support, and
      therefore the generic layer can validate the user's ethtool
      request. Use this in all drivers, from Jakub Kicinski.

  17) Offload FIFO qdisc in mlxsw, from Petr Machata.

  18) Support UDP sockets in sockmap, from Lorenz Bauer.

  19) Fix stretch ACK bugs in several TCP congestion control modules,
      from Pengcheng Yang.

  20) Support virtual functiosn in octeontx2 driver, from Tomasz
      Duszynski.

  21) Add region operations for devlink and use it in ice driver to dump
      NVM contents, from Jacob Keller.

  22) Add support for hw offload of MACSEC, from Antoine Tenart.

  23) Add support for BPF programs that can be attached to LSM hooks,
      from KP Singh.

  24) Support for multiple paths, path managers, and counters in MPTCP.
      From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
      and others.

  25) More progress on adding the netlink interface to ethtool, from
      Michal Kubecek"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
  net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
  cxgb4/chcr: nic-tls stats in ethtool
  net: dsa: fix oops while probing Marvell DSA switches
  net/bpfilter: remove superfluous testing message
  net: macb: Fix handling of fixed-link node
  net: dsa: ksz: Select KSZ protocol tag
  netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
  net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
  net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
  net: stmmac: create dwmac-intel.c to contain all Intel platform
  net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
  net: dsa: bcm_sf2: Add support for matching VLAN TCI
  net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
  net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
  net: dsa: bcm_sf2: Disable learning for ASP port
  net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
  net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
  net: dsa: b53: Restore VLAN entries upon (re)configuration
  net: dsa: bcm_sf2: Fix overflow checks
  hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
  ...
2020-03-31 17:29:33 -07:00
Linus Torvalds
1f944f976d TTY/Serial patches for 5.7-rc1
Here is the big set of TTY / Serial patches for 5.7-rc1
 
 Lots of console fixups and reworking in here, serial core tweaks
 (doesn't that ever get old, why are we still creating new serial
 devices?), serial driver updates, line-protocol driver updates, and some
 vt cleanups and fixes included in here as well.
 
 All have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXoHT8w8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yl3CwCgj/97IKb4K49nV2rDgiV+t/ELWqUAnjBp+Zpd
 H2BEdhwCFhq/5CJHKXWH
 =JTm1
 -----END PGP SIGNATURE-----

Merge tag 'tty-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial updates from Greg KH:
 "Here is the big set of TTY / Serial patches for 5.7-rc1

  Lots of console fixups and reworking in here, serial core tweaks
  (doesn't that ever get old, why are we still creating new serial
  devices?), serial driver updates, line-protocol driver updates, and
  some vt cleanups and fixes included in here as well.

  All have been in linux-next with no reported issues"

* tag 'tty-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (161 commits)
  serial: 8250: Optimize irq enable after console write
  serial: 8250: Fix rs485 delay after console write
  vt: vt_ioctl: fix use-after-free in vt_in_use()
  vt: vt_ioctl: fix VT_DISALLOCATE freeing in-use virtual console
  tty: serial: make SERIAL_SPRD depend on COMMON_CLK
  tty: serial: fsl_lpuart: fix return value checking
  tty: serial: fsl_lpuart: move dma_request_chan()
  ARM: dts: tango4: Make /serial compatible with ns16550a
  ARM: dts: mmp*: Make the serial ports compatible with xscale-uart
  ARM: dts: mmp*: Fix serial port names
  ARM: dts: mmp2-brownstone: Don't redeclare phandle references
  ARM: dts: pxa*: Make the serial ports compatible with xscale-uart
  ARM: dts: pxa*: Fix serial port names
  ARM: dts: pxa*: Don't redeclare phandle references
  serial: omap: drop unused dt-bindings header
  serial: 8250: 8250_omap: Add DMA support for UARTs on K3 SoCs
  serial: 8250: 8250_omap: Work around errata causing spurious IRQs with DMA
  serial: 8250: 8250_omap: Extend driver data to pass FIFO trigger info
  serial: 8250: 8250_omap: Move locking out from __dma_rx_do_complete()
  serial: 8250: 8250_omap: Account for data in flight during DMA teardown
  ...
2020-03-31 16:18:55 -07:00
Linus Torvalds
674d85eb2d audit/stable-5.7 PR 20200330
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl6CZbEUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMbFA//Z1Z/lXhrt9GcWz1KrjZWhGn4Emv2
 kVPZpB+HhrGmPE6z1r7gHKTZIMxmUKapZJt+uwMO8Mdns16fD8LrhLc9NXqsFB7X
 tUq5YaePTHC1MMf2TGzuvsroMT0UXeMqNb043KBvAOjHijDL28KTCMSMz1biKHdY
 0Bj1U1f18YtnjdxsvGeBbHjjlhiGbZfXNq15dbIfR/fMKpN5p/3VHi57FN7otZbD
 jN1SniJfZLTXS2PKt/I8yL63aH3UY+81CqTbNsqXpSDzgqfVZOpsp5f5X7UfPH7+
 qF8Cs4iBDm/cykPWivoPgQ65T8kCmaxYiZ5b9cEHrw08FF7Iokn4cC6VbfMOksoM
 2tCQ59LdrZ6yAqox6jMF2a8JE4UJYGQjyhjfsAgmGgqgMdvZx+3Ac2umPG/YzuIa
 79CYVJBV+mdhogGWK6OMbGGOvl6bxMBCv1CmstCXQZQYy69SJGz21mnvkpWhwYoi
 d/TH0obqpVkQYvxXi+7ExmXY+7u6igytNGEdtQRjztUCMaXBH8X7s6B/jqmO8OcG
 UTdM2gcgNFshR2r/J7FdwYoQoxIb9CNCw2DkeX45NY52m8WZMMUHd7JAb3Zr5Ggf
 XCF62oazVBDvijBtFlX4gT4VHGZ9K99qO3IMcTOc8CQKWBSvYBcAOqzSs0Bu+jel
 xD9wSJHKUfKLUsQ=
 =xLho
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "We've got two audit patches for the v5.7 merge window with a stellar
  14 lines changed between the two patches. The patch descriptions are
  far more lengthy than the patches themselves, which is a very good
  thing for patches this size IMHO. The patches pass our test suites and
  a quick summary is below:

   - Stop logging inode information when updating an audit file watch.

     Since we are not changing the inode, or the fact that we are
     watching the associated file, the inode information is just noise
     that we can do without.

   - Fix a problem where mandatory audit records were missing their
     accompanying audit records (e.g. SYSCALL records were missing).

     The missing records often meant that we didn't have the necessary
     context to understand what was going on when the event occurred"

* tag 'audit-pr-20200330' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: trigger accompanying records when no rules present
  audit: CONFIG_CHANGE don't log internal bookkeeping as an event
2020-03-31 15:04:17 -07:00
Linus Torvalds
d9d7677892 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "A handful of changes:

   - two memory encryption related fixes

   - don't display the kernel's virtual memory layout plaintext on
     32-bit kernels either

   - two simplifications"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Remove the now redundant N_MEMORY check
  dma-mapping: Fix dma_pgprot() for unencrypted coherent pages
  x86: Don't let pgprot_modify() change the page encryption bit
  x86/mm/kmmio: Use this_cpu_ptr() instead get_cpu_var() for kmmio_ctx
  x86/mm/init/32: Stop printing the virtual memory layout
2020-03-31 11:51:05 -07:00
David S. Miller
ed52f2c608 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 19:52:37 -07:00
Linus Torvalds
d5f744f9a2 x86 entry code updates:
- Convert the 32bit syscalls to be pt_regs based which removes the
       requirement to push all 6 potential arguments onto the stack and
       consolidates the interface with the 64bit variant
 
     - The first small portion of the exception and syscall related entry
       code consolidation which aims to address the recently discovered
       issues vs. RCU, int3, NMI and some other exceptions which can
       interrupt any context. The bulk of the changes is still work in
       progress and aimed for 5.8.
 
     - A few lockdep namespace cleanups which have been applied into this
       branch to keep the prerequisites for the ongoing work confined.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B/TMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYA6EAC7r/bCMxBelljT3b7LkBbiJcocJ+zK
 OSzWU9miJGTAvYqn4/ciLKg4dA424b/1rBFlF1hBTCQ0HL5Cv4lajxdKEZCO5WCC
 WWTCz+MC60aWFaH3VNoywiLGb39H2IbqWbS9yNPd/wBkLHiMAD6NPQntOvcPaD4j
 1lyrMtLzfrWlrHxvxdI3kt5ZpFLYNXr2xk61xQjTz0ROFQBhf2sDsuhHhiYVLPj7
 JwYktpbBiPeaw2+I18NPymNPY+VfY8LCTgLl5M+rbKyCqebKaedZQJ7QXFhAEqKC
 Y2f+gJsKWtTDzGP2mk/5kF0uP7cd0vJK35ZCXtLZ9BbcNtFZU6w+ADqRo4pJBHRY
 QRzo/AWrdkuTJF0CrP6mcneNC7NwWLSdKrE1z77RQCHUPVvhHhRDZsgdLcZ/KKwx
 y1ji22trwNB+7LmI2fUOU5RRHZBIuNvQT+mPt24febJuHpZKul62dd3cqTGeSTC+
 MYVknYDSg/+jk+83DhuZnTyb9lWTbq/0Q1HRDu6l2LrMIH7YMPpY5Ea64ZFYzWXy
 s0+iHEM4mUzltwNauHIntjbwXi3C0l2k1WQyG0gun2eS6SXfu0lb93V4msFj/N1+
 oHavH2n2A4XrRr+Ob87fsl7nfXJibWP7R9xPblrWP2sNdqfjSyGd49rnsvpWqWMK
 Fj0d7tQ78+/SwA==
 =tWXS
 -----END PGP SIGNATURE-----

Merge tag 'x86-entry-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 entry code updates from Thomas Gleixner:

 - Convert the 32bit syscalls to be pt_regs based which removes the
   requirement to push all 6 potential arguments onto the stack and
   consolidates the interface with the 64bit variant

 - The first small portion of the exception and syscall related entry
   code consolidation which aims to address the recently discovered
   issues vs. RCU, int3, NMI and some other exceptions which can
   interrupt any context. The bulk of the changes is still work in
   progress and aimed for 5.8.

 - A few lockdep namespace cleanups which have been applied into this
   branch to keep the prerequisites for the ongoing work confined.

* tag 'x86-entry-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  x86/entry: Fix build error x86 with !CONFIG_POSIX_TIMERS
  lockdep: Rename trace_{hard,soft}{irq_context,irqs_enabled}()
  lockdep: Rename trace_softirqs_{on,off}()
  lockdep: Rename trace_hardirq_{enter,exit}()
  x86/entry: Rename ___preempt_schedule
  x86: Remove unneeded includes
  x86/entry: Drop asmlinkage from syscalls
  x86/entry/32: Enable pt_regs based syscalls
  x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments
  x86/entry/32: Rename 32-bit specific syscalls
  x86/entry/32: Clean up syscall_32.tbl
  x86/entry: Remove ABI prefixes from functions in syscall tables
  x86/entry/64: Add __SYSCALL_COMMON()
  x86/entry: Remove syscall qualifier support
  x86/entry/64: Remove ptregs qualifier from syscall table
  x86/entry: Move max syscall number calculation to syscallhdr.sh
  x86/entry/64: Split X32 syscall table into its own file
  x86/entry/64: Move sys_ni_syscall stub to common.c
  x86/entry/64: Use syscall wrappers for x32_rt_sigreturn
  x86/entry: Refactor SYS_NI macros
  ...
2020-03-30 19:14:28 -07:00
Linus Torvalds
dbb381b619 timekeeping and timer updates:
Core:
 
   - Consolidation of the vDSO build infrastructure to address the
     difficulties of cross-builds for ARM64 compat vDSO libraries by
     restricting the exposure of header content to the vDSO build.
 
     This is achieved by splitting out header content into separate
     headers. which contain only the minimaly required information which is
     necessary to build the vDSO. These new headers are included from the
     kernel headers and the vDSO specific files.
 
   - Enhancements to the generic vDSO library allowing more fine grained
     control over the compiled in code, further reducing architecture
     specific storage and preparing for adopting the generic library by PPC.
 
   - Cleanup and consolidation of the exit related code in posix CPU timers.
 
   - Small cleanups and enhancements here and there
 
  Drivers:
 
   - The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support
 
   - Correct the clock rate of PIT64b global clock
 
   - setup_irq() cleanup
 
   - Preparation for PWM and suspend support for the TI DM timer
 
   - Expand the fttmr010 driver to support ast2600 systems
 
   - The usual small fixes, enhancements and cleanups all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B+QETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofJ5D/94s5fpaqiuNcaAsLq2D3DRIrTnqxx7
 yEeAOPcbYV1bM1SgY/M83L5yGc2S8ny787e26abwRTCZhZV3eAmRTphIFFIZR0Xk
 xS+i67odscbdJTRtztKj3uQ9rFxefszRuphyaa89pwSY9nnyMWLcahGSQOGs0LJK
 hvmgwPjyM1drNfPxgPiaFg7vDr2XxNATpQr/FBt+BhelvVan8TlAfrkcNPiLr++Y
 Axz925FP7jMaRRbZ1acji34gLiIAZk0jLCUdbix7YkPrqDB4GfO+v8Vez+fGClbJ
 uDOYeR4r1+Be/BtSJtJ2tHqtsKCcAL6agtaE2+epZq5HbzaZFRvBFaxgFNF8WVcn
 3FFibdEMdsRNfZTUVp5wwgOLN0UIqE/7LifE12oLEL2oFB5H2PiNEUw3E02XHO11
 rL3zgHhB6Ke1sXKPCjSGdmIQLbxZmV5kOlQFy7XuSeo5fmRapVzKNffnKcftIliF
 1HNtZbgdA+3tdxMFCqoo1QX+kotl9kgpslmdZ0qHAbaRb3xqLoSskbqEjFRMuSCC
 8bjJrwboD9T5GPfwodSCgqs/58CaSDuqPFbIjCay+p90Fcg6wWAkZtyG04ZLdPRc
 GgNNdN4gjTD9bnrRi8cH47z1g8OO4vt4K4SEbmjo8IlDW+9jYMxuwgR88CMeDXd7
 hu7aKsr2I2q/WQ==
 =5o9G
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timekeeping and timer updates from Thomas Gleixner:
 "Core:

   - Consolidation of the vDSO build infrastructure to address the
     difficulties of cross-builds for ARM64 compat vDSO libraries by
     restricting the exposure of header content to the vDSO build.

     This is achieved by splitting out header content into separate
     headers. which contain only the minimaly required information which
     is necessary to build the vDSO. These new headers are included from
     the kernel headers and the vDSO specific files.

   - Enhancements to the generic vDSO library allowing more fine grained
     control over the compiled in code, further reducing architecture
     specific storage and preparing for adopting the generic library by
     PPC.

   - Cleanup and consolidation of the exit related code in posix CPU
     timers.

   - Small cleanups and enhancements here and there

  Drivers:

   - The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support

   - Correct the clock rate of PIT64b global clock

   - setup_irq() cleanup

   - Preparation for PWM and suspend support for the TI DM timer

   - Expand the fttmr010 driver to support ast2600 systems

   - The usual small fixes, enhancements and cleanups all over the
     place"

* tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
  Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
  vdso: Fix clocksource.h macro detection
  um: Fix header inclusion
  arm64: vdso32: Enable Clang Compilation
  lib/vdso: Enable common headers
  arm: vdso: Enable arm to use common headers
  x86/vdso: Enable x86 to use common headers
  mips: vdso: Enable mips to use common headers
  arm64: vdso32: Include common headers in the vdso library
  arm64: vdso: Include common headers in the vdso library
  arm64: Introduce asm/vdso/processor.h
  arm64: vdso32: Code clean up
  linux/elfnote.h: Replace elf.h with UAPI equivalent
  scripts: Fix the inclusion order in modpost
  common: Introduce processor.h
  linux/ktime.h: Extract common header for vDSO
  linux/jiffies.h: Extract common header for vDSO
  linux/time64.h: Extract common header for vDSO
  linux/time32.h: Extract common header for vDSO
  linux/time.h: Extract common header for vDSO
  ...
2020-03-30 18:51:47 -07:00
Linus Torvalds
336622e9fc NOHZ full updates:
- Remove TIF_NOHZ from 3 architectures
 
     These architectures use a static key to decide whether context tracking
     needs to be invoked and the TIF_NOHZ flag just causes a pointless
     slowpath execution for nothing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B+bITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZjpD/9PkXE/zQoVmPLhOOcEBXB4i0rQQV41
 mR8F83aswch+qtT1g7A00G5j49CWkLh/hj5PX7ajS9nSTQCHOQ9jdZuxPrjW8CGZ
 gMHCyd5o9C98sKOORylR2nuCKhVdOq0/HleRjBBDsqcO0T5KlhVPUrtuJ878kX8d
 1SnoZnZMx+Ro0+4+Ehp39CmZJ0pV6o5ypT469esa2MB1xw389AQCmLt4rk99FNMo
 LDbKAB+7XBwNAu/rqD0hIv7YyvaSlcdlWBAXBLeCrwVIKQG3VfT9CpgwTtGoNFhY
 9KBkzr0z+lvHS9eKWyWzpXYrgVU1u28gUVvpaavv+Ma5V8STqNunoMBs7hKanJqV
 mPh+4ABACtFieKlwkj2PwUrGEgH+y/SAfStliOFsimVz/w2udC0S777/EjjzfKaN
 NS13mP19s5/P1q3y/6BSrOxYD0inicROO+UfetHNPOgMePY+Gp/xzluefPnhTagX
 CnJxndA3Fbjh9rXFbSZ5TMlf97kTxVVJE+qtrh5Upw1AWpo/qvkLsIFsamgyW2jR
 7t3MbHzKYnLkUJlwOLPJimvZeN4hZOx05ra/RZOkVaxri7xtVsDCkaEvhgEqLWYj
 Gbt2mGnNccawwN0bVPd2hgkKmUBqO8u5llhQcM2BBG4CJgZMaB8LjIkS+F6FsSME
 xMnY+tS3c7Q8TQ==
 =0lHP
 -----END PGP SIGNATURE-----

Merge tag 'timers-nohz-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull NOHZ update from Thomas Gleixner:
 "Remove TIF_NOHZ from three architectures

  These architectures use a static key to decide whether context
  tracking needs to be invoked and the TIF_NOHZ flag just causes a
  pointless slowpath execution for nothing"

* tag 'timers-nohz-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arm64: Remove TIF_NOHZ
  arm: Remove TIF_NOHZ
  x86: Remove TIF_NOHZ
  context-tracking: Introduce CONFIG_HAVE_TIF_NOHZ
  x86/entry: Remove _TIF_NOHZ from _TIF_WORK_SYSCALL_ENTRY
2020-03-30 18:29:05 -07:00
Linus Torvalds
992a1a3b45 CPU (hotplug) updates:
- Support for locked CSD objects in smp_call_function_single_async()
     which allows to simplify callsites in the scheduler core and MIPS
 
   - Treewide consolidation of CPU hotplug functions which ensures the
     consistency between the sysfs interface and kernel state. The low level
     functions cpu_up/down() are now confined to the core code and not
     longer accessible from random code.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B9VQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodCyD/0WFYAe7LkOfNjkbLa0IeuyLjF9rnCi
 ilcSXMLpaVwwoQvm7MopwkXUDdmEIyeJ0B641j3mC3AKCRap4+O36H2IEg2byrj7
 twOvQNCfxpVVmCCD11FTH9aQa74LEB6AikTgjevhrRWj6eHsal7c2Ak26AzCgrt+
 0eEkOAOWJbLAlbIiPdHlCZ3TMldcs3gg+lRSYd5QCGQVkZFnwpXzyOvpyJEUGGbb
 R/JuvwJoLhRMiYAJDILoQQQg/J07ODuivse/R8PWaH2djkn+2NyRGrD794PhyyOg
 QoTU0ZrYD3Z48ACXv+N3jLM7wXMcFzjYtr1vW1E3O/YGA7GVIC6XHGbMQ7tEihY0
 ajtwq8DcnpKtuouviYnf7NuKgqdmJXkaZjz3Gms6n8nLXqqSVwuQELWV2CXkxNe6
 9kgnnKK+xXMOGI4TUhN8bejvkXqRCmKMeQJcWyf+7RA9UIhAJw5o7WGo8gXfQWUx
 tazCqDy/inYjqGxckW615fhi2zHfemlYTbSzIGOuMB1TEPKFcrgYAii/VMsYHQVZ
 5amkYUXGQ5brlCOzOn38lzp5OkALBnFzD7xgvOcQgWT3ynVpdqADfBytXiEEHh4J
 KSkSgSSRcS58397nIxnDcJgJouHLvAWYyPZ4UC6mfynuQIic31qMHGVqwdbEKMY3
 4M5dGgqIfOBgYw==
 =jwCg
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core SMP updates from Thomas Gleixner:
 "CPU (hotplug) updates:

   - Support for locked CSD objects in smp_call_function_single_async()
     which allows to simplify callsites in the scheduler core and MIPS

   - Treewide consolidation of CPU hotplug functions which ensures the
     consistency between the sysfs interface and kernel state. The low
     level functions cpu_up/down() are now confined to the core code and
     not longer accessible from random code"

* tag 'smp-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus()
  cpu/hotplug: Hide cpu_up/down()
  cpu/hotplug: Move bringup of secondary CPUs out of smp_init()
  torture: Replace cpu_up/down() with add/remove_cpu()
  firmware: psci: Replace cpu_up/down() with add/remove_cpu()
  xen/cpuhotplug: Replace cpu_up/down() with device_online/offline()
  parisc: Replace cpu_up/down() with add/remove_cpu()
  sparc: Replace cpu_up/down() with add/remove_cpu()
  powerpc: Replace cpu_up/down() with add/remove_cpu()
  x86/smp: Replace cpu_up/down() with add/remove_cpu()
  arm64: hibernate: Use bringup_hibernate_cpu()
  cpu/hotplug: Provide bringup_hibernate_cpu()
  arm64: Use reboot_cpu instead of hardconding it to 0
  arm64: Don't use disable_nonboot_cpus()
  ARM: Use reboot_cpu instead of hardcoding it to 0
  ARM: Don't use disable_nonboot_cpus()
  ia64: Replace cpu_down() with smp_shutdown_nonboot_cpus()
  cpu/hotplug: Create a new function to shutdown nonboot cpus
  cpu/hotplug: Add new {add,remove}_cpu() functions
  sched/core: Remove rq.hrtick_csd_pending
  ...
2020-03-30 18:06:39 -07:00
Andrii Nakryiko
0c991ebc8c bpf: Implement bpf_prog replacement for an active bpf_cgroup_link
Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from
under given bpf_link. Currently this is only supported for bpf_cgroup_link,
but will be extended to other kinds of bpf_links in follow-up patches.

For bpf_cgroup_link, implemented functionality matches existing semantics for
direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
unconditionally set new bpf_prog regardless of which bpf_prog is currently
active under given bpf_link, or, optionally, can specify expected active
bpf_prog. If active bpf_prog doesn't match expected one, no changes are
performed, old bpf_link stays intact and attached, operation returns
a failure.

cgroup_bpf_replace() operation is resolving race between auto-detachment and
bpf_prog update in the same fashion as it's done for bpf_link detachment,
except in this case update has no way of succeeding because of target cgroup
marked as dying. So in this case error is returned.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
2020-03-30 17:36:33 -07:00
Andrii Nakryiko
af6eea5743 bpf: Implement bpf_link-based cgroup BPF program attachment
Implement new sub-command to attach cgroup BPF programs and return FD-based
bpf_link back on success. bpf_link, once attached to cgroup, cannot be
replaced, except by owner having its FD. Cgroup bpf_link supports only
BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
attachments can be freely intermixed.

To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
BPF program can be executed, implement auto-detachment of link. When
cgroup_bpf_release() is called, all attached bpf_links are forced to release
cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
well as still owning underlying bpf_prog. This is because user-space might
still have FDs open and active, so bpf_link as a user-referenced object can't
be freed yet. Once last active FD is closed, bpf_link will be freed and
underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
touched, because cgroup is released already.

The inherent race between bpf_cgroup_link release (from closing last FD) and
cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
the only additional check required is when bpf_cgroup_link attempts to detach
itself from cgroup. At that time we need to check whether there is still
cgroup associated with that link. And if not, exit with success, because
bpf_cgroup_link was already successfully detached.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
2020-03-30 17:35:59 -07:00
Linus Torvalds
2d385336af Updates for the interrupt subsystem:
Treewide:
 
     - Cleanup of setup_irq() which is not longer required because the
       memory allocator is available early. Most cleanup changes come
       through the various maintainer trees, so the final removal of
       setup_irq() is postponed towards the end of the merge window.
 
   Core:
 
     - Protection against unsafe invocation of interrupt handlers and unsafe
       interrupt injection including a fixup of the offending PCI/AER error
       injection mechanism.
 
       Invoking interrupt handlers from arbitrary contexts, i.e. outside of
       an actual interrupt, can cause inconsistent state on the fragile
       x86 interrupt affinity changing hardware trainwreck.
 
   Drivers:
 
     - Second wave of support for the new ARM GICv4.1
     - Multi-instance support for Xilinx and PLIC interrupt controllers
     - CPU-Hotplug support for PLIC
     - The obligatory new driver for X1000 TCU
     - Enhancements, cleanups and fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6B888THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeMJD/9v8GcI/DSY87Fmo7s4odLFVU0J8zZ6
 7QlYjSPm4yWv4pqn1TEnEF2pKz5X9Euhoh8BmdMKtdXBqlS4Ix9N+pH8ModcxyQo
 aX97zuRUxvqfeeVE+yQRwbbMREj9jj9RW8FRtA39+l5H3uC1GDcc+2aAMIaykQ7+
 8lo/6wBd8ZrZ0gsNf4KjlBwMDYAlQSRWxrff38PQ2XRpGKowdp8JFYZuq5Vp0ljJ
 r2cE75ldmFSfmtuhhVroBRY0GAqW4/8v8/syAN3Q9jOEII60qhA0dqR085B9veWa
 DHSqgLmzyUFFXN7Ntzt/fDirJVsIM4BE9qGu3ftCYHMaPB8hG+xqjbZe9E3D2e/d
 +0Pb3TG8EHVOIwzv1t9+6462qYGkBhmBXtbj6GptPYk2Ai4HZlNaSsa8jUNyHvGz
 WDegdRjt7O5RjqDH/VwrQxW/AEp05f/1egweBXbq9aF6j9nqeOur75c/PdxZxAX5
 WUMtouXP2WN+sMW8k1T5cmVMGWxLGBB0wwG4LC/mXzHnkDiN1+2wEUHmhS8Voi3q
 3HXeYBJeukUYbVvMKRvWVAD330TxFjAyd6pPwCdoNY2ZngJnQWlDD9vbYYX2osoW
 kP+KhIANNBVqdK7NqlLoqcr3SdHn01pQYuVHejNzxb7E6/mmpMlaYDJc/rMPi/eM
 0/rzl8fAj/WyBQ==
 =DZ/G
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Updates for the interrupt subsystem:

  Treewide:

    - Cleanup of setup_irq() which is not longer required because the
      memory allocator is available early.

      Most cleanup changes come through the various maintainer trees, so
      the final removal of setup_irq() is postponed towards the end of
      the merge window.

  Core:

    - Protection against unsafe invocation of interrupt handlers and
      unsafe interrupt injection including a fixup of the offending
      PCI/AER error injection mechanism.

      Invoking interrupt handlers from arbitrary contexts, i.e. outside
      of an actual interrupt, can cause inconsistent state on the
      fragile x86 interrupt affinity changing hardware trainwreck.

  Drivers:

    - Second wave of support for the new ARM GICv4.1

    - Multi-instance support for Xilinx and PLIC interrupt controllers

    - CPU-Hotplug support for PLIC

    - The obligatory new driver for X1000 TCU

    - Enhancements, cleanups and fixes all over the place"

* tag 'irq-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  unicore32: Replace setup_irq() by request_irq()
  sh: Replace setup_irq() by request_irq()
  hexagon: Replace setup_irq() by request_irq()
  c6x: Replace setup_irq() by request_irq()
  alpha: Replace setup_irq() by request_irq()
  irqchip/gic-v4.1: Eagerly vmap vPEs
  irqchip/gic-v4.1: Add VSGI property setup
  irqchip/gic-v4.1: Add VSGI allocation/teardown
  irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer
  irqchip/gic-v4.1: Plumb set_vcpu_affinity SGI callbacks
  irqchip/gic-v4.1: Plumb get/set_irqchip_state SGI callbacks
  irqchip/gic-v4.1: Plumb mask/unmask SGI callbacks
  irqchip/gic-v4.1: Add initial SGI configuration
  irqchip/gic-v4.1: Plumb skeletal VSGI irqchip
  irqchip/stm32: Retrigger both in eoi and unmask callbacks
  irqchip/gic-v3: Move irq_domain_update_bus_token to after checking for NULL domain
  irqchip/xilinx: Do not call irq_set_default_host()
  irqchip/xilinx: Enable generic irq multi handler
  irqchip/xilinx: Fill error code when irq domain registration fails
  irqchip/xilinx: Add support for multiple instances
  ...
2020-03-30 17:35:14 -07:00
Linus Torvalds
642e53ead6 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Various NUMA scheduling updates: harmonize the load-balancer and
     NUMA placement logic to not work against each other. The intended
     result is better locality, better utilization and fewer migrations.

   - Introduce Thermal Pressure tracking and optimizations, to improve
     task placement on thermally overloaded systems.

   - Implement frequency invariant scheduler accounting on (some) x86
     CPUs. This is done by observing and sampling the 'recent' CPU
     frequency average at ~tick boundaries. The CPU provides this data
     via the APERF/MPERF MSRs. This hopefully makes our capacity
     estimates more precise and keeps tasks on the same CPU better even
     if it might seem overloaded at a lower momentary frequency. (As
     usual, turbo mode is a complication that we resolve by observing
     the maximum frequency and renormalizing to it.)

   - Add asymmetric CPU capacity wakeup scan to improve capacity
     utilization on asymmetric topologies. (big.LITTLE systems)

   - PSI fixes and optimizations.

   - RT scheduling capacity awareness fixes & improvements.

   - Optimize the CONFIG_RT_GROUP_SCHED constraints code.

   - Misc fixes, cleanups and optimizations - see the changelog for
     details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
  threads: Update PID limit comment according to futex UAPI change
  sched/fair: Fix condition of avg_load calculation
  sched/rt: cpupri_find: Trigger a full search as fallback
  kthread: Do not preempt current task if it is going to call schedule()
  sched/fair: Improve spreading of utilization
  sched: Avoid scale real weight down to zero
  psi: Move PF_MEMSTALL out of task->flags
  MAINTAINERS: Add maintenance information for psi
  psi: Optimize switching tasks inside shared cgroups
  psi: Fix cpu.pressure for cpu.max and competing cgroups
  sched/core: Distribute tasks within affinity masks
  sched/fair: Fix enqueue_task_fair warning
  thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code
  sched/rt: Remove unnecessary push for unfit tasks
  sched/rt: Allow pulling unfitting task
  sched/rt: Optimize cpupri_find() on non-heterogenous systems
  sched/rt: Re-instate old behavior in select_task_rq_rt()
  sched/rt: cpupri_find: Implement fallback mechanism for !fit case
  sched/fair: Fix reordering of enqueue/dequeue_task_fair()
  sched/fair: Fix runnable_avg for throttled cfs
  ...
2020-03-30 17:01:51 -07:00
Linus Torvalds
9b82f05f86 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main changes in this cycle were:

  Kernel side changes:

   - A couple of x86/cpu cleanups and changes were grandfathered in due
     to patch dependencies. These clean up the set of CPU model/family
     matching macros with a consistent namespace and C99 initializer
     style.

   - A bunch of updates to various low level PMU drivers:
       * AMD Family 19h L3 uncore PMU
       * Intel Tiger Lake uncore support
       * misc fixes to LBR TOS sampling

   - optprobe fixes

   - perf/cgroup: optimize cgroup event sched-in processing

   - misc cleanups and fixes

  Tooling side changes are to:

   - perf {annotate,expr,record,report,stat,test}

   - perl scripting

   - libapi, libperf and libtraceevent

   - vendor events on Intel and S390, ARM cs-etm

   - Intel PT updates

   - Documentation changes and updates to core facilities

   - misc cleanups, fixes and other enhancements"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
  cpufreq/intel_pstate: Fix wrong macro conversion
  x86/cpu: Cleanup the now unused CPU match macros
  hwrng: via_rng: Convert to new X86 CPU match macros
  crypto: Convert to new CPU match macros
  ASoC: Intel: Convert to new X86 CPU match macros
  powercap/intel_rapl: Convert to new X86 CPU match macros
  PCI: intel-mid: Convert to new X86 CPU match macros
  mmc: sdhci-acpi: Convert to new X86 CPU match macros
  intel_idle: Convert to new X86 CPU match macros
  extcon: axp288: Convert to new X86 CPU match macros
  thermal: Convert to new X86 CPU match macros
  hwmon: Convert to new X86 CPU match macros
  platform/x86: Convert to new CPU match macros
  EDAC: Convert to new X86 CPU match macros
  cpufreq: Convert to new X86 CPU match macros
  ACPI: Convert to new X86 CPU match macros
  x86/platform: Convert to new CPU match macros
  x86/kernel: Convert to new CPU match macros
  x86/kvm: Convert to new CPU match macros
  x86/perf/events: Convert to new CPU match macros
  ...
2020-03-30 16:40:08 -07:00
Linus Torvalds
4b9fd8a829 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Continued user-access cleanups in the futex code.

   - percpu-rwsem rewrite that uses its own waitqueue and atomic_t
     instead of an embedded rwsem. This addresses a couple of
     weaknesses, but the primary motivation was complications on the -rt
     kernel.

   - Introduce raw lock nesting detection on lockdep
     (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
     lock differences. This too originates from -rt.

   - Reuse lockdep zapped chain_hlocks entries, to conserve RAM
     footprint on distro-ish kernels running into the "BUG:
     MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
     chain-entries pool.

   - Misc cleanups, smaller fixes and enhancements - see the changelog
     for details"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
  fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
  thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
  Documentation/locking/locktypes: Minor copy editor fixes
  Documentation/locking/locktypes: Further clarifications and wordsmithing
  m68knommu: Remove mm.h include from uaccess_no.h
  x86: get rid of user_atomic_cmpxchg_inatomic()
  generic arch_futex_atomic_op_inuser() doesn't need access_ok()
  x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
  x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
  objtool: whitelist __sanitizer_cov_trace_switch()
  [parisc, s390, sparc64] no need for access_ok() in futex handling
  sh: no need of access_ok() in arch_futex_atomic_op_inuser()
  futex: arch_futex_atomic_op_inuser() calling conventions change
  completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
  lockdep: Add posixtimer context tracing bits
  lockdep: Annotate irq_work
  lockdep: Add hrtimer context tracing bits
  lockdep: Introduce wait-type checks
  completion: Use simple wait queues
  sched/swait: Prepare usage in completions
  ...
2020-03-30 16:17:15 -07:00
Linus Torvalds
7c4fa15071 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Make kfree_rcu() use kfree_bulk() for added performance

   - RCU updates

   - Callback-overload handling updates

   - Tasks-RCU KCSAN and sparse updates

   - Locking torture test and RCU torture test updates

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Make rcu_barrier() account for offline no-CBs CPUs
  rcu: Mark rcu_state.gp_seq to detect concurrent writes
  Documentation/memory-barriers: Fix typos
  doc: Add rcutorture scripting to torture.txt
  doc/RCU/rcu: Use https instead of http if possible
  doc/RCU/rcu: Use absolute paths for non-rst files
  doc/RCU/rcu: Use ':ref:' for links to other docs
  doc/RCU/listRCU: Update example function name
  doc/RCU/listRCU: Fix typos in a example code snippets
  doc/RCU/Design: Remove remaining HTML tags in ReST files
  doc: Add some more RCU list patterns in the kernel
  rcutorture: Set KCSAN Kconfig options to detect more data races
  rcutorture: Manually clean up after rcu_barrier() failure
  rcutorture: Make rcu_torture_barrier_cbs() post from corresponding CPU
  rcuperf: Measure memory footprint during kfree_rcu() test
  rcutorture: Annotation lockless accesses to rcu_torture_current
  rcutorture: Add READ_ONCE() to rcu_torture_count and rcu_torture_batch
  rcutorture: Fix stray access to rcu_fwd_cb_nodelay
  rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race
  rcutorture: Make kvm-find-errors.sh abort on bad directory
  ...
2020-03-30 15:52:00 -07:00
Linus Torvalds
49835c15a5 Power management updates for 5.7-rc1
- Clean up and rework the PM QoS API to simplify the code and
    reduce the size of it (Rafael Wysocki).
 
  - Fix a suspend-to-idle wakeup regression on Dell XPS13 9370
    and similar platforms where the USB plug/unplug events are
    handled by the EC (Rafael Wysocki).
 
  - CLean up the intel_idle and PSCI cpuidle drivers (Rafael Wysocki,
    Ulf Hansson).
 
  - Extend the haltpoll cpuidle driver so that it can be forced to
    run on some systems where it refused to load (Maciej Szmigiero).
 
  - Convert several cpufreq documents to the .rst format and move the
    legacy driver documentation into one common file (Mauro Carvalho
    Chehab, Rafael Wysocki).
 
  - Update several cpufreq drivers:
 
    * Extend and fix the imx-cpufreq-dt driver (Anson Huang).
 
    * Improve the -EPROBE_DEFER handling and fix unwanted CPU
      overclocking on i.MX6ULL in imx6q-cpufreq (Anson Huang,
      Christoph Niedermaier).
 
    * Add support for Krait based SoCs to the qcom driver (Ansuel
      Smith).
 
    * Add support for OPP_PLUS to ti-cpufreq (Lokesh Vutla).
 
    * Add platform specific intermediate callbacks support to
      cpufreq-dt and update the imx6q driver (Peng Fan).
 
    * Simplify and consolidate some pieces of the intel_pstate driver
      and update its documentation (Rafael Wysocki, Alex Hung).
 
  - Fix several devfreq issues:
 
    * Remove unneeded extern keyword from a devfreq header file
      and use the DEVFREQ_GOV_UPDATE_INTERNAL event name instead of
      DEVFREQ_GOV_INTERNAL (Chanwoo Choi).
 
    * Fix the handling of dev_pm_qos_remove_request() result (Leonard
      Crestez).
 
    * Use constant name for userspace governor (Pierre Kuo).
 
    * Get rid of doc warnings and fix a typo (Christophe JAILLET).
 
  - Use built-in RCU list checking in some places in the PM core to
    avoid false-positive RCU usage warnings (Madhuparna Bhowmik).
 
  - Add explicit READ_ONCE()/WRITE_ONCE() annotations to low-level
    PM QoS routines (Qian Cai).
 
  - Fix removal of wakeup sources to avoid NULL pointer dereferences
    in a corner case (Neeraj Upadhyay).
 
  - Clean up the handling of hibernate compat ioctls and fix the
    related documentation (Eric Biggers).
 
  - Update the idle_inject power capping driver to use variable-length
    arrays instead of zero-length arrays (Gustavo Silva).
 
  - Fix list format in a PM QoS document (Randy Dunlap).
 
  - Make the cpufreq stats module use scnprintf() to avoid potential
    buffer overflows (Takashi Iwai).
 
  - Add pm_runtime_get_if_active() to PM-runtime API (Sakari Ailus).
 
  - Allow no domain-idle-states DT property in generic PM domains (Ulf
    Hansson).
 
  - Fix a broken y-axis scale in the intel_pstate_tracer utility (Doug
    Smythies).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl6B/YkSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxEjIP/jXoO1pAxq7BMx7naZnZL7pzemJfAGR7
 HVnRLDo0IlsSwI7Jvuy13a0eI+EcGPA6pRo5qnBM4TZCIFsHoO5Yle47ndNGsi8r
 Jd3T89oT3I+fXI4KTfWO0n+K/F6mv8/CTZDz/E7Z6zirpFxyyZQxgIsAT76RcZom
 xhWna9vygOlBnFsQaAeph+GzoXBWIylaMZfylUeT3v4c4DLH6FzcbnINPkgJsZCw
 Ayt1bmE0L9yiqCizEto91eaDObkxTHVFGr2OVNa/Y/SVW+VUThUJrXqV28opQxPZ
 h4TiQorpTX1CwMmiXZwmoeqqsiVXrm0KyhK0lwc5tZ9FnZWiW4qjJ487Eu6TjOmh
 gecT+M2Yexy0BvUGN0wIdaCLtfmf2Hjxk0trxM2blAh3uoFjf3UJ9SLNkRjlu2/b
 QqWmIRRPljD5fEUid5lVV4EAXuITUzWMJeia+FiAsgx1SF3pZPar80f+FGrYfaJN
 wL2BTwBx1aXpPpAkEX0kM9Rkf6oJsFATR3p7DNzyZ1bMrQUxiToWRlQBID5H6G4v
 /kAkSTQjNQVwkkylUzTLOlcmL56sCvc0YPdybH62OsLXs9K4gyC8v6tEdtdA5qtw
 0Up9DrYbNKKv6GrSXf8eyk2Q2CEqfRXHv2ACNnkLRXZ6fWnFiTfMgNj7zqtrfna7
 tJBvrV9/ACXE
 =cBQd
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These clean up and rework the PM QoS API, address a suspend-to-idle
  wakeup regression on some ACPI-based platforms, clean up and extend a
  few cpuidle drivers, update multiple cpufreq drivers and cpufreq
  documentation, and fix a number of issues in devfreq and several other
  things all over.

  Specifics:

   - Clean up and rework the PM QoS API to simplify the code and reduce
     the size of it (Rafael Wysocki).

   - Fix a suspend-to-idle wakeup regression on Dell XPS13 9370 and
     similar platforms where the USB plug/unplug events are handled by
     the EC (Rafael Wysocki).

   - CLean up the intel_idle and PSCI cpuidle drivers (Rafael Wysocki,
     Ulf Hansson).

   - Extend the haltpoll cpuidle driver so that it can be forced to run
     on some systems where it refused to load (Maciej Szmigiero).

   - Convert several cpufreq documents to the .rst format and move the
     legacy driver documentation into one common file (Mauro Carvalho
     Chehab, Rafael Wysocki).

   - Update several cpufreq drivers:

        * Extend and fix the imx-cpufreq-dt driver (Anson Huang).

        * Improve the -EPROBE_DEFER handling and fix unwanted CPU
          overclocking on i.MX6ULL in imx6q-cpufreq (Anson Huang,
          Christoph Niedermaier).

        * Add support for Krait based SoCs to the qcom driver (Ansuel
          Smith).

        * Add support for OPP_PLUS to ti-cpufreq (Lokesh Vutla).

        * Add platform specific intermediate callbacks support to
          cpufreq-dt and update the imx6q driver (Peng Fan).

        * Simplify and consolidate some pieces of the intel_pstate
          driver and update its documentation (Rafael Wysocki, Alex
          Hung).

   - Fix several devfreq issues:

        * Remove unneeded extern keyword from a devfreq header file and
          use the DEVFREQ_GOV_UPDATE_INTERNAL event name instead of
          DEVFREQ_GOV_INTERNAL (Chanwoo Choi).

        * Fix the handling of dev_pm_qos_remove_request() result
          (Leonard Crestez).

        * Use constant name for userspace governor (Pierre Kuo).

        * Get rid of doc warnings and fix a typo (Christophe JAILLET).

   - Use built-in RCU list checking in some places in the PM core to
     avoid false-positive RCU usage warnings (Madhuparna Bhowmik).

   - Add explicit READ_ONCE()/WRITE_ONCE() annotations to low-level PM
     QoS routines (Qian Cai).

   - Fix removal of wakeup sources to avoid NULL pointer dereferences in
     a corner case (Neeraj Upadhyay).

   - Clean up the handling of hibernate compat ioctls and fix the
     related documentation (Eric Biggers).

   - Update the idle_inject power capping driver to use variable-length
     arrays instead of zero-length arrays (Gustavo Silva).

   - Fix list format in a PM QoS document (Randy Dunlap).

   - Make the cpufreq stats module use scnprintf() to avoid potential
     buffer overflows (Takashi Iwai).

   - Add pm_runtime_get_if_active() to PM-runtime API (Sakari Ailus).

   - Allow no domain-idle-states DT property in generic PM domains (Ulf
     Hansson).

   - Fix a broken y-axis scale in the intel_pstate_tracer utility (Doug
     Smythies)"

* tag 'pm-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (78 commits)
  cpufreq: intel_pstate: Simplify intel_pstate_cpu_init()
  tools/power/x86/intel_pstate_tracer: fix a broken y-axis scale
  ACPI: PM: s2idle: Refine active GPEs check
  ACPICA: Allow acpi_any_gpe_status_set() to skip one GPE
  PM: sleep: wakeup: Skip wakeup_source_sysfs_remove() if device is not there
  PM / devfreq: Get rid of some doc warnings
  PM / devfreq: Fix handling dev_pm_qos_remove_request result
  PM / devfreq: Fix a typo in a comment
  PM / devfreq: Change to DEVFREQ_GOV_UPDATE_INTERVAL event name
  PM / devfreq: Remove unneeded extern keyword
  PM / devfreq: Use constant name of userspace governor
  ACPI: PM: s2idle: Fix comment in acpi_s2idle_prepare_late()
  cpufreq: qcom: Add support for krait based socs
  cpufreq: imx6q-cpufreq: Improve the logic of -EPROBE_DEFER handling
  cpufreq: Use scnprintf() for avoiding potential buffer overflow
  cpuidle: psci: Split psci_dt_cpu_init_idle()
  PM / Domains: Allow no domain-idle-states DT property in genpd when parsing
  PM / hibernate: Remove unnecessary compat ioctl overrides
  PM: hibernate: fix docs for ioctls that return loff_t via pointer
  Documentation: intel_pstate: update links for references
  ...
2020-03-30 15:05:01 -07:00
John Fastabend
fa123ac022 bpf: Verifier, refine 32bit bound in do_refine_retval_range
Further refine return values range in do_refine_retval_range by noting
these are int return types (We will assume here that int is a 32-bit type).

Two reasons to pull this out of original patch. First it makes the original
fix impossible to backport. And second I've not seen this as being problematic
in practice unlike the other case.

Fixes: 849fa50662 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158560421952.10843.12496354931526965046.stgit@john-Precision-5820-Tower
2020-03-30 15:00:30 -07:00
John Fastabend
3f50f132d8 bpf: Verifier, do explicit ALU32 bounds tracking
It is not possible for the current verifier to track ALU32 and JMP ops
correctly. This can result in the verifier aborting with errors even though
the program should be verifiable. BPF codes that hit this can work around
it by changin int variables to 64-bit types, marking variables volatile,
etc. But this is all very ugly so it would be better to avoid these tricks.

But, the main reason to address this now is do_refine_retval_range() was
assuming return values could not be negative. Once we fixed this code that
was previously working will no longer work. See do_refine_retval_range()
patch for details. And we don't want to suddenly cause programs that used
to work to fail.

The simplest example code snippet that illustrates the problem is likely
this,

 53: w8 = w0                    // r8 <- [0, S32_MAX],
                                // w8 <- [-S32_MIN, X]
 54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                // w8 <- [0, X]

The expected 64-bit and 32-bit bounds after each line are shown on the
right. The current issue is without the w* bounds we are forced to use
the worst case bound of [0, U32_MAX]. To resolve this type of case,
jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit
32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then
from branch_taken logic creating new bounds we can track 32-bit bounds
explicitly.

The next case we observed is ALU ops after the jmp32,

 53: w8 = w0                    // r8 <- [0, S32_MAX],
                                // w8 <- [-S32_MIN, X]
 54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                // w8 <- [0, X]
 55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                // w8 <- [0, X+1]

In order to keep the bounds accurate at this point we also need to track
ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
ops, mov, add, sub, etc.

Finally there is a question of how and when to merge bounds. The cases
enumerate here,

1. MOV ALU32   - zext 32-bit -> 64-bit
2. MOV ALU64   - copy 64-bit -> 32-bit
3. op  ALU32   - zext 32-bit -> 64-bit
4. op  ALU64   - n/a
5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
6. jmp ALU64   - 32-bit: (>> (<< var64_off))

Details for each case,

For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
bounds correctly. See zext_32_to_64.

For "MOV ALU64" copy all bounds including 32-bit into new register. If
the src register had 32-bit bounds the dst register will as well.

For "op ALU32" zero extend 32-bit into 64-bit the same as move,
see zext_32_to_64.

For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
is done here. Except we have a special case. When RSH or ARSH is
done we can't simply ignore shifting bits from 64-bit reg into the
32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
This will be correct in the sense that they will represent a valid
state of the register. However we could lose some accuracy if an
ARSH is following a jmp32 operation. We can handle this special
case in a follow up series.

For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
special case if 64-bit bounds has zero'd upper 32bits at which point
we can simply copy 32-bit bounds into 64-bit register. This catches
a common compiler trick where upper 32-bits are zeroed and then
32-bit ops are used followed by a 64-bit compare or 64-bit op on
a pointer. See __reg_combine_64_into_32().

For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
counterpart. For example s32_min_value = (s32)reg->smin_value. For
tnum use only the lower 32bits via, (>>(<<var_off)). See
__reg_combine_64_into_32().

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
2020-03-30 14:59:53 -07:00
John Fastabend
100605035e bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectly
do_refine_retval_range() is called to refine return values from specified
helpers, probe_read_str and get_stack at the moment, the reasoning is
because both have a max value as part of their input arguments and
because the helper ensure the return value will not be larger than this
we can set smax values of the return register, r0.

However, the return value is a signed integer so setting umax is incorrect
It leads to further confusion when the do_refine_retval_range() then calls,
__reg_deduce_bounds() which will see a umax value as meaning the value is
unsigned and then assuming it is unsigned set the smin = umin which in this
case results in 'smin = 0' and an 'smax = X' where X is the input argument
from the helper call.

Here are the comments from _reg_deduce_bounds() on why this would be safe
to do.

 /* Learn sign from unsigned bounds.  Signed bounds cross the sign
  * boundary, so we must be careful.
  */
 if ((s64)reg->umax_value >= 0) {
	/* Positive.  We can't learn anything from the smin, but smax
	 * is positive, hence safe.
	 */
	reg->smin_value = reg->umin_value;
	reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value,
						  reg->umax_value);

But now we incorrectly have a return value with type int with the
signed bounds (0,X). Suppose the return value is negative, which is
possible the we have the verifier and reality out of sync. Among other
things this may result in any error handling code being falsely detected
as dead-code and removed. For instance the example below shows using
bpf_probe_read_str() causes the error path to be identified as dead
code and removed.

>From the 'llvm-object -S' dump,

 r2 = 100
 call 45
 if r0 s< 0 goto +4
 r4 = *(u32 *)(r7 + 0)

But from dump xlate

  (b7) r2 = 100
  (85) call bpf_probe_read_compat_str#-96768
  (61) r4 = *(u32 *)(r7 +0)  <-- dropped if goto

Due to verifier state after call being

 R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f))

To fix omit setting the umax value because its not safe. The only
actual bounds we know is the smax. This results in the correct bounds
(SMIN, X) where X is the max length from the helper. After this the
new verifier state looks like the following after call 45.

R0=inv(id=0,smax_value=100)

Then xlated version no longer removed dead code giving the expected
result,

  (b7) r2 = 100
  (85) call bpf_probe_read_compat_str#-96768
  (c5) if r0 s< 0x0 goto pc+4
  (61) r4 = *(u32 *)(r7 +0)

Note, bpf_probe_read_* calls are root only so we wont hit this case
with non-root bpf users.

v3: comment had some documentation about meta set to null case which
is not relevant here and confusing to include in the comment.

v2 note: In original version we set msize_smax_value from check_func_arg()
and propagated this into smax of retval. The logic was smax is the bound
on the retval we set and because the type in the helper is ARG_CONST_SIZE
we know that the reg is a positive tnum_const() so umax=smax. Alexei
pointed out though this is a bit odd to read because the register in
check_func_arg() has a C type of u32 and the umax bound would be the
normally relavent bound here. Pulling in extra knowledge about future
checks makes reading the code a bit tricky. Further having a signed
meta data that can only ever be positive is also a bit odd. So dropped
the msize_smax_value metadata and made it a u64 msize_max_value to
indicate its unsigned. And additionally save bound from umax value in
check_arg_funcs which is the same as smax due to as noted above tnumx_cont
and negative check but reads better. By my analysis nothing functionally
changes in v2 but it does get easier to read so that is win.

Fixes: 849fa50662 ("bpf/verifier: refine retval R0 state for bpf_get_stack helper")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158560417900.10843.14351995140624628941.stgit@john-Precision-5820-Tower
2020-03-30 14:44:15 -07:00
KP Singh
f50b49a0bf bpf: btf: Fix arg verification in btf_ctx_access()
The bounds checking for the arguments accessed in the BPF program breaks
when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or
BPF_MODIFY_RETURN resulting in no check being done for the default case
(the programs which do not receive the return value of the attached
function in its arguments) when the index of the argument being accessed
is equal to the number of arguments (nr_args).

This was a result of a misplaced "else if" block  introduced by the
Commit 6ba43b761c ("bpf: Attachment verification for
BPF_MODIFY_RETURN")

Fixes: 6ba43b761c ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200330144246.338-1-kpsingh@chromium.org
2020-03-30 13:28:02 -07:00
Linus Torvalds
78b0dedd52 updates for seccomp
- allow TSYNC and USER_NOTIF together (Tycho Andersen)
 - Add missing compat_ioctl for notify (Sven Schnelle)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl6BcbwWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJuhfD/9NjC6h+l9YNHW2O3bYPxDDjkJq
 1aRInf+/UayTnfwhlZiJj2FFYyPR1gfZXB9CPcniYO/t6tsCdc+0kQdc3uUCPb0y
 ClPp5Pc/u/SwEFgrj5gv/NsEwAVaTPy1ioagefZMENQXj77XcfifF5Mrave+lR3K
 TiZsFItucIRTiEb8YY4xF/t5rn/lBvAqDiYNZwYYVcopnW3kgvOljz6ZRyOstV/B
 J9QrErFfDH9SzPfK/1bZ5GbCUsTRzbGXA281UBhZdkJQaA3yoqK+yv/xKtoaX0WK
 uxLPt2BG3qb21+8JZacJ2L6KQAwm5EdT+OyLyFzUYki23LsJNHEb+UpkoRnyJg5H
 sSSZRj14WH5aK1REGTDLr5tgx5lxkXx/iuxYc4tuM56ToWS4hXNiQFU2cUcdqjSO
 bSKVg1LO9FfTTMecYXUqljoOwAKMVra2nDNCpvkBr/1JMVFZjCfWpjy3ZvHHwpqt
 BpxgfJW250HfnMWpa7k5p6bIP+WMwetwP1yGZx6xNz8j3ZSshIPUqCvTU6zD89CN
 RXHMfnZOxNtq1biI41Ppc5/kCt2t4598BaGsWIWcjhhY8p5Ttq+HGs3tOPsuUXen
 ccGAa/1Co0u5CCxudG4nZ2a/ooeijMx7D5HfvoYvHDQbugR/x4aSZuiw7JiTKBHr
 EZCFZyxvFVqnqlzQ2Q==
 =Ejyd
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:
 "A couple of seccomp updates. They're both mostly bug fixes that I
  wanted to have sit in linux-next for a while:

   - allow TSYNC and USER_NOTIF together (Tycho Andersen)

   - add missing compat_ioctl for notify (Sven Schnelle)"

* tag 'seccomp-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  seccomp: Add missing compat_ioctl for notify
  seccomp: allow TSYNC and USER_NOTIF together
2020-03-30 12:53:56 -07:00
Linus Torvalds
e59cd88028 for-5.7/io_uring-2020-03-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJEMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpie7D/9gN4zhykYDfcgamfxMtTbpla2PdTnWoJxP
 fjy/Nx2FySakmccaiCGQSQ1rzD1L67UQkJgEH6hPTomJvA4FaOmJ+ZSaExMy55LH
 ZT+nD3zQ9SCuA0DEpfxbsCP1tbnoXSMQNt8Tyh0x8PAoxp5bI0eRczOju1QWLWTS
 tjBEMZNipN6krrV9RPWT0S5Z31/yGr/sXprCSHFV9Ypzwrx58Tj2i6F9gR7FVbLs
 nV2/O8taEn0sMQIz8TVHKol/TBalluGrC4M/bOeS3faP3BPN4TT24Gtc0LAKEibk
 F49/SX7FzwhOdl43Bdkbe2bbL86p+zOLSf0IMBwMm0DJl4aiOljRUYTSYRolgGgm
 Ebw9QhemTwbxxeD2nEriA4EAeYvTx69RDlN2eVilwwfJ48Xz9fVm3GNYG7LISeON
 k3/TyZOBQH2SZ2Hc3oF2Mq9j1UPHXZHUUsUNlNcN+aM9SFHcWkRi6xZWemTJHJZ4
 zFss5RZHo0+RLBa8rrx8xaO8iWrc73+FuRhr9eSsmyPIj+OZ4ezEFRRRHwtk2fgv
 dZvD413AyCI1c+3LlBusESMsrtXyY8p9O9buNTzHy3ZUtHe0ERmYV2m/a83A5pXo
 Kia/5aJbPIC61bAkCCkiVo+W9OASJ6o5+3CXl5sM9lGTbDXjcofzewmd+RHPestx
 xVbzeR9UIw==
 =bYLJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Here are the io_uring changes for this merge window. Light on new
  features this time around (just splice + buffer selection), lots of
  cleanups, fixes, and improvements to existing support. In particular,
  this contains:

   - Cleanup fixed file update handling for stack fallback (Hillf)

   - Re-work of how pollable async IO is handled, we no longer require
     thread offload to handle that. Instead we rely using poll to drive
     this, with task_work execution.

   - In conjunction with the above, allow expendable buffer selection,
     so that poll+recv (for example) no longer has to be a split
     operation.

   - Make sure we honor RLIMIT_FSIZE for buffered writes

   - Add support for splice (Pavel)

   - Linked work inheritance fixes and optimizations (Pavel)

   - Async work fixes and cleanups (Pavel)

   - Improve io-wq locking (Pavel)

   - Hashed link write improvements (Pavel)

   - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"

* tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
  io_uring: cleanup io_alloc_async_ctx()
  io_uring: fix missing 'return' in comment
  io-wq: handle hashed writes in chains
  io-uring: drop 'free_pfile' in struct io_file_put
  io-uring: drop completion when removing file
  io_uring: Fix ->data corruption on re-enqueue
  io-wq: close cancel gap for hashed linked work
  io_uring: make spdxcheck.py happy
  io_uring: honor original task RLIMIT_FSIZE
  io-wq: hash dependent work
  io-wq: split hashing and enqueueing
  io-wq: don't resched if there is no work
  io-wq: remove duplicated cancel code
  io_uring: fix truncated async read/readv and write/writev retry
  io_uring: dual license io_uring.h uapi header
  io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
  io_uring: Fix unused function warnings
  io_uring: add end-of-bits marker and build time verify it
  io_uring: provide means of removing buffers
  io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
  ...
2020-03-30 12:18:49 -07:00
Jann Horn
0fc31b10cf bpf: Simplify reg_set_min_max_inv handling
reg_set_min_max_inv() contains exactly the same logic as reg_set_min_max(),
just flipped around. While this makes sense in a cBPF verifier (where ALU
operations are not symmetric), it does not make sense for eBPF.

Replace reg_set_min_max_inv() with a helper that flips the opcode around,
then lets reg_set_min_max() do the complicated work.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200330160324.15259-4-daniel@iogearbox.net
2020-03-30 11:53:52 -07:00
Jann Horn
604dca5e3a bpf: Fix tnum constraints for 32-bit comparisons
The BPF verifier tried to track values based on 32-bit comparisons by
(ab)using the tnum state via 581738a681 ("bpf: Provide better register
bounds after jmp32 instructions"). The idea is that after a check like
this:

    if ((u32)r0 > 3)
      exit

We can't meaningfully constrain the arithmetic-range-based tracking, but
we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003).
However, the implementation from 581738a681 didn't compute the tnum
constraint based on the fixed operand, but instead derives it from the
arithmetic-range-based tracking. This means that after the following
sequence of operations:

    if (r0 >= 0x1'0000'0001)
      exit
    if ((u32)r0 > 7)
      exit

The verifier assumed that the lower half of r0 is in the range (0, 0)
and apply the tnum constraint (value=0,mask=0xffff'ffff'0000'0000) thus
causing the overall tnum to be (value=0,mask=0x1'0000'0000), which was
incorrect. Provide a fixed implementation.

Fixes: 581738a681 ("bpf: Provide better register bounds after jmp32 instructions")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200330160324.15259-3-daniel@iogearbox.net
2020-03-30 11:53:52 -07:00
Daniel Borkmann
f2d67fec0b bpf: Undo incorrect __reg_bound_offset32 handling
Anatoly has been fuzzing with kBdysch harness and reported a hang in
one of the outcomes:

  0: (b7) r0 = 808464432
  1: (7f) r0 >>= r0
  2: (14) w0 -= 808464432
  3: (07) r0 += 808464432
  4: (b7) r1 = 808464432
  5: (de) if w1 s<= w0 goto pc+0
   R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
  6: (07) r0 += -2144337872
  7: (14) w0 -= -1607454672
  8: (25) if r0 > 0x30303030 goto pc+0
   R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
  9: (76) if w0 s>= 0x303030 goto pc+2
  12: (95) exit

  from 8 to 9: safe

  from 5 to 6: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0
  6: (07) r0 += -2144337872
  7: (14) w0 -= -1607454672
  8: (25) if r0 > 0x30303030 goto pc+0
   R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0
  9: safe

  from 8 to 9: safe
  verification time 589 usec
  stack depth 0
  processed 17 insns (limit 1000000) [...]

The underlying program was xlated as follows:

  # bpftool p d x i 9
   0: (b7) r0 = 808464432
   1: (7f) r0 >>= r0
   2: (14) w0 -= 808464432
   3: (07) r0 += 808464432
   4: (b7) r1 = 808464432
   5: (de) if w1 s<= w0 goto pc+0
   6: (07) r0 += -2144337872
   7: (14) w0 -= -1607454672
   8: (25) if r0 > 0x30303030 goto pc+0
   9: (76) if w0 s>= 0x303030 goto pc+2
  10: (05) goto pc-1
  11: (05) goto pc-1
  12: (95) exit

The verifier rewrote original instructions it recognized as dead code with
'goto pc-1', but reality differs from verifier simulation in that we're
actually able to trigger a hang due to hitting the 'goto pc-1' instructions.

Taking different examples to make the issue more obvious: in this example
we're probing bounds on a completely unknown scalar variable in r1:

  [...]
  5: R0_w=inv1 R1_w=inv(id=0) R10=fp0
  5: (18) r2 = 0x4000000000
  7: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R10=fp0
  7: (18) r3 = 0x2000000000
  9: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R10=fp0
  9: (18) r4 = 0x400
  11: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R10=fp0
  11: (18) r5 = 0x200
  13: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
  13: (2d) if r1 > r2 goto pc+4
   R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
  14: R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
  14: (ad) if r1 < r3 goto pc+3
   R0_w=inv1 R1_w=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0
  15: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
  15: (2e) if w1 > w4 goto pc+2
   R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
  16: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
  16: (ae) if w1 < w5 goto pc+1
   R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0
  [...]

We're first probing lower/upper bounds via jmp64, later we do a similar
check via jmp32 and examine the resulting var_off there. After fall-through
in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked
bits in the variable section.

Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000:

  max: 0b100000000000000000000000000000000000000 / 0x4000000000
  var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
  min: 0b010000000000000000000000000000000000000 / 0x2000000000

Now, in insn 15 and 16, we perform a similar probe with lower/upper bounds
in jmp32.

Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                    w1 <= 0x400        and w1 >= 0x200:

  max: 0b100000000000000000000000000000000000000 / 0x4000000000
  var: 0b111111100000000000000000000000000000000 / 0x7f00000000
  min: 0b010000000000000000000000000000000000000 / 0x2000000000

The lower/upper bounds haven't changed since they have high bits set in
u64 space and the jmp32 tests can only refine bounds in the low bits.

However, for the var part the expectation would have been 0x7f000007ff
or something less precise up to 0x7fffffffff. A outcome of 0x7f00000000
is not correct since it would contradict the earlier probed bounds
where we know that the result should have been in [0x200,0x400] in u32
space. Therefore, tests with such info will lead to wrong verifier
assumptions later on like falsely predicting conditional jumps to be
always taken, etc.

The issue here is that __reg_bound_offset32()'s implementation from
commit 581738a681 ("bpf: Provide better register bounds after jmp32
instructions") makes an incorrect range assumption:

  static void __reg_bound_offset32(struct bpf_reg_state *reg)
  {
        u64 mask = 0xffffFFFF;
        struct tnum range = tnum_range(reg->umin_value & mask,
                                       reg->umax_value & mask);
        struct tnum lo32 = tnum_cast(reg->var_off, 4);
        struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);

        reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
  }

In the above walk-through example, __reg_bound_offset32() as-is chose
a range after masking with 0xffffffff of [0x0,0x0] since umin:0x2000000000
and umax:0x4000000000 and therefore the lo32 part was clamped to 0x0 as
well. However, in the umin:0x2000000000 and umax:0x4000000000 range above
we'd end up with an actual possible interval of [0x0,0xffffffff] for u32
space instead.

In case of the original reproducer, the situation looked as follows at
insn 5 for r0:

  [...]
  5: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
                               0x30303030           0x13030302f
  5: (de) if w1 s<= w0 goto pc+0
   R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020; 0x10000001f)) R1_w=invP808464432 R10=fp0
                             0x30303030           0x13030302f
  [...]

After the fall-through, we similarly forced the var_off result into
the wrong range [0x30303030,0x3030302f] suggesting later on that fixed
bits must only be of 0x30303020 with 0x10000001f unknowns whereas such
assumption can only be made when both bounds in hi32 range match.

Originally, I was thinking to fix this by moving reg into a temp reg and
use proper coerce_reg_to_size() helper on the temp reg where we can then
based on that define the range tnum for later intersection:

  static void __reg_bound_offset32(struct bpf_reg_state *reg)
  {
        struct bpf_reg_state tmp = *reg;
        struct tnum lo32, hi32, range;

        coerce_reg_to_size(&tmp, 4);
        range = tnum_range(tmp.umin_value, tmp.umax_value);
        lo32 = tnum_cast(reg->var_off, 4);
        hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32);
        reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range));
  }

In the case of the concrete example, this gives us a more conservative unknown
section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and
                             w1 <= 0x400        and w1 >= 0x200:

  max: 0b100000000000000000000000000000000000000 / 0x4000000000
  var: 0b111111111111111111111111111111111111111 / 0x7fffffffff
  min: 0b010000000000000000000000000000000000000 / 0x2000000000

However, above new __reg_bound_offset32() has no effect on refining the
knowledge of the register contents. Meaning, if the bounds in hi32 range
mismatch we'll get the identity function given the range reg spans
[0x0,0xffffffff] and we cast var_off into lo32 only to later on binary
or it again with the hi32.

Likewise, if the bounds in hi32 range match, then we mask both bounds
with 0xffffffff, use the resulting umin/umax for the range to later
intersect the lo32 with it. However, _prior_ called __reg_bound_offset()
did already such intersection on the full reg and we therefore would only
repeat the same operation on the lo32 part twice.

Given this has no effect and the original commit had false assumptions,
this patch reverts the code entirely which is also more straight forward
for stable trees: apparently 581738a681 got auto-selected by Sasha's
ML system and misclassified as a fix, so it got sucked into v5.4 where
it should never have landed. A revert is low-risk also from a user PoV
since it requires a recent kernel and llc to opt-into -mcpu=v3 BPF CPU
to generate jmp32 instructions. A proper bounds refinement would need a
significantly more complex approach which is currently being worked, but
no stable material [0]. Hence revert is best option for stable. After the
revert, the original reported program gets rejected as follows:

  1: (7f) r0 >>= r0
  2: (14) w0 -= 808464432
  3: (07) r0 += 808464432
  4: (b7) r1 = 808464432
  5: (de) if w1 s<= w0 goto pc+0
   R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0
  6: (07) r0 += -2144337872
  7: (14) w0 -= -1607454672
  8: (25) if r0 > 0x30303030 goto pc+0
   R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x3fffffff)) R1_w=invP808464432 R10=fp0
  9: (76) if w0 s>= 0x303030 goto pc+2
   R0=invP(id=0,umax_value=3158063,var_off=(0x0; 0x3fffff)) R1=invP808464432 R10=fp0
  10: (30) r0 = *(u8 *)skb[808464432]
  BPF_LD_[ABS|IND] uses reserved fields
  processed 11 insns (limit 1000000) [...]

  [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/T/

Fixes: 581738a681 ("bpf: Provide better register bounds after jmp32 instructions")
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200330160324.15259-2-daniel@iogearbox.net
2020-03-30 11:53:52 -07:00
Rafael J. Wysocki
ada0629bd3 Merge branches 'pm-core', 'pm-sleep', 'pm-acpi' and 'pm-domains'
* pm-core:
  PM: runtime: Add pm_runtime_get_if_active()

* pm-sleep:
  PM: sleep: wakeup: Skip wakeup_source_sysfs_remove() if device is not there
  PM / hibernate: Remove unnecessary compat ioctl overrides
  PM: hibernate: fix docs for ioctls that return loff_t via pointer
  PM: sleep: wakeup: Use built-in RCU list checking
  PM: sleep: core: Use built-in RCU list checking

* pm-acpi:
  ACPI: PM: s2idle: Refine active GPEs check
  ACPICA: Allow acpi_any_gpe_status_set() to skip one GPE
  ACPI: PM: s2idle: Fix comment in acpi_s2idle_prepare_late()

* pm-domains:
  cpuidle: psci: Split psci_dt_cpu_init_idle()
  PM / Domains: Allow no domain-idle-states DT property in genpd when parsing
2020-03-30 14:46:58 +02:00
Rafael J. Wysocki
8f1073ed8c Merge branch 'pm-qos'
* pm-qos: (30 commits)
  PM: QoS: annotate data races in pm_qos_*_value()
  Documentation: power: fix pm_qos_interface.rst format warning
  PM: QoS: Make CPU latency QoS depend on CONFIG_CPU_IDLE
  Documentation: PM: QoS: Update to reflect previous code changes
  PM: QoS: Update file information comments
  PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY and rename related functions
  sound: Call cpu_latency_qos_*() instead of pm_qos_*()
  drivers: usb: Call cpu_latency_qos_*() instead of pm_qos_*()
  drivers: tty: Call cpu_latency_qos_*() instead of pm_qos_*()
  drivers: spi: Call cpu_latency_qos_*() instead of pm_qos_*()
  drivers: net: Call cpu_latency_qos_*() instead of pm_qos_*()
  drivers: mmc: Call cpu_latency_qos_*() instead of pm_qos_*()
  drivers: media: Call cpu_latency_qos_*() instead of pm_qos_*()
  drivers: hsi: Call cpu_latency_qos_*() instead of pm_qos_*()
  drm: i915: Call cpu_latency_qos_*() instead of pm_qos_*()
  x86: platform: iosf_mbi: Call cpu_latency_qos_*() instead of pm_qos_*()
  cpuidle: Call cpu_latency_qos_limit() instead of pm_qos_request()
  PM: QoS: Add CPU latency QoS API wrappers
  PM: QoS: Adjust pm_qos_request() signature and reorder pm_qos.h
  PM: QoS: Simplify definitions of CPU latency QoS trace events
  ...
2020-03-30 14:45:57 +02:00
David S. Miller
f0b5989745 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor comment conflict in mac80211.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 21:25:29 -07:00
Sven Schnelle
3db81afd99 seccomp: Add missing compat_ioctl for notify
Executing the seccomp_bpf testsuite under a 64-bit kernel with 32-bit
userland (both s390 and x86) doesn't work because there's no compat_ioctl
handler defined. Add the handler.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Fixes: 6a21cc50f0 ("seccomp: add a return code to trap to userspace")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200310123332.42255-1-svens@linux.ibm.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-03-29 21:10:51 -07:00
KP Singh
9e4e01dfd3 bpf: lsm: Implement attach, detach and execution
JITed BPF programs are dynamically attached to the LSM hooks
using BPF trampolines. The trampoline prologue generates code to handle
conversion of the signature of the hook to the appropriate BPF context.

The allocated trampoline programs are attached to the nop functions
initialized as LSM hooks.

BPF_PROG_TYPE_LSM programs must have a GPL compatible license and
and need CAP_SYS_ADMIN (required for loading eBPF programs).

Upon attachment:

* A BPF fexit trampoline is used for LSM hooks with a void return type.
* A BPF fmod_ret trampoline is used for LSM hooks which return an
  int. The attached programs can override the return value of the
  bpf LSM hook to indicate a MAC Policy decision.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Florent Revest <revest@google.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org
2020-03-30 01:34:00 +02:00
KP Singh
9d3fdea789 bpf: lsm: Provide attachment points for BPF LSM programs
When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are
generated for each LSM hook. These functions are initialized as LSM
hooks in a subsequent patch.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Florent Revest <revest@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-4-kpsingh@chromium.org
2020-03-30 01:34:00 +02:00
KP Singh
fc611f47f2 bpf: Introduce BPF_PROG_TYPE_LSM
Introduce types and configs for bpf programs that can be attached to
LSM hooks. The programs can be enabled by the config option
CONFIG_BPF_LSM.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Florent Revest <revest@google.com>
Reviewed-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
2020-03-30 01:34:00 +02:00
Linus Torvalds
570203ec83 Merge branch 'akpm' (patches from Andrew)
Merge vm fixes from Andrew Morton:
 "5 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/sparse: fix kernel crash with pfn_section_valid check
  mm: fork: fix kernel_stack memcg stats for various stack implementations
  hugetlb_cgroup: fix illegal access to memory
  drivers/base/memory.c: indicate all memory blocks as removable
  mm/swapfile.c: move inode_lock out of claim_swapfile
2020-03-29 10:40:31 -07:00
Linus Torvalds
01af08bd24 A single bugfix to prevent reference leaks in irq affinity notifiers.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl6AxL0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoeYsD/9HDNL8ckEGtPN8LRMSH7BLOAQcKaDk
 c1zoQoa5lW94xT/aqr++IAoaWEr+dJFGWwha7A2q37iiUgzyCAAP5JXseOi5n8Mw
 xzrj/jVwY7sNxNwNtdbJqjTdrijmfmxku9eddhll3EncNP96gPbeGhv+VLZuJsIB
 EkllnDTop6jnCJU6Qxgjp9P5YBQycx7O7r5pCKrsUB9iu68tfRtLZEStzYD3RRam
 mJffSZUrdGwge6GPqSRuS3sRx584O+q8II5pP2VIG6mCz4EO5NHslSph82xwGT00
 nP9aIK9urfwzXGEhwHME1VxnRJ3Ln3AkplbMnv6bs01j9ckupvW9Zut2hSe0AZ3d
 LstlERXdRhLUIlsia4vFvkoIYYo0SNlZ3ZMLZIJ8fcnzrKtHbEJGD5dXaTSLgIaB
 E2N1/FJq9C7Ur0KYas+jQsrhQpJqhJrLg72Kj06DeB1rD9xj6lSJqxWXSvS0J0ow
 sk2Xr7a096rOqjnOPstgSnqVNMR+133L9uIIdXIzyBRaExjcllTznFjLUX3ZTVJu
 9Fk+D2ZTh8aMqWNlCZmarz+6Qs2ok3KB/mMgae8qLULKQfEop2QPVi1trjrA8g/C
 Zn+txKlVGgKm0DcOfq8tN4jHr8G2PKyX+GxcF3ElThnFBgkusz+9MBeS8CqVFJvT
 Xnt483DaCYaEHg==
 =s1UQ
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Thomas Gleixner:
 "A single bugfix to prevent reference leaks in irq affinity notifiers"

* tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix reference leaks on irq affinity notifiers
2020-03-29 10:07:00 -07:00
Roman Gushchin
8380ce4790 mm: fork: fix kernel_stack memcg stats for various stack implementations
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().

In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg .  In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.

It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).

In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state().  It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .

Note: This is a special version of the patch created for stable
backports.  It contains code from the following two patches:
  - mm: memcg/slab: introduce mem_cgroup_from_obj()
  - mm: fork: fix kernel_stack memcg stats for various stack implementations

[guro@fb.com: introduce mem_cgroup_from_obj()]
  Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-29 09:47:05 -07:00
Thomas Gleixner
cf226c42b2 Merge branch 'uaccess.futex' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into locking/core
Pull uaccess futex cleanups for Al Viro:

     Consolidate access_ok() usage and the futex uaccess function zoo.
2020-03-28 11:59:24 +01:00
Thomas Gleixner
e98eac6ff1 cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus()
A recent change to freeze_secondary_cpus() which added an early abort if a
wakeup is pending missed the fact that the function is also invoked for
shutdown, reboot and kexec via disable_nonboot_cpus().

In case of disable_nonboot_cpus() the wakeup event needs to be ignored as
the purpose is to terminate the currently running kernel.

Add a 'suspend' argument which is only set when the freeze is in context of
a suspend operation. If not set then an eventually pending wakeup event is
ignored.

Fixes: a66d955e91 ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending")
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/874kuaxdiz.fsf@nanos.tec.linutronix.de
2020-03-28 11:42:55 +01:00
Al Viro
a08971e948 futex: arch_futex_atomic_op_inuser() calling conventions change
Move access_ok() in and pagefault_enable()/pagefault_disable() out.
Mechanical conversion only - some instances don't really need
a separate access_ok() at all (e.g. the ones only using
get_user()/put_user(), or architectures where access_ok()
is always true); we'll deal with that in followups.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-27 23:58:51 -04:00
Daniel Borkmann
0f09abd105 bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
recvmsg() and bind-related hooks in order to retrieve the cgroup v2
context which can then be used as part of the key for BPF map lookups,
for example. Given these hooks operate in process context 'current' is
always valid and pointing to the app that is performing mentioned
syscalls if it's subject to a v2 cgroup. Also with same motivation of
commit 7723628101 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
enable retrieval of ancestor from current so the cgroup id can be used
for policy lookups which can then forbid connect() / bind(), for example.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:39 -07:00
Daniel Borkmann
f318903c0b bpf: Add netns cookie and enable it for bpf cgroup hooks
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.

  (*) Both externalTrafficPolicy={Local|Cluster} types
  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
David S. Miller
a0ba26f37e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-03-27

The following pull-request contains BPF updates for your *net* tree.

We've added 3 non-merge commits during the last 4 day(s) which contain
a total of 4 files changed, 25 insertions(+), 20 deletions(-).

The main changes are:

1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid
   having to rely on compiler to do so. Issues have been noticed on
   some compilers with padding and other oddities where the request was
   then unexpectedly rejected, from Greg Kroah-Hartman.

2) Sanitize the bpf_struct_ops TCP congestion control name in order to
   avoid problematic characters such as whitespaces, from Martin KaFai Lau.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 16:18:51 -07:00
Steven Rostedt (VMware)
2768362603 tracing: Create set_event_notrace_pid to not trace tasks
There's currently a way to select a task that should only have its events
traced, but there's no way to select a task not to have itsevents traced.
Add a set_event_notrace_pid file that acts the same as set_event_pid (and is
also affected by event-fork), but the task pids in this file will not be
traced even if they are listed in the set_event_pid file. This makes it easy
for tools like trace-cmd to "hide" itself from beint traced by events when
it is recording other tasks.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:39:02 -04:00
Steven Rostedt (VMware)
b3b1e6eded ftrace: Create set_ftrace_notrace_pid to not trace tasks
There's currently a way to select a task that should only be traced by
functions, but there's no way to select a task not to be traced by the
function tracer. Add a set_ftrace_notrace_pid file that acts the same as
set_ftrace_pid (and is also affected by function-fork), but the task pids in
this file will not be traced even if they are listed in the set_ftrace_pid
file. This makes it easy for tools like trace-cmd to "hide" itself from the
function tracer when it is recording other tasks.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:39:02 -04:00
Steven Rostedt (VMware)
717e3f5ebc ftrace: Make function trace pid filtering a bit more exact
The set_ftrace_pid file is used to filter function tracing to only trace
tasks that are listed in that file. Instead of testing the pids listed in
that file (it's a bitmask) at each function trace event, the logic is done
via a sched_switch hook. A flag is set when the next task to run is in the
list of pids in the set_ftrace_pid file. But the sched_switch hook is not at
the exact location of when the task switches, and the flag gets set before
the task to be traced actually runs. This leaves a residue of traced
functions that do not belong to the pid that should be filtered on.

By changing the logic slightly, where instead of having  a boolean flag to
test, record the pid that should be traced, with special values for not to
trace and always trace. Then at each function call, a check will be made to
see if the function should be ignored, or if the current pid matches the
function that should be traced, and only trace if it matches (or if it has
the special value to always trace).

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:39:02 -04:00
Masami Hiramatsu
6a13a0d7b4 ftrace/kprobe: Show the maxactive number on kprobe_events
Show maxactive parameter on kprobe_events.
This allows user to save the current configuration and
restore it without losing maxactive parameter.

Link: http://lkml.kernel.org/r/4762764a-6df7-bc93-ed60-e336146dce1f@gmail.com
Link: http://lkml.kernel.org/r/158503528846.22706.5549974121212526020.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: 696ced4fb1 ("tracing/kprobes: expose maxactive for kretprobe in kprobe_events")
Reported-by: Taeung Song <treeze.taeung@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:39:02 -04:00
Steven Rostedt (VMware)
c9b7a4a72f ring-buffer/tracing: Have iterator acknowledge dropped events
Have the ring_buffer_iterator set a flag if events were dropped as it were
to go and peek at the next event. Have the trace file display this fact if
it happened with a "LOST EVENTS" message.

Link: http://lkml.kernel.org/r/20200317213417.045858900@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:39:01 -04:00
Steven Rostedt (VMware)
06e0a548ba tracing: Do not disable tracing when reading the trace file
When opening the "trace" file, it is no longer necessary to disable tracing.

Note, a new option is created called "pause-on-trace", when set, will cause
the trace file to emulate its original behavior.

Link: http://lkml.kernel.org/r/20200317213416.903351225@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:39:01 -04:00
Steven Rostedt (VMware)
1039221cc2 ring-buffer: Do not disable recording when there is an iterator
Now that the iterator can handle a concurrent writer, do not disable writing
to the ring buffer when there is an iterator present.

Link: http://lkml.kernel.org/r/20200317213416.759770696@goodmis.org

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:38:58 -04:00
Steven Rostedt (VMware)
07b8b10ec9 ring-buffer: Make resize disable per cpu buffer instead of total buffer
When the ring buffer becomes writable for even when the trace file is read,
it must still not be resized. But since tracers can be activated while the
trace file is being read, the irqsoff tracer can modify the per CPU buffers,
and this can cause the reader of the trace file to update the wrong buffer's
resize disable bit, as the irqsoff tracer swaps out cpu buffers.

By making the resize disable per cpu_buffer, it makes the update follow the
per cpu_buffer even if it's swapped out with the snapshot buffer and keeps
the release of the trace file modifying the same data as the open did.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-03-27 16:21:22 -04:00
Namhyung Kim
6546b19f95 perf/core: Add PERF_SAMPLE_CGROUP feature
The PERF_SAMPLE_CGROUP bit is to save (perf_event) cgroup information in
the sample.  It will add a 64-bit id to identify current cgroup and it's
the file handle in the cgroup file system.  Userspace should use this
information with PERF_RECORD_CGROUP event to match which cgroup it
belongs.

I put it before PERF_SAMPLE_AUX for simplicity since it just needs a
64-bit word.  But if we want bigger samples, I can work on that
direction too.

Committer testing:

  $ pahole perf_sample_data | grep -w cgroup -B5 -A5
  	/* --- cacheline 4 boundary (256 bytes) was 56 bytes ago --- */
  	struct perf_regs           regs_intr;            /*   312    16 */
  	/* --- cacheline 5 boundary (320 bytes) was 8 bytes ago --- */
  	u64                        stack_user_size;      /*   328     8 */
  	u64                        phys_addr;            /*   336     8 */
  	u64                        cgroup;               /*   344     8 */

  	/* size: 384, cachelines: 6, members: 22 */
  	/* padding: 32 */
  };
  $

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lore.kernel.org/lkml/20200325124536.2800725-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-03-27 10:41:44 -03:00
Namhyung Kim
96aaab6865 perf/core: Add PERF_RECORD_CGROUP event
To support cgroup tracking, add CGROUP event to save a link between
cgroup path and id number.  This is needed since cgroups can go away
when userspace tries to read the cgroup info (from the id) later.

The attr.cgroup bit was also added to enable cgroup tracking from
userspace.

This event will be generated when a new cgroup becomes active.
Userspace might need to synthesize those events for existing cgroups.

Committer testing:

From the resulting kernel, using /sys/kernel/btf/vmlinux:

  $ pahole perf_event_attr | grep -w cgroup -B5 -A1
  	__u64                      write_backward:1;     /*    40:27  8 */
  	__u64                      namespaces:1;         /*    40:28  8 */
  	__u64                      ksymbol:1;            /*    40:29  8 */
  	__u64                      bpf_event:1;          /*    40:30  8 */
  	__u64                      aux_output:1;         /*    40:31  8 */
  	__u64                      cgroup:1;             /*    40:32  8 */
  	__u64                      __reserved_1:31;      /*    40:33  8 */
  $

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
[staticize perf_event_cgroup function]
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lore.kernel.org/lkml/20200325124536.2800725-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2020-03-27 10:39:11 -03:00
YueHaibing
f54a5bba12 bpf: Remove unused vairable 'bpf_xdp_link_lops'
kernel/bpf/syscall.c:2263:34: warning: 'bpf_xdp_link_lops' defined but not used [-Wunused-const-variable=]
 static const struct bpf_link_ops bpf_xdp_link_lops;
                                  ^~~~~~~~~~~~~~~~~

commit 70ed506c3b ("bpf: Introduce pinnable bpf_link abstraction")
involded this unused variable, remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200326031613.19372-1-yuehaibing@huawei.com
2020-03-26 16:46:32 -07:00
Andrii Nakryiko
e28784e378 bpf: Factor out attach_type to prog_type mapping for attach/detach
Factor out logic mapping expected program attach type to program type and
subsequent handling of program attach/detach. Also list out all supported
cgroup BPF program types explicitly to prevent accidental bugs once more
program types are added to a mapping. Do the same for prog_query API.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200325065746.640559-3-andriin@fb.com
2020-03-26 16:38:13 -07:00
Andrii Nakryiko
00c4eddf7e bpf: Factor out cgroup storages operations
Refactor cgroup attach/detach code to abstract away common operations
performed on all types of cgroup storages. This makes high-level logic more
apparent, plus allows to reuse more code across multiple functions.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200325065746.640559-2-andriin@fb.com
2020-03-26 16:36:58 -07:00
John Fastabend
294f2fc6da bpf: Verifer, adjust_scalar_min_max_vals to always call update_reg_bounds()
Currently, for all op verification we call __red_deduce_bounds() and
__red_bound_offset() but we only call __update_reg_bounds() in bitwise
ops. However, we could benefit from calling __update_reg_bounds() in
BPF_ADD, BPF_SUB, and BPF_MUL cases as well.

For example, a register with state 'R1_w=invP0' when we subtract from
it,

 w1 -= 2

Before coerce we will now have an smin_value=S64_MIN, smax_value=U64_MAX
and unsigned bounds umin_value=0, umax_value=U64_MAX. These will then
be clamped to S32_MIN, U32_MAX values by coerce in the case of alu32 op
as done in above example. However tnum will be a constant because the
ALU op is done on a constant.

Without update_reg_bounds() we have a scenario where tnum is a const
but our unsigned bounds do not reflect this. By calling update_reg_bounds
after coerce to 32bit we further refine the umin_value to U64_MAX in the
alu64 case or U32_MAX in the alu32 case above.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158507151689.15666.566796274289413203.stgit@john-Precision-5820-Tower
2020-03-25 22:51:40 -07:00
John Fastabend
07cd263148 bpf: Verifer, refactor adjust_scalar_min_max_vals
Pull per op ALU logic into individual functions. We are about to add
u32 versions of each of these by pull them out the code gets a bit
more readable here and nicer in the next patch.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158507149518.15666.15672349629329072411.stgit@john-Precision-5820-Tower
2020-03-25 22:51:39 -07:00
David S. Miller
9fb16955fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Overlapping header include additions in macsec.c

A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c

Overlapping test additions in selftests Makefile

Overlapping PCI ID table adjustments in iwlwifi driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 18:58:11 -07:00
Linus Torvalds
1b649e0bca Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix deadlock in bpf_send_signal() from Yonghong Song.

 2) Fix off by one in kTLS offload of mlx5, from Tariq Toukan.

 3) Add missing locking in iwlwifi mvm code, from Avraham Stern.

 4) Fix MSG_WAITALL handling in rxrpc, from David Howells.

 5) Need to hold RTNL mutex in tcindex_partial_destroy_work(), from Cong
    Wang.

 6) Fix producer race condition in AF_PACKET, from Willem de Bruijn.

 7) cls_route removes the wrong filter during change operations, from
    Cong Wang.

 8) Reject unrecognized request flags in ethtool netlink code, from
    Michal Kubecek.

 9) Need to keep MAC in reset until PHY is up in bcmgenet driver, from
    Doug Berger.

10) Don't leak ct zone template in act_ct during replace, from Paul
    Blakey.

11) Fix flushing of offloaded netfilter flowtable flows, also from Paul
    Blakey.

12) Fix throughput drop during tx backpressure in cxgb4, from Rahul
    Lakkireddy.

13) Don't let a non-NULL skb->dev leave the TCP stack, from Eric
    Dumazet.

14) TCP_QUEUE_SEQ socket option has to update tp->copied_seq as well,
    also from Eric Dumazet.

15) Restrict macsec to ethernet devices, from Willem de Bruijn.

16) Fix reference leak in some ethtool *_SET handlers, from Michal
    Kubecek.

17) Fix accidental disabling of MSI for some r8169 chips, from Heiner
    Kallweit.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits)
  net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build
  net: ena: Add PCI shutdown handler to allow safe kexec
  selftests/net/forwarding: define libs as TEST_PROGS_EXTENDED
  selftests/net: add missing tests to Makefile
  r8169: re-enable MSI on RTL8168c
  net: phy: mdio-bcm-unimac: Fix clock handling
  cxgb4/ptp: pass the sign of offset delta in FW CMD
  net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop
  net: cbs: Fix software cbs to consider packet sending time
  net/mlx5e: Do not recover from a non-fatal syndrome
  net/mlx5e: Fix ICOSQ recovery flow with Striding RQ
  net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset
  net/mlx5e: Enhance ICOSQ WQE info fields
  net/mlx5_core: Set IB capability mask1 to fix ib_srpt connection failure
  selftests: netfilter: add nfqueue test case
  netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
  netfilter: nft_fwd_netdev: validate family and chain type
  netfilter: nft_set_rbtree: Detect partial overlaps on insertion
  netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()
  netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion
  ...
2020-03-25 13:58:05 -07:00
Qiujun Huang
e7b3410050 kcsan: Fix a typo in a comment
s/slots slots/slots/

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
[elver: commit message]
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-25 09:56:00 -07:00
Marco Elver
44656d3dc4 kcsan: Add current->state to implicitly atomic accesses
Add volatile current->state to list of implicitly atomic accesses. This
is in preparation to eventually enable KCSAN on kernel/sched (which
currently still has KCSAN_SANITIZE := n).

Since accesses that match the special check in atomic.h are rare, it
makes more sense to move this check to the slow-path, avoiding the
additional compare in the fast-path. With the microbenchmark, a speedup
of ~6% is measured.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-25 09:56:00 -07:00
Marco Elver
2402d0eae5 kcsan: Add option for verbose reporting
Adds CONFIG_KCSAN_VERBOSE to optionally enable more verbose reports.
Currently information about the reporting task's held locks and IRQ
trace events are shown, if they are enabled.

Signed-off-by: Marco Elver <elver@google.com>
Suggested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-25 09:56:00 -07:00
Marco Elver
48b1fc190a kcsan: Add option to allow watcher interruptions
Add option to allow interrupts while a watchpoint is set up. This can be
enabled either via CONFIG_KCSAN_INTERRUPT_WATCHER or via the boot
parameter 'kcsan.interrupt_watcher=1'.

Note that, currently not all safe per-CPU access primitives and patterns
are accounted for, which could result in false positives. For example,
asm-generic/percpu.h uses plain operations, which by default are
instrumented. On interrupts and subsequent accesses to the same
variable, KCSAN would currently report a data race with this option.

Therefore, this option should currently remain disabled by default, but
may be enabled for specific test scenarios.

To avoid new warnings, changes all uses of smp_processor_id() to use the
raw version (as already done in kcsan_found_watchpoint()). The exact SMP
processor id is for informational purposes in the report, and
correctness is not affected.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-25 09:55:59 -07:00
Bernd Edlinger
501f9328bf pidfd: Use new infrastructure to fix deadlocks in execve
This changes __pidfd_fget to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials do not change
before exec_update_mutex is locked.  Therefore whatever
file access is possible with holding the cred_guard_mutex
here is also possbile with the exec_update_mutex.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:01 -05:00
Bernd Edlinger
6914303824 perf: Use new infrastructure to fix deadlocks in execve
This changes perf_event_set_clock to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:01 -05:00
Bernd Edlinger
454e3126cb kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
This changes kcmp_epoll_target to use the new exec_update_mutex
instead of cred_guard_mutex.

This should be safe, as the credentials are only used for reading,
and furthermore ->mm and ->sighand are updated on execve,
but only under the new exec_update_mutex.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:01 -05:00
Bernd Edlinger
aa884c1131 kernel: doc: remove outdated comment cred.c
This removes an outdated comment in prepare_kernel_cred.

There is no "cred_replace_mutex" any more, so the comment must
go away.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:01 -05:00
Bernd Edlinger
3e74fabd39 exec: Fix a deadlock in strace
This fixes a deadlock in the tracer when tracing a multi-threaded
application that calls execve while more than one thread are running.

I observed that when running strace on the gcc test suite, it always
blocks after a while, when expect calls execve, because other threads
have to be terminated.  They send ptrace events, but the strace is no
longer able to respond, since it is blocked in vm_access.

The deadlock is always happening when strace needs to access the
tracees process mmap, while another thread in the tracee starts to
execve a child process, but that cannot continue until the
PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received:

strace          D    0 30614  30584 0x00000000
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
schedule_preempt_disabled+0x15/0x20
__mutex_lock.isra.13+0x1ec/0x520
__mutex_lock_killable_slowpath+0x13/0x20
mutex_lock_killable+0x28/0x30
mm_access+0x27/0xa0
process_vm_rw_core.isra.3+0xff/0x550
process_vm_rw+0xdd/0xf0
__x64_sys_process_vm_readv+0x31/0x40
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

expect          D    0 31933  30876 0x80004003
Call Trace:
__schedule+0x3ce/0x6e0
schedule+0x5c/0xd0
flush_old_exec+0xc4/0x770
load_elf_binary+0x35a/0x16c0
search_binary_handler+0x97/0x1d0
__do_execve_file.isra.40+0x5d4/0x8a0
__x64_sys_execve+0x49/0x60
do_syscall_64+0x64/0x220
entry_SYSCALL_64_after_hwframe+0x44/0xa9

This changes mm_access to use the new exec_update_mutex
instead of cred_guard_mutex.

This patch is based on the following patch by Eric W. Biederman:
"[PATCH 0/5] Infrastructure to allow fixing exec deadlocks"
Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:04:00 -05:00
Eric W. Biederman
eea9673250 exec: Add exec_update_mutex to replace cred_guard_mutex
The cred_guard_mutex is problematic as it is held over possibly
indefinite waits for userspace.  The possible indefinite waits for
userspace that I have identified are: The cred_guard_mutex is held in
PTRACE_EVENT_EXIT waiting for the tracer.  The cred_guard_mutex is
held over "put_user(0, tsk->clear_child_tid)" in exit_mm().  The
cred_guard_mutex is held over "get_user(futex_offset, ...")  in
exit_robust_list.  The cred_guard_mutex held over copy_strings.

The functions get_user and put_user can trigger a page fault which can
potentially wait indefinitely in the case of userfaultfd or if
userspace implements part of the page fault path.

In any of those cases the userspace process that the kernel is waiting
for might make a different system call that winds up taking the
cred_guard_mutex and result in deadlock.

Holding a mutex over any of those possibly indefinite waits for
userspace does not appear necessary.  Add exec_update_mutex that will
just cover updating the process during exec where the permissions and
the objects pointed to by the task struct may be out of sync.

The plan is to switch the users of cred_guard_mutex to
exec_update_mutex one by one.  This lets us move forward while still
being careful and not introducing any regressions.

Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-03-25 10:03:36 -05:00
Qais Yousef
33c3736ec8 cpu/hotplug: Hide cpu_up/down()
Use separate functions for the device core to bring a CPU up and down.

Users outside the device core must use add/remove_cpu() which will take
care of extra housekeeping work like keeping sysfs in sync.

Make cpu_up/down() static and replace the extra layer of indirection.

[ tglx: Removed the extra wrapper functions and adjusted function names ]

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323135110.30522-18-qais.yousef@arm.com
2020-03-25 12:59:38 +01:00
Qais Yousef
b99a26593b cpu/hotplug: Move bringup of secondary CPUs out of smp_init()
This is the last direct user of cpu_up() before it can become an internal
implementation detail of the cpu subsystem.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323135110.30522-17-qais.yousef@arm.com
2020-03-25 12:59:37 +01:00
Qais Yousef
457bc8ed3e torture: Replace cpu_up/down() with add/remove_cpu()
The core device API performs extra housekeeping bits that are missing
from directly calling cpu_up/down().

See commit a6717c01dd ("powerpc/rtas: use device model APIs and
serialization during LPM") for an example description of what might go
wrong.

This also prepares to make cpu_up/down() a private interface of the CPU
subsystem.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: "Paul E. McKenney" <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20200323135110.30522-16-qais.yousef@arm.com
2020-03-25 12:59:37 +01:00
Qais Yousef
d720f98604 cpu/hotplug: Provide bringup_hibernate_cpu()
arm64 uses cpu_up() in the resume from hibernation code to ensure that the
CPU on which the system hibernated is online. Provide a core function for
this.

[ tglx: Split out from the combo arm64 patch ]

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200323135110.30522-9-qais.yousef@arm.com
2020-03-25 12:59:34 +01:00
Qais Yousef
0441a5597c cpu/hotplug: Create a new function to shutdown nonboot cpus
This function will be used later in machine_shutdown() for some
architectures.

disable_nonboot_cpus() is not safe to use when doing machine_down(),
because it relies on freeze_secondary_cpus() which in turn is a
suspend/resume related freeze and could abort if the logic detects any
pending activities that can prevent finishing the offlining process.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200323135110.30522-3-qais.yousef@arm.com
2020-03-25 12:59:31 +01:00
Qais Yousef
93ef1429e5 cpu/hotplug: Add new {add,remove}_cpu() functions
The new functions use device_{online,offline}() which are userspace safe.

This is in preparation to move cpu_{up, down} kernel users to use a safer
interface that is not racy with userspace.

Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20200323135110.30522-2-qais.yousef@arm.com
2020-03-25 12:59:31 +01:00
Masahiro Yamada
d198b34f38 .gitignore: add SPDX License Identifier
Add SPDX License Identifier to all .gitignore files.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25 11:50:48 +01:00
Masahiro Yamada
2985bed680 .gitignore: remove too obvious comments
Some .gitignore files have comments like "Generated files",
"Ignore generated files" at the header part, but they are
too obvious.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25 11:50:28 +01:00
Ingo Molnar
baf5fe7618 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Make kfree_rcu() use kfree_bulk() for added performance
 - RCU updates
 - Callback-overload handling updates
 - Tasks-RCU KCSAN and sparse updates
 - Locking torture test and RCU torture test updates
 - Documentation updates
 - Miscellaneous fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-24 10:10:09 +01:00
Sebastian Siewior
8bf6c677dd completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
The warning was intended to spot complete_all() users from hardirq
context on PREEMPT_RT. The warning as-is will also trigger in interrupt
handlers, which are threaded on PREEMPT_RT, which was not intended.

Use lockdep_assert_RT_in_threaded_ctx() which triggers in non-preemptive
context on PREEMPT_RT.

Fixes: a5c6234e10 ("completion: Use simple wait queues")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200323152019.4qjwluldohuh3by5@linutronix.de
2020-03-23 18:40:25 +01:00
Amir Goldstein
aa93bdc550 fsnotify: use helpers to access data by data_type
Create helpers to access path and inode from different data types.

Link: https://lore.kernel.org/r/20200319151022.31456-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-03-23 18:19:06 +01:00
Heiko Carstens
086b2d7837 PM: remove s390 specific callbacks
ARCH_SAVE_PAGE_KEYS has been introduced in order to be able to save
and restore s390 specific storage keys into a hibernation image.
With hibernation support removed from s390 there is no point in
keeping the callbacks.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2020-03-23 13:41:55 +01:00
Greg Kroah-Hartman
cbf580ff09 Merge 5.6-rc7 into tty-next
We need the tty fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-23 08:02:55 +01:00
Greg Kroah-Hartman
baca54d956 Merge 5.6-rc7 into char-misc-next
We need the char/misc driver fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-23 07:59:38 +01:00
Joerg Roedel
763802b53a x86/mm: split vmalloc_sync_all()
Commit 3f8fd02b1b ("mm/vmalloc: Sync unmappings in
__purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
the vunmap() code-path.  While this change was necessary to maintain
correctness on x86-32-pae kernels, it also adds additional cycles for
architectures that don't need it.

Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
severe performance regressions in micro-benchmarks because it now also
calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
the vmalloc_sync_all() implementation on x86-64 is only needed for newly
created mappings.

To avoid the unnecessary work on x86-64 and to gain the performance
back, split up vmalloc_sync_all() into two functions:

	* vmalloc_sync_mappings(), and
	* vmalloc_sync_unmappings()

Most call-sites to vmalloc_sync_all() only care about new mappings being
synchronized.  The only exception is the new call-site added in the
above mentioned commit.

Shile Zhang directed us to a report of an 80% regression in reaim
throughput.

Fixes: 3f8fd02b1b ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21 18:56:06 -07:00
Paul E. McKenney
aa93ec620b Merge branches 'doc.2020.02.27a', 'fixes.2020.03.21a', 'kfree_rcu.2020.02.20a', 'locktorture.2020.02.20a', 'ovld.2020.02.20a', 'rcu-tasks.2020.02.20a', 'srcu.2020.02.20a' and 'torture.2020.02.20a' into HEAD
doc.2020.02.27a: Documentation updates.
fixes.2020.03.21a: Miscellaneous fixes.
kfree_rcu.2020.02.20a: Updates to kfree_rcu().
locktorture.2020.02.20a: Lock torture-test updates.
ovld.2020.02.20a: Updates to callback-overload handling.
rcu-tasks.2020.02.20a: RCU-tasks updates.
srcu.2020.02.20a: SRCU updates.
torture.2020.02.20a: Torture-test updates.
2020-03-21 17:15:11 -07:00
Paul E. McKenney
127e29815b rcu: Make rcu_barrier() account for offline no-CBs CPUs
Currently, rcu_barrier() ignores offline CPUs,  However, it is possible
for an offline no-CBs CPU to have callbacks queued, and rcu_barrier()
must wait for those callbacks.  This commit therefore makes rcu_barrier()
directly invoke the rcu_barrier_func() with interrupts disabled for such
CPUs.  This requires passing the CPU number into this function so that
it can entrain the rcu_barrier() callback onto the correct CPU's callback
list, given that the code must instead execute on the current CPU.

While in the area, this commit fixes a bug where the first CPU's callback
might have been invoked before rcu_segcblist_entrain() returned, which
would also result in an early wakeup.

Fixes: 5d6742b377 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Apply optimization feedback from Boqun Feng. ]
Cc: <stable@vger.kernel.org> # 5.5.x
2020-03-21 16:14:25 -07:00
Paul E. McKenney
0f11ad323d rcu: Mark rcu_state.gp_seq to detect concurrent writes
The rcu_state structure's gp_seq field is only to be modified by the RCU
grace-period kthread, which is single-threaded.  This commit therefore
enlists KCSAN's help in enforcing this restriction.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-21 16:13:39 -07:00
Edward Cree
df81dfcfd6 genirq: Fix reference leaks on irq affinity notifiers
The handling of notify->work did not properly maintain notify->kref in two
 cases:
1) where the work was already scheduled, another irq_set_affinity_locked()
   would get the ref and (no-op-ly) schedule the work.  Thus when
   irq_affinity_notify() ran, it would drop the original ref but not the
   additional one.
2) when cancelling the (old) work in irq_set_affinity_notifier(), if there
   was outstanding work a ref had been got for it but was never put.
Fix both by checking the return values of the work handling functions
 (schedule_work() for (1) and cancel_work_sync() for (2)) and put the
 extra ref if the return value indicates preexisting work.

Fixes: cd7eab44e9 ("genirq: Add IRQ affinity notifiers")
Fixes: 59c39840f5 ("genirq: Prevent use-after-free and work list corruption")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ben Hutchings <ben@decadent.org.uk>
Link: https://lkml.kernel.org/r/24f5983f-2ab5-e83a-44ee-a45b5f9300f5@solarflare.com
2020-03-21 17:32:46 +01:00
Peter Zijlstra
ef996916e7 lockdep: Rename trace_{hard,soft}{irq_context,irqs_enabled}()
Continue what commit:

  d820ac4c2f ("locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]")

started, rename these to avoid confusing them with tracepoints.

git grep -l "trace_\(soft\|hard\)\(irq_context\|irqs_enabled\)" | while read file;
do
	sed -ie 's/trace_\(soft\|hard\)\(irq_context\|irqs_enabled\)/lockdep_\1\2/g' $file;
done

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200320115859.178626842@infradead.org
2020-03-21 16:03:54 +01:00
Peter Zijlstra
0d38453c85 lockdep: Rename trace_softirqs_{on,off}()
Continue what commit:

  d820ac4c2f ("locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]")

started, rename these to avoid confusing them with tracepoints.

git grep -l "trace_softirqs_\(on\|off\)" | while read file;
do
	sed -ie 's/trace_softirqs_\(on\|off\)/lockdep_softirqs_\1/g' $file;
done

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200320115859.119434738@infradead.org
2020-03-21 16:03:54 +01:00
Thomas Gleixner
2502ec37a7 lockdep: Rename trace_hardirq_{enter,exit}()
Continue what commit:

  d820ac4c2f ("locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]")

started, rename these to avoid confusing them with tracepoints.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200320115859.060481361@infradead.org
2020-03-21 16:03:53 +01:00
Sebastian Andrzej Siewior
d53f2b62fc lockdep: Add posixtimer context tracing bits
Splitting run_posix_cpu_timers() into two parts is work in progress which
is stuck on other entry code related problems. The heavy lifting which
involves locking of sighand lock will be moved into task context so the
necessary execution time is burdened on the task and not on interrupt
context.

Until this work completes lockdep with the spinlock nesting rules enabled
would emit warnings for this known context.

Prevent it by setting "->irq_config = 1" for the invocation of
run_posix_cpu_timers() so lockdep does not complain when sighand lock is
acquried. This will be removed once the split is completed.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.751182723@linutronix.de
2020-03-21 16:00:25 +01:00
Sebastian Andrzej Siewior
49915ac35c lockdep: Annotate irq_work
Mark irq_work items with IRQ_WORK_HARD_IRQ which should be invoked in
hardirq context even on PREEMPT_RT. IRQ_WORK without this flag will be
invoked in softirq context on PREEMPT_RT.

Set ->irq_config to 1 for the IRQ_WORK items which are invoked in softirq
context so lockdep knows that these can safely acquire a spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.643576700@linutronix.de
2020-03-21 16:00:24 +01:00
Sebastian Andrzej Siewior
40db173965 lockdep: Add hrtimer context tracing bits
Set current->irq_config = 1 for hrtimers which are not marked to expire in
hard interrupt context during hrtimer_init(). These timers will expire in
softirq context on PREEMPT_RT.

Setting this allows lockdep to differentiate these timers. If a timer is
marked to expire in hard interrupt context then the timer callback is not
supposed to acquire a regular spinlock instead of a raw_spinlock in the
expiry callback.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de
2020-03-21 16:00:24 +01:00
Peter Zijlstra
de8f5e4f2d lockdep: Introduce wait-type checks
Extend lockdep to validate lock wait-type context.

The current wait-types are:

	LD_WAIT_FREE,		/* wait free, rcu etc.. */
	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
	LD_WAIT_SLEEP,		/* sleeping locks, mutex_t etc.. */

Where lockdep validates that the current lock (the one being acquired)
fits in the current wait-context (as generated by the held stack).

This ensures that there is no attempt to acquire mutexes while holding
spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
other words, its a more fancy might_sleep().

Obviously RCU made the entire ordeal more complex than a simple single
value test because RCU can be acquired in (pretty much) any context and
while it presents a context to nested locks it is not the same as it
got acquired in.

Therefore its necessary to split the wait_type into two values, one
representing the acquire (outer) and one representing the nested context
(inner). For most 'normal' locks these two are the same.

[ To make static initialization easier we have the rule that:
  .outer == INV means .outer == .inner; because INV == 0. ]

It further means that its required to find the minimal .inner of the held
stack to compare against the outer of the new lock; because while 'normal'
RCU presents a CONFIG type to nested locks, if it is taken while already
holding a SPIN type it obviously doesn't relax the rules.

Below is an example output generated by the trivial test code:

  raw_spin_lock(&foo);
  spin_lock(&bar);
  spin_unlock(&bar);
  raw_spin_unlock(&foo);

 [ BUG: Invalid wait context ]
 -----------------------------
 swapper/0/1 is trying to lock:
 ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
 other info that might help us debug this:
 1 lock held by swapper/0/1:
  #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187

The way to read it is to look at the new -{n,m} part in the lock
description; -{3:3} for the attempted lock, and try and match that up to
the held locks, which in this case is the one: -{2,2}.

This tells that the acquiring lock requires a more relaxed environment than
presented by the lock stack.

Currently only the normal locks and RCU are converted, the rest of the
lockdep users defaults to .inner = INV which is ignored. More conversions
can be done when desired.

The check for spinlock_t nesting is not enabled by default. It's a separate
config option for now as there are known problems which are currently
addressed. The config option allows to identify these problems and to
verify that the solutions found are indeed solving them.

The config switch will be removed and the checks will permanently enabled
once the vast majority of issues has been addressed.

[ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
	   failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
[ tglx: Add the config option ]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
2020-03-21 16:00:24 +01:00
Thomas Gleixner
a5c6234e10 completion: Use simple wait queues
completion uses a wait_queue_head_t to enqueue waiters.

wait_queue_head_t contains a spinlock_t to protect the list of waiters
which excludes it from being used in truly atomic context on a PREEMPT_RT
enabled kernel.

The spinlock in the wait queue head cannot be replaced by a raw_spinlock
because:

  - wait queues can have custom wakeup callbacks, which acquire other
    spinlock_t locks and have potentially long execution times

  - wake_up() walks an unbounded number of list entries during the wake up
    and may wake an unbounded number of waiters.

For simplicity and performance reasons complete() should be usable on
PREEMPT_RT enabled kernels.

completions do not use custom wakeup callbacks and are usually single
waiter, except for a few corner cases.

Replace the wait queue in the completion with a simple wait queue (swait),
which uses a raw_spinlock_t for protecting the waiter list and therefore is
safe to use inside truly atomic regions on PREEMPT_RT.

There is no semantical or functional change:

  - completions use the exclusive wait mode which is what swait provides

  - complete() wakes one exclusive waiter

  - complete_all() wakes all waiters while holding the lock which protects
    the wait queue against newly incoming waiters. The conversion to swait
    preserves this behaviour.

complete_all() might cause unbound latencies with a large number of waiters
being woken at once, but most complete_all() usage sites are either in
testing or initialization code or have only a really small number of
concurrent waiters which for now does not cause a latency problem. Keep it
simple for now.

The fixup of the warning check in the USB gadget driver is just a straight
forward conversion of the lockless waiter check from one waitqueue type to
the other.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200321113242.317954042@linutronix.de
2020-03-21 16:00:24 +01:00
Thomas Gleixner
b3212fe2bc sched/swait: Prepare usage in completions
As a preparation to use simple wait queues for completions:

  - Provide swake_up_all_locked() to support complete_all()
  - Make __prepare_to_swait() public available

This is done to enable the usage of complete() within truly atomic contexts
on a PREEMPT_RT enabled kernel.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.228481202@linutronix.de
2020-03-21 16:00:23 +01:00
Thomas Gleixner
e5d4d1756b timekeeping: Split jiffies seqlock
seqlock consists of a sequence counter and a spinlock_t which is used to
serialize the writers. spinlock_t is substituted by a "sleeping" spinlock
on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping
code as the writers are executed in hard interrupt and therefore
non-preemptible context even on PREEMPT_RT.

The spinlock in seqlock cannot be unconditionally replaced by a
raw_spinlock_t as many seqlock users have nesting spinlock sections or
other code which is not suitable to run in truly atomic context on RT.

Instead of providing a raw_seqlock API for a single use case, open code the
seqlock for the jiffies use case and implement it with a raw_spinlock_t and
a sequence counter.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.120587764@linutronix.de
2020-03-21 16:00:23 +01:00
Peter Zijlstra (Intel)
80fbaf1c3f rcuwait: Add @state argument to rcuwait_wait_event()
Extend rcuwait_wait_event() with a state variable so that it is not
restricted to UNINTERRUPTIBLE waits.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113241.824030968@linutronix.de
2020-03-21 16:00:22 +01:00
Marco Elver
f5d2313bd3 kcsan, trace: Make KCSAN compatible with tracing
Previously the system would lock up if ftrace was enabled together with
KCSAN. This is due to recursion on reporting if the tracer code is
instrumented with KCSAN.

To avoid this for all types of tracing, disable KCSAN instrumentation
for all of kernel/trace.

Furthermore, since KCSAN relies on udelay() to introduce delay, we have
to disable ftrace for udelay() (currently done for x86) in case KCSAN is
used together with lockdep and ftrace. The reason is that it may corrupt
lockdep IRQ flags tracing state due to a peculiar case of recursion
(details in Makefile comment).

Reported-by: Qian Cai <cai@lca.pw>
Tested-by: Qian Cai <cai@lca.pw>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:44:41 +01:00
Marco Elver
703b321501 kcsan: Introduce ASSERT_EXCLUSIVE_BITS(var, mask)
This introduces ASSERT_EXCLUSIVE_BITS(var, mask).
ASSERT_EXCLUSIVE_BITS(var, mask) will cause KCSAN to assume that the
following access is safe w.r.t. data races (however, please see the
docbook comment for disclaimer here).

For more context on why this was considered necessary, please see:

  http://lkml.kernel.org/r/1580995070-25139-1-git-send-email-cai@lca.pw

In particular, before this patch, data races between reads (that use
@mask bits of an access that should not be modified concurrently) and
writes (that change ~@mask bits not used by the readers) would have been
annotated with "data_race()" (or "READ_ONCE()"). However, doing so would
then hide real problems: we would no longer be able to detect harmful
races between reads to @mask bits and writes to @mask bits.

Therefore, by using ASSERT_EXCLUSIVE_BITS(var, mask), we accomplish:

  1. Avoid proliferation of specific macros at the call sites: by
     including a single mask in the argument list, we can use the same
     macro in a wide variety of call sites, regardless of how and which
     bits in a field each call site actually accesses.

  2. The existing code does not need to be modified (although READ_ONCE()
     may still be advisable if we cannot prove that the data race is
     always safe).

  3. We catch bugs where the exclusive bits are modified concurrently.

  4. We document properties of the current code.

Acked-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Qian Cai <cai@lca.pw>
2020-03-21 09:44:14 +01:00
Marco Elver
81af89e158 kcsan: Add kcsan_set_access_mask() support
When setting up an access mask with kcsan_set_access_mask(), KCSAN will
only report races if concurrent changes to bits set in access_mask are
observed. Conveying access_mask via a separate call avoids introducing
overhead in the common-case fast-path.

Acked-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:44:08 +01:00
Marco Elver
b738f6169f kcsan: Introduce kcsan_value_change type
Introduces kcsan_value_change type, which explicitly points out if we
either observed a value-change (TRUE), or we could not observe one but
cannot rule out a value-change happened (MAYBE). The MAYBE state can
either be reported or not, depending on configuration preferences.

A follow-up patch introduces the FALSE state, which should never be
reported.

No functional change intended.

Acked-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:44:03 +01:00
Marco Elver
3a5b45e503 kcsan: Fix misreporting if concurrent races on same address
If there are at least 4 threads racing on the same address, it can
happen that one of the readers may observe another matching reader in
other_info. To avoid locking up, we have to consume 'other_info'
regardless, but skip the report. See the added comment for more details.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:43:41 +01:00
Marco Elver
80d4c47752 kcsan: Expose core configuration parameters as module params
This adds early_boot, udelay_{task,interrupt}, and skip_watch as module
params. The latter parameters are useful to modify at runtime to tune
KCSAN's performance on new systems. This will also permit auto-tuning
these parameters to maximize overall system performance and KCSAN's race
detection ability.

None of the parameters are used in the fast-path and referring to them
via static variables instead of CONFIG constants will not affect
performance.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Qian Cai <cai@lca.pw>
2020-03-21 09:43:35 +01:00
Marco Elver
a312013578 kcsan: Add test to generate conflicts via debugfs
Add 'test=<iters>' option to KCSAN's debugfs interface to invoke KCSAN
checks on a dummy variable. By writing 'test=<iters>' to the debugfs
file from multiple tasks, we can generate real conflicts, and trigger
data race reports.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:43:30 +01:00
Marco Elver
d591ec3db7 kcsan: Introduce KCSAN_ACCESS_ASSERT access type
The KCSAN_ACCESS_ASSERT access type may be used to introduce dummy reads
and writes to assert certain properties of concurrent code, where bugs
could not be detected as normal data races.

For example, a variable that is only meant to be written by a single
CPU, but may be read (without locking) by other CPUs must still be
marked properly to avoid data races. However, concurrent writes,
regardless if WRITE_ONCE() or not, would be a bug. Using
kcsan_check_access(&x, sizeof(x), KCSAN_ACCESS_ASSERT) would allow
catching such bugs.

To support KCSAN_ACCESS_ASSERT the following notable changes were made:

  * If an access is of type KCSAN_ASSERT_ACCESS, disable various filters
    that only apply to data races, so that all races that KCSAN observes are
    reported.
  * Bug reports that involve an ASSERT access type will be reported as
    "KCSAN: assert: race in ..." instead of "data-race"; this will help
    more easily distinguish them.
  * Update a few comments to just mention 'races' where we do not always
    mean pure data races.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:42:50 +01:00
Marco Elver
ed95f95c86 kcsan: Fix 0-sized checks
Instrumentation of arbitrary memory-copy functions, such as user-copies,
may be called with size of 0, which could lead to false positives.

To avoid this, add a comparison in check_access() for size==0, which
will be optimized out for constant sized instrumentation
(__tsan_{read,write}N), and therefore not affect the common-case
fast-path.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:42:42 +01:00
Marco Elver
1e6ee2f0fe kcsan: Add option to assume plain aligned writes up to word size are atomic
This adds option KCSAN_ASSUME_PLAIN_WRITES_ATOMIC. If enabled, plain
aligned writes up to word size are assumed to be atomic, and also not
subject to other unsafe compiler optimizations resulting in data races.

This option has been enabled by default to reflect current kernel-wide
preferences.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:42:18 +01:00
Marco Elver
ad4f8eeca8 kcsan: Address missing case with KCSAN_REPORT_VALUE_CHANGE_ONLY
Even with KCSAN_REPORT_VALUE_CHANGE_ONLY, KCSAN still reports data
races between reads and watchpointed writes, even if the writes wrote
values already present.  This commit causes KCSAN to unconditionally
skip reporting in this case.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:41:29 +01:00
Marco Elver
f1bc96210c kcsan: Make KCSAN compatible with lockdep
We must avoid any recursion into lockdep if KCSAN is enabled on utilities
used by lockdep. One manifestation of this is corruption of lockdep's
IRQ trace state (if TRACE_IRQFLAGS), resulting in spurious warnings
(see below).  This commit fixes this by:

1. Using raw_local_irq{save,restore} in kcsan_setup_watchpoint().
2. Disabling lockdep in kcsan_report().

Tested with:

  CONFIG_LOCKDEP=y
  CONFIG_DEBUG_LOCKDEP=y
  CONFIG_TRACE_IRQFLAGS=y

This fix eliminates spurious warnings such as the following one:

    WARNING: CPU: 0 PID: 2 at kernel/locking/lockdep.c:4406 check_flags.part.0+0x101/0x220
    Modules linked in:
    CPU: 0 PID: 2 Comm: kthreadd Not tainted 5.5.0-rc1+ #11
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    RIP: 0010:check_flags.part.0+0x101/0x220
    <snip>
    Call Trace:
     lock_is_held_type+0x69/0x150
     freezer_fork+0x20b/0x370
     cgroup_post_fork+0x2c9/0x5c0
     copy_process+0x2675/0x3b40
     _do_fork+0xbe/0xa30
     ? _raw_spin_unlock_irqrestore+0x40/0x50
     ? match_held_lock+0x56/0x250
     ? kthread_park+0xf0/0xf0
     kernel_thread+0xa6/0xd0
     ? kthread_park+0xf0/0xf0
     kthreadd+0x321/0x3d0
     ? kthread_create_on_cpu+0x130/0x130
     ret_from_fork+0x3a/0x50
    irq event stamp: 64
    hardirqs last  enabled at (63): [<ffffffff9a7995d0>] _raw_spin_unlock_irqrestore+0x40/0x50
    hardirqs last disabled at (64): [<ffffffff992a96d2>] kcsan_setup_watchpoint+0x92/0x460
    softirqs last  enabled at (32): [<ffffffff990489b8>] fpu__copy+0xe8/0x470
    softirqs last disabled at (30): [<ffffffff99048939>] fpu__copy+0x69/0x470

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Tested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:41:16 +01:00
Marco Elver
05f9a40679 kcsan: Rate-limit reporting per data races
KCSAN data-race reports can occur quite frequently, so much so as
to render the system useless.  This commit therefore adds support for
time-based rate-limiting KCSAN reports, with the time interval specified
by a new KCSAN_REPORT_ONCE_IN_MS Kconfig option.  The default is 3000
milliseconds, also known as three seconds.

Because KCSAN must detect data races in allocators and in other contexts
where use of allocation is ill-advised, a fixed-size array is used to
buffer reports during each reporting interval.  To reduce the number of
reports lost due to array overflow, this commit stores only one instance
of duplicate reports, which has the benefit of further reducing KCSAN's
console output rate.

Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:40:52 +01:00
Marco Elver
47144eca28 kcsan: Show full access type in report
This commit adds access-type information to KCSAN's reports as follows:
"read", "read (marked)", "write", and "write (marked)".

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:40:42 +01:00
Marco Elver
5c36142574 kcsan: Prefer __always_inline for fast-path
Prefer __always_inline for fast-path functions that are called outside
of user_access_save, to avoid generating UACCESS warnings when
optimizing for size (CC_OPTIMIZE_FOR_SIZE). It will also avoid future
surprises with compiler versions that change the inlining heuristic even
when optimizing for performance.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/58708908-84a0-0a81-a836-ad97e33dbb62@infradead.org
2020-03-21 09:40:19 +01:00
Ingo Molnar
df10846ff2 Merge branch 'linus' into locking/kcsan, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:35:44 +01:00
Ingo Molnar
a4654e9bde Merge branch 'x86/kdump' into locking/kcsan, to resolve conflicts
Conflicts:
	arch/x86/purgatory/Makefile

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-21 09:24:41 +01:00
Greg Kroah-Hartman
5c6f258879 bpf: Explicitly memset some bpf info structures declared on the stack
Trying to initialize a structure with "= {};" will not always clean out
all padding locations in a structure. So be explicit and call memset to
initialize everything for a number of bpf information structures that
are then copied from userspace, sometimes from smaller memory locations
than the size of the structure.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320162258.GA794295@kroah.com
2020-03-20 21:05:22 +01:00
Greg Kroah-Hartman
8096f22942 bpf: Explicitly memset the bpf_attr structure
For the bpf syscall, we are relying on the compiler to properly zero out
the bpf_attr union that we copy userspace data into. Unfortunately that
doesn't always work properly, padding and other oddities might not be
correctly zeroed, and in some tests odd things have been found when the
stack is pre-initialized to other values.

Fix this by explicitly memsetting the structure to 0 before using it.

Reported-by: Maciej Żenczykowski <maze@google.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Alexander Potapenko <glider@google.com>
Reported-by: Alistair Delva <adelva@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://android-review.googlesource.com/c/kernel/common/+/1235490
Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com
2020-03-20 21:04:30 +01:00
Peter Zijlstra
f6f48e1804 lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions
nmi_enter() does lockdep_off() and hence lockdep ignores everything.

And NMI context makes it impossible to do full IN-NMI tracking like we
do IN-HARDIRQ, that could result in graph_lock recursion.

However, since look_up_lock_class() is lockless, we can find the class
of a lock that has prior use and detect IN-NMI after USED, just not
USED after IN-NMI.

NOTE: By shifting the lockdep_off() recursion count to bit-16, we can
easily differentiate between actual recursion and off.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lkml.kernel.org/r/20200221134215.090538203@infradead.org
2020-03-20 13:06:25 +01:00
Peter Zijlstra
248efb2158 locking/lockdep: Rework lockdep_lock
A few sites want to assert we own the graph_lock/lockdep_lock, provide
a more conventional lock interface for it with a number of trivial
debug checks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200313102107.GX12561@hirez.programming.kicks-ass.net
2020-03-20 13:06:25 +01:00
Peter Zijlstra
10476e6304 locking/lockdep: Fix bad recursion pattern
There were two patterns for lockdep_recursion:

Pattern-A:
	if (current->lockdep_recursion)
		return

	current->lockdep_recursion = 1;
	/* do stuff */
	current->lockdep_recursion = 0;

Pattern-B:
	current->lockdep_recursion++;
	/* do stuff */
	current->lockdep_recursion--;

But a third pattern has emerged:

Pattern-C:
	current->lockdep_recursion = 1;
	/* do stuff */
	current->lockdep_recursion = 0;

And while this isn't broken per-se, it is highly dangerous because it
doesn't nest properly.

Get rid of all Pattern-C instances and shore up Pattern-A with a
warning.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200313093325.GW12561@hirez.programming.kicks-ass.net
2020-03-20 13:06:25 +01:00