Commit Graph

66951 Commits

Author SHA1 Message Date
Linus Torvalds
4962a85696 io_uring-5.10-2020-10-20
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+O9WEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqajEADSD5PiO94YWTtVNWFUjn5RW+GlCE70/0VV
 XImdasmvM8nb48E+z2EW0Ky4vKXeVy5r+WZAeIYqPUHy/ogQDVpEn00NL7tFQmOz
 8UYrlZ3LLE/bSeWM5iavgG7TldVs/ZfspJ0hj3/Ac7jJpzuRGEI5TClsxJ0mWV39
 b2qT4OYBDhwvwVPZ/qhWgEwJXJFzFywckouIbMw8gPkveebUYeyu/yuScNGwYuiQ
 46YPEk/XIuOy8iUvQjqhLY+NNlAKJwt3z9WZgt5F+TZIhkpp7z6h20+jezFQcuFP
 GXzIDN+EADpsbw7MWJYIZVffxDEMlDpkJlAVMT1hsYLDfTXoEzmFwRddoFh2Fjf6
 ghWqhOKffUuAOX2xs1MrS2xLaxd0ot7QqZJVTYk7zEljkaRANlstSZZ+PpI+Sad/
 rNieQvs6jnsmTODDEaV3qyFX5aBQ2NdvyndZNU9wz0GZAWAdz+wxE0A1FVD0A37i
 p6m8sIvhNg3/cW89G04IDYUkAygi8knVDnEDHRwaJtswZQ4pRSGMp+N4qZ0GpnK7
 BviaAhofGaYlqruavO6Ug2YyomYpWGlUxTaB9ZKh0HkEDlDM945+0sgQRdxfsE8d
 OboycqJn3puOl/wh5Fc4oGYrWLsDbaA/5kksC4lm85Z+HUf+UXMS4QFdoPJYjhuM
 H6oMz1w2bQ==
 =v56S
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "A mix of fixes and a few stragglers. In detail:

   - Revert the bogus __read_mostly that we discussed for the initial
     pull request.

   - Fix a merge window regression with fixed file registration error
     path handling.

   - Fix io-wq numa node affinities.

   - Series abstracting out an io_identity struct, making it both easier
     to see what the personality items are, and also easier to to adopt
     more. Use this to cover audit logging.

   - Fix for read-ahead disabled block condition in async buffered
     reads, and using single page read-ahead to unify what
     generic_file_buffer_read() path is used.

   - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel)

   - Poll fix (Pavel)"

* tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits)
  io_uring: use blk_queue_nowait() to check if NOWAIT supported
  mm: use limited read-ahead to satisfy read
  mm: mark async iocb read as NOWAIT once some data has been copied
  io_uring: fix double poll mask init
  io-wq: inherit audit loginuid and sessionid
  io_uring: use percpu counters to track inflight requests
  io_uring: assign new io_identity for task if members have changed
  io_uring: store io_identity in io_uring_task
  io_uring: COW io_identity on mismatch
  io_uring: move io identity items into separate struct
  io_uring: rely solely on work flags to determine personality.
  io_uring: pass required context in as flags
  io-wq: assign NUMA node locality if appropriate
  io_uring: fix error path cleanup in io_sqe_files_register()
  Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
  io_uring: fix REQ_F_COMP_LOCKED by killing it
  io_uring: dig out COMP_LOCK from deep call chain
  io_uring: don't put a poll req under spinlock
  io_uring: don't unnecessarily clear F_LINK_TIMEOUT
  io_uring: don't set COMP_LOCKED if won't put
  ...
2020-10-20 13:19:30 -07:00
Linus Torvalds
bbe85027ce Recalling the first round of new code for 5.10, in which we added:
- New feature: Widen inode timestamps and quota grace expiration
   timestamps to support dates through the year 2486.
 - New feature: storing inode btree counts in the AGI to speed up certain
   mount time per-AG block reservation operatoins and add a little more
   metadata redundancy.
 
 For the second round of new code for 5.10:
 - Deprecate the V4 filesystem format, some disused mount options, and some
   legacy sysctl knobs now that we can support dates into the 25th century.
   Note that removal of V4 support will not happen until the early 2030s.
 - Fix some probles with inode realtime flag propagation.
 - Fix some buffer handling issues when growing a rt filesystem.
 - Fix a problem where a BMAP_REMAP unmap call would free rt extents even
   though the purpose of BMAP_REMAP is to avoid freeing the blocks.
 - Strengthen the dabtree online scrubber to check hash values on child
   dabtree blocks.
 - Actually log new intent items created as part of recovering log intent
   items.
 - Fix a bug where quotas weren't attached to an inode undergoing bmap
   intent item recovery.
 - Fix a buffer overrun problem with specially crafted log buffer
   headers.
 - Various cleanups to type usage and slightly inaccurate comments.
 - More cleanups to the xattr, log, and quota code.
 - Don't run the (slower) shared-rmap operations on attr fork mappings.
 - Fix a bug where we failed to check the LSN of finobt blocks during
   replay and could therefore overwrite newer data with older data.
 - Clean up the ugly nested transaction mess that log recovery uses to
   stage intent item recovery in the correct order by creating a proper
   data structure to capture recovered chains.
 - Use the capture structure to resume intent item chains with the
   same log space and block reservations as when they were captured.
 - Fix a UAF bug in bmap intent item recovery where we failed to maintain
   our reference to the incore inode if the bmap operation needed to
   relog itself to continue.
 - Rearrange the defer ops mechanism to finish newly created subtasks
   of a parent task before moving on to the next parent task.
 - Automatically relog intent items in deferred ops chains if doing so
   would help us avoid pinning the log tail.  This will help fix some
   log scaling problems now and will facilitate atomic file updates later.
 - Fix a deadlock in the GETFSMAP implementation by using an internal
   memory buffer to reduce indirect calls and copies to userspace,
   thereby improving its performance by ~20%.
 - Fix various problems when calling growfs on a realtime volume would
   not fully update the filesystem metadata.
 - Fix broken Kconfig asking about deprecated XFS when XFS is disabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+KRuUACgkQ+H93GTRK
 tOtqsBAAm+AZ92DRjOD7/TbU3vJALRKBBUCc6weEYUJKaZUkdYpx7Fn8Si3K8Nu0
 Pxo8qLO8WtP3ECyd+CZgkQgZAhHrjRG+FnCOuNyj1yMguX9CDu4cK0dOh/M64+pM
 BvWPqLfd99mzr7HkQ0SuLIyDMeio3leU4lySAIVpADO3V7WF5ZgHCfEETpOh5Di1
 oIzhYlxHyfK+32u4sXSWsPnogQZwjyn4CyQ+6humK0d089pVB1wbjHaTym7exjSa
 cFhMqS1XDbpMuoF4BXMcx31UTOb+8/S6TKCVsRl61j3XKGzbYKSrLmrSb/r6gFWn
 wyXJGmLok0I2UDnX1ZArIWstJHcgPlTelWrssG8wAnopLSJoU10f8o88m43d0krF
 fCUCac1rKPcisg7CS5njgUkOBknSLeBCeztl59N/8acnkaETPQr0tReDpB4wGGaW
 aGEWBrCbz1QZyfDBttNPQLcreROGukZ8R8MMRl4GiAQwZz5UrTUFeoK6thplHVvp
 ANhpYGdJy4jJ79wt4MNVYUF8U8IRWdn0ddsRx08pLWchC1PH8HH944qrUXAVPYZ+
 MohSQqKtjvaKwZLP86SvCJFs20wEEUxzCSQbz4LTO5aBz3uDfo0LQSOYzPBU+OKp
 E33SNds13nsjeH8HBLtXH3lr3absywLcV2ZMaIGsQSLpE2p8AHA=
 =LuWg
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "The second large pile of new stuff for 5.10, with changes even more
  monumental than last week!

  We are formally announcing the deprecation of the V4 filesystem format
  in 2030. All users must upgrade to the V5 format, which contains
  design improvements that greatly strengthen metadata validation,
  supports reflink and online fsck, and is the intended vehicle for
  handling timestamps past 2038. We're also deprecating the old Irix
  behavioral tweaks in September 2025.

  Coming along for the ride are two design changes to the deferred
  metadata ops subsystem. One of the improvements is to retain correct
  logical ordering of tasks and subtasks, which is a more logical design
  for upper layers of XFS and will become necessary when we add atomic
  file range swaps and commits. The second improvement to deferred ops
  improves the scalability of the log by helping the log tail to move
  forward during long-running operations. This reduces log contention
  when there are a large number of threads trying to run transactions.

  In addition to that, this fixes numerous small bugs in log recovery;
  refactors logical intent log item recovery to remove the last
  remaining place in XFS where we could have nested transactions; fixes
  a couple of ways that intent log item recovery could fail in ways that
  wouldn't have happened in the regular commit paths; fixes a deadlock
  vector in the GETFSMAP implementation (which improves its performance
  by 20%); and fixes serious bugs in the realtime growfs, fallocate, and
  bitmap handling code.

  Summary:

   - Deprecate the V4 filesystem format, some disused mount options, and
     some legacy sysctl knobs now that we can support dates into the
     25th century. Note that removal of V4 support will not happen until
     the early 2030s.

   - Fix some probles with inode realtime flag propagation.

   - Fix some buffer handling issues when growing a rt filesystem.

   - Fix a problem where a BMAP_REMAP unmap call would free rt extents
     even though the purpose of BMAP_REMAP is to avoid freeing the
     blocks.

   - Strengthen the dabtree online scrubber to check hash values on
     child dabtree blocks.

   - Actually log new intent items created as part of recovering log
     intent items.

   - Fix a bug where quotas weren't attached to an inode undergoing bmap
     intent item recovery.

   - Fix a buffer overrun problem with specially crafted log buffer
     headers.

   - Various cleanups to type usage and slightly inaccurate comments.

   - More cleanups to the xattr, log, and quota code.

   - Don't run the (slower) shared-rmap operations on attr fork
     mappings.

   - Fix a bug where we failed to check the LSN of finobt blocks during
     replay and could therefore overwrite newer data with older data.

   - Clean up the ugly nested transaction mess that log recovery uses to
     stage intent item recovery in the correct order by creating a
     proper data structure to capture recovered chains.

   - Use the capture structure to resume intent item chains with the
     same log space and block reservations as when they were captured.

   - Fix a UAF bug in bmap intent item recovery where we failed to
     maintain our reference to the incore inode if the bmap operation
     needed to relog itself to continue.

   - Rearrange the defer ops mechanism to finish newly created subtasks
     of a parent task before moving on to the next parent task.

   - Automatically relog intent items in deferred ops chains if doing so
     would help us avoid pinning the log tail. This will help fix some
     log scaling problems now and will facilitate atomic file updates
     later.

   - Fix a deadlock in the GETFSMAP implementation by using an internal
     memory buffer to reduce indirect calls and copies to userspace,
     thereby improving its performance by ~20%.

   - Fix various problems when calling growfs on a realtime volume would
     not fully update the filesystem metadata.

   - Fix broken Kconfig asking about deprecated XFS when XFS is
     disabled"

* tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
  xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
  xfs: fix high key handling in the rt allocator's query_range function
  xfs: annotate grabbing the realtime bitmap/summary locks in growfs
  xfs: make xfs_growfs_rt update secondary superblocks
  xfs: fix realtime bitmap/summary file truncation when growing rt volume
  xfs: fix the indent in xfs_trans_mod_dquot
  xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
  xfs: fix deadlock and streamline xfs_getfsmap performance
  xfs: limit entries returned when counting fsmap records
  xfs: only relog deferred intent items if free space in the log gets low
  xfs: expose the log push threshold
  xfs: periodically relog deferred intent items
  xfs: change the order in which child and parent defer ops are finished
  xfs: fix an incore inode UAF in xfs_bui_recover
  xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
  xfs: clean up bmap intent item recovery checking
  xfs: xfs_defer_capture should absorb remaining transaction reservation
  xfs: xfs_defer_capture should absorb remaining block reservations
  xfs: proper replay of deferred ops queued during log recovery
  xfs: remove XFS_LI_RECOVERED
  ...
2020-10-19 14:38:46 -07:00
Linus Torvalds
694565356c fuse update for 5.10
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCX4n0/gAKCRDh3BK/laaZ
 PM3jAP4xhaix0j/y3VyaxsUqWg6ZSrjq6X0o9clGMJv27IAtjgD/fJ7ZwzTldojD
 qb7N3utjLiPVRjwFmvsZ8JZ7O7PbwQ0=
 =oUbZ
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Support directly accessing host page cache from virtiofs. This can
   improve I/O performance for various workloads, as well as reducing
   the memory requirement by eliminating double caching. Thanks to Vivek
   Goyal for doing most of the work on this.

 - Allow automatic submounting inside virtiofs. This allows unique
   st_dev/ st_ino values to be assigned inside the guest to files
   residing on different filesystems on the host. Thanks to Max Reitz
   for the patches.

 - Fix an old use after free bug found by Pradeep P V K.

* tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
  virtiofs: calculate number of scatter-gather elements accurately
  fuse: connection remove fix
  fuse: implement crossmounts
  fuse: Allow fuse_fill_super_common() for submounts
  fuse: split fuse_mount off of fuse_conn
  fuse: drop fuse_conn parameter where possible
  fuse: store fuse_conn in fuse_req
  fuse: add submount support to <uapi/linux/fuse.h>
  fuse: fix page dereference after free
  virtiofs: add logic to free up a memory range
  virtiofs: maintain a list of busy elements
  virtiofs: serialize truncate/punch_hole and dax fault path
  virtiofs: define dax address space operations
  virtiofs: add DAX mmap support
  virtiofs: implement dax read/write operations
  virtiofs: introduce setupmapping/removemapping commands
  virtiofs: implement FUSE_INIT map_alignment field
  virtiofs: keep a list of free dax memory ranges
  virtiofs: add a mount option to enable dax
  virtiofs: set up virtio_fs dax_device
  ...
2020-10-19 14:28:30 -07:00
Linus Torvalds
922a763ae1 zonefs changes for 5.10
This pull request introduces the following changes to zonefs:
 
 * Add the "explicit-open" mount option to automatically issue a
   REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
   is open for writing for the first time. This avoids "insufficient zone
   resources" errors for write operations on some drives with limited
   zone resources or on ZNS drives with a limited number of active zones.
   From Johannes.
 
 Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCX4zOWAAKCRDdoc3SxdoY
 dh7PAP9IdcYnR9x6ttd2Aqsm17IBfY6b/TroE70Lm2YTlY0nTgD+IJTwYQG8KQAE
 QHAe6TD6VQfSftOeAOAnjEG64Iv2hQE=
 =vwu+
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs updates from Damien Le Moal:
 "Add an 'explicit-open' mount option to automatically issue a
  REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
  is open for writing for the first time.

  This avoids 'insufficient zone resources' errors for write operations
  on some drives with limited zone resources or on ZNS drives with a
  limited number of active zones. From Johannes"

* tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: document the explicit-open mount option
  zonefs: open/close zone on file open/close
  zonefs: provide no-lock zonefs_io_error variant
  zonefs: introduce helper for zone management
2020-10-19 13:52:01 -07:00
Jeffle Xu
9ba0d0c812 io_uring: use blk_queue_nowait() to check if NOWAIT supported
commit 021a24460d ("block: add QUEUE_FLAG_NOWAIT") adds a new helper
function blk_queue_nowait() to check if the bdev supports handling of
REQ_NOWAIT or not. Since then bio-based dm device can also support
REQ_NOWAIT, and currently only dm-linear supports that since
commit 6abc49468e ("dm: add support for REQ_NOWAIT and enable it for
linear target").

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-19 07:32:36 -06:00
Linus Torvalds
1912b04e0f Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
 "Subsystems affected by this patch series: mm (memcg, migration,
  pagemap, gup, madvise, vmalloc), ia64, and misc"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
  mm: remove duplicate include statement in mmu.c
  mm: remove the filename in the top of file comment in vmalloc.c
  mm: cleanup the gfp_mask handling in __vmalloc_area_node
  mm: remove alloc_vm_area
  x86/xen: open code alloc_vm_area in arch_gnttab_valloc
  xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv
  drm/i915: use vmap in i915_gem_object_map
  drm/i915: stop using kmap in i915_gem_object_map
  drm/i915: use vmap in shmem_pin_map
  zsmalloc: switch from alloc_vm_area to get_vm_area
  mm: allow a NULL fn callback in apply_to_page_range
  mm: add a vmap_pfn function
  mm: add a VM_MAP_PUT_PAGES flag for vmap
  mm: update the documentation for vfree
  mm/madvise: introduce process_madvise() syscall: an external memory hinting API
  pid: move pidfd_get_pid() to pid.c
  mm/madvise: pass mm to do_madvise
  selftests/vm: 10x speedup for hmm-tests
  binfmt_elf: take the mmap lock around find_extend_vma()
  mm/gup_benchmark: take the mmap lock around GUP
  ...
2020-10-18 12:25:25 -07:00
Linus Torvalds
429731277d This pull request contains fixes for UBI and UBIFS
UBI:
 - Correctly use kthread_should_stop in ubi worker
 
 UBIFS:
 
 - Fixes for memory leaks while iterating directory entries
 - Fix for a user triggerable error message
 - Fix for a space accounting bug in authenticated mode
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl+LRl4WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wVvTD/0ffSK9y4oECcW1/+84oHb5515g
 5/CxYf5zalruOXWzA55OxI55BL18e1tnS26jT1G79BpNoitPLbqh/OhvDVtHUqEI
 4Chd9PdeFb33GFubElcTviIBsGKMD2eYK5AVTX6fXxxRG8+UxT0u9T+1GXZGzlHA
 0N3qWBuDhAlrh65UtORulpBOAexLymbSeINS7ibTXKqo3+sc70xjFZYTyt9Tr9np
 q2VVI+SS019A30RrIzfeaSpyfDZ1tdh5vhfsm6eGbearHpUrX6OZgFUQglRCF7DS
 bMTTODH+feJAPyknd9T1EdXrVNjzX24i1V3/wM8hC7qUZfaf1ZHeCuu0t353iiGn
 dXg+qA/v+AKTYh71MRfHd8GJvmKHospdhze1K5IIvA+lL6+bRwur88KVF8PwyaB7
 KHRAghKLEdVBb68MzwF0ChbFSUDk6VTZdvj0FB1LO/h3YQ1I2Dp1Z946qjJm9bWF
 biATqHmR1hSPAyAP/VUGCdRqlwpuox5cUgoa7SwO3zP1aqBRG29Wsg1JJuqy74GN
 Dcov5vIhVly/zns6tKaHFleTeAMBO7e1fhEjg4TuMpzWt9v8eYnacWwR1Hfvm9VO
 +0rox1g5mDX/ZzPE/Koj3LFnr8JV8g0DPs0Uemz3D89xt3ftO0Du3NZSd8WgOcbS
 mZcpgTakNv2/DbTiGA==
 =RD+h
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull more ubi and ubifs updates from Richard Weinberger:
 "UBI:
   - Correctly use kthread_should_stop in ubi worker

  UBIFS:
   - Fixes for memory leaks while iterating directory entries
   - Fix for a user triggerable error message
   - Fix for a space accounting bug in authenticated mode"

* tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubifs: journal: Make sure to not dirty twice for auth nodes
  ubifs: setflags: Don't show error message when vfs_ioc_setflags_prepare() fails
  ubifs: ubifs_jnl_change_xattr: Remove assertion 'nlink > 0' for host inode
  ubi: check kthread_should_stop() after the setting of task state
  ubifs: dent: Fix some potential memory leaks while iterating entries
  ubifs: xattr: Fix some potential memory leaks while iterating entries
2020-10-18 09:56:50 -07:00
Linus Torvalds
a96fd1cc3f This pull request contains changes for UBIFS
- Kernel-doc fixes
 - Fixes for memory leaks in authentication option parsing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAl+LRSgWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wZU7EACGIvNIUvPl35GhUxRP7EBGOVdz
 LPonjtBiipeuvIXnwviqzYcCwFJ2a5xFel4pEs3J58AYb9D6++qShgrSw7/g815Z
 aEcLDAuSTkYIj9HzmI03xObe4UHtGoIXpHTj2e8vzq/doJMEXYtZeWzKyMW9IvgJ
 dU4Lq1ZvNNaYpiaglE9ThJUG6oXT3rnHkWR/uFf0MOh2kZ6pH6U8QAGUTZp9ZdCF
 Tzy1ruDS7+ZEox7j7cIBhrmN6NyaSk/I4C9tBisRxDRNZ/ku9b6URd384DKxvli1
 8otLLYBxWypqx4FF0qF2Gyp7ZYGWDcnxLHgrmpafmsbwMqnOrOwjSD+whVKLdJqm
 B2nXzxRw6BNDqViAxQ+K/KISoZq4w3/dREPgvBzjypiDKkYtJ6MksTj4Mbw514b6
 iUjaKpGVMZF8PIjTxn/YWrbxR6UV/On+mbHs94FTyanldRw1GG9UcrcL5TyhaXj0
 WeVhBiU4Aui+vwtBtdPCY1sf6iBE3x40P7tTWe6/CPgc7pE1InVU3ex0517gyEls
 yY0JFVQPNeMIc9C/g12GbDiMA328FSxcCLKd+2+h3PAgcMd+RPvynHxIL5g9hxMC
 GRFUGERxCVHU2VkrOpXA8P6Z3u3Lu9kZkz4UP1dMU+5tiBcw0xN0Ks9TvZm8s/WV
 lGyhX9r+kpfNUiq4hQ==
 =/Spl
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull ubifs updates from Richard Weinberger:

 - Kernel-doc fixes

 - Fixes for memory leaks in authentication option parsing

* tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubifs: mount_ubifs: Release authentication resource in error handling path
  ubifs: Don't parse authentication mount options in remount process
  ubifs: Fix a memleak after dumping authentication mount options
  ubifs: Fix some kernel-doc warnings in tnc.c
  ubifs: Fix some kernel-doc warnings in replay.c
  ubifs: Fix some kernel-doc warnings in gc.c
  ubifs: Fix 'hash' kernel-doc warning in auth.c
2020-10-18 09:51:10 -07:00
Minchan Kim
0726b01e70 mm/madvise: pass mm to do_madvise
Patch series "introduce memory hinting API for external process", v9.

Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API.  With
that, application could give hints to kernel what memory range are
preferred to be reclaimed.  However, in some platform(e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon(e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.

To solve the concern, this patch introduces new syscall -
process_madvise(2).  Bascially, it's same with madvise(2) syscall but it
has some differences.

1. It needs pidfd of target process to provide the hint

2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this
   moment.  Other hints in madvise will be opened when there are explicit
   requests from community to prevent unexpected bugs we couldn't support.

3. Only privileged processes can do something for other process's
   address space.

For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.

This patch (of 3):

In upcoming patches, do_madvise will be called from external process
context so we shouldn't asssume "current" is always hinted process's
task_struct.

Furthermore, we must not access mm_struct via task->mm, but obtain it via
access_mm() once (in the following patch) and only use that pointer [1],
so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
safe, so we can use them further down the call stack.

And let's pass current->mm as arguments of do_madvise so it shouldn't
change existing behavior but prepare next patch to make review easy.

[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
  Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: <linux-man@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Jann Horn
b2767d97f5 binfmt_elf: take the mmap lock around find_extend_vma()
create_elf_tables() runs after setup_new_exec(), so other tasks can
already access our new mm and do things like process_madvise() on it.  (At
the time I'm writing this commit, process_madvise() is not in mainline
yet, but has been in akpm's tree for some time.)

While I believe that there are currently no APIs that would actually allow
another process to mess up our VMA tree (process_madvise() is limited to
MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
under which no syscalls have been executed yet), this seems like an
accident waiting to happen.

Let's make sure that we always take the mmap lock around GUP paths as long
as another process might be able to see the mm.

(Yes, this diff looks suspicious because we drop the lock before doing
anything with `vma`, but that's because we actually don't do anything with
it apart from the NULL check.)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Roman Gushchin
b87d8cefe4 mm, memcg: rework remote charging API to support nesting
Currently the remote memcg charging API consists of two functions:
memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
memcg value, which overwrites the memcg of the current task.

  memalloc_use_memcg(target_memcg);
  <...>
  memalloc_unuse_memcg();

It works perfectly for allocations performed from a normal context,
however an attempt to call it from an interrupt context or just nest two
remote charging blocks will lead to an incorrect accounting.  On exit from
the inner block the active memcg will be cleared instead of being
restored.

  memalloc_use_memcg(target_memcg);

  memalloc_use_memcg(target_memcg_2);
    <...>
    memalloc_unuse_memcg();

    Error: allocation here are charged to the memcg of the current
    process instead of target_memcg.

  memalloc_unuse_memcg();

This patch extends the remote charging API by switching to a single
function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
which sets the new value and returns the old one.  So a remote charging
block will look like:

  old_memcg = set_active_memcg(target_memcg);
  <...>
  set_active_memcg(old_memcg);

This patch is heavily based on the patch by Johannes Weiner, which can be
found here: https://lkml.org/lkml/2020/5/28/806 .

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Schatzberg <dschatzberg@fb.com>
Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-18 09:27:09 -07:00
Pavel Begunkov
58852d4d67 io_uring: fix double poll mask init
__io_queue_proc() is used by both, poll reqs and apoll. Don't use
req->poll.events to copy poll mask because for apoll it aliases with
private data of the request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:47 -06:00
Jens Axboe
4ea33a976b io-wq: inherit audit loginuid and sessionid
Make sure the async io-wq workers inherit the loginuid and sessionid from
the original task, and restore them to unset once we're done with the
async work item.

While at it, disable the ability for kernel threads to write to their own
loginuid.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:47 -06:00
Jens Axboe
d8a6df10aa io_uring: use percpu counters to track inflight requests
Even though we place the req_issued and req_complete in separate
cachelines, there's considerable overhead in doing the atomics
particularly on the completion side.

Get rid of having the two counters, and just use a percpu_counter for
this. That's what it was made for, after all. This considerably
reduces the overhead in __io_free_req().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:47 -06:00
Jens Axboe
500a373d73 io_uring: assign new io_identity for task if members have changed
This avoids doing a copy for each new async IO, if some parts of the
io_identity has changed. We avoid reference counting for the normal
fast path of nothing ever changing.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:46 -06:00
Jens Axboe
5c3462cfd1 io_uring: store io_identity in io_uring_task
This is, by definition, a per-task structure. So store it in the
task context, instead of doing carrying it in each io_kiocb. We're being
a bit inefficient if members have changed, as that requires an alloc and
copy of a new io_identity struct. The next patch will fix that up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:46 -06:00
Jens Axboe
1e6fa5216a io_uring: COW io_identity on mismatch
If the io_identity doesn't completely match the task, then create a
copy of it and use that. The existing copy remains valid until the last
user of it has gone away.

This also changes the personality lookup to be indexed by io_identity,
instead of creds directly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:46 -06:00
Jens Axboe
98447d65b4 io_uring: move io identity items into separate struct
io-wq contains a pointer to the identity, which we just hold in io_kiocb
for now. This is in preparation for putting this outside io_kiocb. The
only exception is struct files_struct, which we'll need different rules
for to avoid a circular dependency.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:45 -06:00
Jens Axboe
dfead8a8e2 io_uring: rely solely on work flags to determine personality.
We solely rely on work->work_flags now, so use that for proper checking
and clearing/dropping of various identity items.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:45 -06:00
Jens Axboe
0f20376588 io_uring: pass required context in as flags
We have a number of bits that decide what context to inherit. Set up
io-wq flags for these instead. This is in preparation for always having
the various members set, but not always needing them for all requests.

No intended functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:45 -06:00
Jens Axboe
a8b595b22d io-wq: assign NUMA node locality if appropriate
There was an assumption that kthread_create_on_node() would properly set
NUMA affinities in terms of CPUs allowed, but it doesn't. Make sure we
do this when creating an io-wq context on NUMA.

Cc: stable@vger.kernel.org
Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:44 -06:00
Jens Axboe
55cbc2564a io_uring: fix error path cleanup in io_sqe_files_register()
syzbot reports the following crash:

general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 PID: 8927 Comm: syz-executor.3 Not tainted 5.9.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline]
RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline]
RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline]
RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553
Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c
RSP: 0018:ffffc90009137d68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000
RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005
RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37
R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000
FS:  00007f4266f3c700(0000) GS:ffff8880ae500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000118c000 CR3: 000000008e57d000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45de59
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f4266f3bc78 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 00000000000083c0 RCX: 000000000045de59
RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000005
RBP: 000000000118bf68 R08: 0000000000000000 R09: 0000000000000000
R10: 40000000000000a1 R11: 0000000000000246 R12: 000000000118bf2c
R13: 00007fff2fa4f12f R14: 00007f4266f3c9c0 R15: 000000000118bf2c
Modules linked in:
---[ end trace 2a40a195e2d5e6e6 ]---
RIP: 0010:io_file_from_index fs/io_uring.c:5963 [inline]
RIP: 0010:io_sqe_files_register fs/io_uring.c:7369 [inline]
RIP: 0010:__io_uring_register fs/io_uring.c:9463 [inline]
RIP: 0010:__do_sys_io_uring_register+0x2fd2/0x3ee0 fs/io_uring.c:9553
Code: ec 03 49 c1 ee 03 49 01 ec 49 01 ee e8 57 61 9c ff 41 80 3c 24 00 0f 85 9b 09 00 00 4d 8b af b8 01 00 00 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 76 09 00 00 49 8b 55 00 89 d8 c1 f8 09 48 98 4c
RSP: 0018:ffffc90009137d68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000ef2a000
RDX: 0000000000040000 RSI: ffffffff81d81dd9 RDI: 0000000000000005
RBP: dffffc0000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffed1012882a37
R13: 0000000000000000 R14: ffffed1012882a38 R15: ffff888094415000
FS:  00007f4266f3c700(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000074a918 CR3: 000000008e57d000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

which is a copy of fget failure condition jumping to cleanup, but the
cleanup requires ctx->file_data to be assigned. Assign it when setup,
and ensure that we clear it again for the error path exit.

Fixes: 5398ae6985 ("io_uring: clean file_data access in files_register")
Reported-by: syzbot+f4ebcc98223dafd8991e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:44 -06:00
Jens Axboe
0918682be4 Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
This reverts commit 738277adc8.

This change didn't make a lot of sense, and as Linus reports, it actually
fails on clang:

   /tmp/io_uring-dd40c4.s:26476: Warning: ignoring changed section
   attributes for .data..read_mostly

The arrays are already marked const so, by definition, they are not
just read-mostly, they are read-only.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:43 -06:00
Pavel Begunkov
216578e55a io_uring: fix REQ_F_COMP_LOCKED by killing it
REQ_F_COMP_LOCKED is used and implemented in a buggy way. The problem is
that the flag is set before io_put_req() but not cleared after, and if
that wasn't the final reference, the request will be freed with the flag
set from some other context, which may not hold a spinlock. That means
possible races with removing linked timeouts and unsynchronised
completion (e.g. access to CQ).

Instead of fixing REQ_F_COMP_LOCKED, kill the flag and use
task_work_add() to move such requests to a fresh context to free from
it, as was done with __io_free_req_finish().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:43 -06:00
Pavel Begunkov
4edf20f999 io_uring: dig out COMP_LOCK from deep call chain
io_req_clean_work() checks REQ_F_COMP_LOCK to pass this two layers up.
Move the check up into __io_free_req(), so at least it doesn't looks so
ugly and would facilitate further changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:43 -06:00
Pavel Begunkov
6a0af224c2 io_uring: don't put a poll req under spinlock
Move io_put_req() in io_poll_task_handler() from under spinlock. This
eliminates the need to use REQ_F_COMP_LOCKED, at the expense of
potentially having to grab the lock again. That's still a better trade
off than relying on the locked flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:42 -06:00
Pavel Begunkov
b1b74cfc19 io_uring: don't unnecessarily clear F_LINK_TIMEOUT
If a request had REQ_F_LINK_TIMEOUT it would've been cleared in
__io_kill_linked_timeout() by the time of __io_fail_links(), so no need
to care about it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:42 -06:00
Pavel Begunkov
368c5481ae io_uring: don't set COMP_LOCKED if won't put
__io_kill_linked_timeout() sets REQ_F_COMP_LOCKED for a linked timeout
even if it can't cancel it, e.g. it's already running. It not only races
with io_link_timeout_fn() for ->flags field, but also leaves the flag
set and so io_link_timeout_fn() may find it and decide that it holds the
lock. Hopefully, the second problem is potential.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:42 -06:00
Colin Ian King
035fbafc7a io_uring: Fix sizeof() mismatch
An incorrect sizeof() is being used, sizeof(file_data->table) is not
correct, it should be sizeof(*file_data->table).

Fixes: 5398ae6985 ("io_uring: clean file_data access in files_register")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Addresses-Coverity: ("Sizeof not portable (SIZEOF_MISMATCH)")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-17 09:25:41 -06:00
Darrick J. Wong
894645546b xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n
Pavel Machek complained that the question about supporting deprecated
XFS v4 comes up even when XFS is disabled.  This clearly makes no sense,
so fix Kconfig.

Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2020-10-16 15:34:28 -07:00
Darrick J. Wong
d88850bd55 xfs: fix high key handling in the rt allocator's query_range function
Fix some off-by-one errors in xfs_rtalloc_query_range.  The highest key
in the realtime bitmap is always one less than the number of rt extents,
which means that the key clamp at the start of the function is wrong.
The 4th argument to xfs_rtfind_forw is the highest rt extent that we
want to probe, which means that passing 1 less than the high key is
wrong.  Finally, drop the rem variable that controls the loop because we
can compare the iteration point (rtstart) against the high key directly.

The sordid history of this function is that the original commit (fb3c3)
incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter
to xfs_rtfind_forw.  This was wrong because the "high key" is supposed
to be the largest key for which the caller wants result rows, not the
key for the first row that could possibly be outside the range that the
caller wants to see.

A subsequent attempt (8ad56) to strengthen the parameter checking added
incorrect clamping of the parameters to the number of rt blocks in the
system (despite the bitmap functions all taking units of rt extents) to
avoid querying ranges past the end of rt bitmap file but failed to fix
the incorrect _rtfind_forw parameter.  The original _rtfind_forw
parameter error then survived the conversion of the startblock and
blockcount fields to rt extents (a0e5c), and the most recent off-by-one
fix (a3a37) thought it was patching a problem when the end of the rt
volume is not in use, but none of these fixes actually solved the
original problem that the author was confused about the "limit" argument
to xfs_rtfind_forw.

Sadly, all four of these patches were written by this author and even
his own usage of this function and rt testing were inadequate to get
this fixed quickly.

Original-problem: fb3c3de2f6 ("xfs: add a couple of queries to iterate free extents in the rtbitmap")
Not-fixed-by: 8ad560d256 ("xfs: strengthen rtalloc query range checks")
Not-fixed-by: a0e5c435ba ("xfs: fix xfs_rtalloc_rec units")
Fixes: a3a374bf18 ("xfs: fix off-by-one error in xfs_rtalloc_query_range")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-16 15:34:28 -07:00
Linus Torvalds
071a0578b0 overlayfs update for 5.10
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCX4n9mwAKCRDh3BK/laaZ
 PB7EAP0cxydeomN0m29SpugawMFxzGpB/GnBEr0Qdonz5BJG7wD9EaF9dsLmGbXY
 Q2P/nbTYmNFC3Kz7xJAZNqmg86AgmQU=
 =o+01
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:

 - Improve performance for certain container setups by introducing a
   "volatile" mode

 - ioctl improvements

 - continue preparation for unprivileged overlay mounts

* tag 'ovl-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: use generic vfs_ioc_setflags_prepare() helper
  ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories
  ovl: rearrange ovl_can_list()
  ovl: enumerate private xattrs
  ovl: pass ovl_fs down to functions accessing private xattrs
  ovl: drop flags argument from ovl_do_setxattr()
  ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs
  ovl: use ovl_do_getxattr() for private xattr
  ovl: fold ovl_getxattr() into ovl_get_redirect_xattr()
  ovl: clean up ovl_getxattr() in copy_up.c
  duplicate ovl_getxattr()
  ovl: provide a mount option "volatile"
  ovl: check for incompatible features in work dir
2020-10-16 15:29:46 -07:00
Linus Torvalds
fad70111d5 afs fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl+JqvsACgkQ+7dXa6fL
 C2sQeRAAiW+rGQa46ybOv8qBLgVEiH6OzNUk62fesbhl5rKRgapS2r2SJbau29gv
 KqBld29lp4o84/THJ6hZ6Al1iwZnO2FW+h53Q47M5TH9AbwZhf/zooRSoGb5AmKJ
 /FR4+K/zTLk7VCSptTtXlKG81bOKQF+tmi6sLdxx70h2T+Ythm3zdVq8PCZZhhGG
 hw1IfuNDNbUDGus5WgZfIdVkdFCs5WW5cEhrPgqKR0mXYQklnkwgtov/+RAh/Ewf
 bZ1JFqap15KJ2AcKgw79NZ01MBJ4KyZckzKwgaTVlIEtCMQMBDhJpbKzdtP767rh
 xkxdzmDmXOop9RNQ8WrIIt6EjpB0We2qxQVGLsPHdEixmlv5n2BjL/wdWa3u/4QZ
 ymjv7B8jcFSpXQzVnrmeNYsgHo1oJ7G8q8Xh+CJm9BNBtNUiz1raQfxWpS5TeWT4
 dQsr/Z/PMCBCBOb6F08pjF8od67DUCx+x1nkCG4qFCeXiC4ouTNkfMZnxWH9mVKP
 v5Hca2E0ilR3PI5+tn8o/ZPul5NDTesBnAvAAEBQDT79Ff8kAgEJ3M3ipy38VoKV
 xFiNpNHwDPp15vzneJ7f7Ir27VauRlhCLEuSUtyYH4pD458K3UQAAQRFZB7bkzMX
 TQTbmEPUbtrs/mOEMKQN6ew8dwawRpMy01j0KnDl0KBFDtfTLU8=
 =0anM
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull afs updates from David Howells:
 "A collection of fixes to fix afs_cell struct refcounting, thereby
  fixing a slew of related syzbot bugs:

   - Fix the cell tree in the netns to use an rwsem rather than RCU.

     There seem to be some problems deriving from the use of RCU and a
     seqlock to walk the rbtree, but it's not entirely clear what since
     there are several different failures being seen.

     Changing things to use an rwsem instead makes it more robust. The
     extra performance derived from using RCU isn't necessary in this
     case since the only time we're looking up a cell is during mount or
     when cells are being manually added.

   - Fix the refcounting by splitting the usage counter into a memory
     refcount and an active users counter. The usage counter was doing
     double duty, keeping track of whether a cell is still in use and
     keeping track of when it needs to be destroyed - but this makes the
     clean up tricky. Separating these out simplifies the logic.

   - Fix purging a cell that has an alias. A cell alias pins the cell
     it's an alias of, but the alias is always later in the list. Trying
     to purge in a single pass causes rmmod to hang in such a case.

   - Fix cell removal. If a cell's manager is requeued whilst it's
     removing itself, the manager will run again and re-remove itself,
     causing problems in various places. Follow Hillf Danton's
     suggestion to insert a more terminal state that causes the manager
     to do nothing post-removal.

  In additional to the above, two other changes:

   - Add a tracepoint for the cell refcount and active users count. This
     helped with debugging the above and may be useful again in future.

   - Downgrade an assertion to a print when a still-active server is
     seen during purging. This was happening as a consequence of
     incomplete cell removal before the servers were cleaned up"

* tag 'afs-fixes-20201016' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Don't assert on unpurgeable server records
  afs: Add tracing for cell refcount and active user count
  afs: Fix cell removal
  afs: Fix cell purging with aliases
  afs: Fix cell refcounting by splitting the usage counter
  afs: Fix rapid cell addition/removal by not using RCU on cells tree
2020-10-16 15:22:41 -07:00
Linus Torvalds
7a3dadedc8 f2fs-for-5.10-rc1
In this round, we've added new features such as zone capacity for ZNS and
 a new GC policy, ATGC, along with in-memory segment management. In addition,
 we could improve the decompression speed significantly by changing virtual
 mapping method. Even though we've fixed lots of small bugs in compression
 support, I feel that it becomes more stable so that I could give it a try in
 production.
 
 Enhancement:
  - suport zone capacity in NVMe Zoned Namespace devices
  - introduce in-memory current segment management
  - add standart casefolding support
  - support age threshold based garbage collection
  - improve decompression speed by changing virtual mapping method
 
 Bug fix:
  - fix condition checks in some ioctl() such as compression, move_range, etc
  - fix 32/64bits support in data structures
  - fix memory allocation in zstd decompress
  - add some boundary checks to avoid kernel panic on corrupted image
  - fix disallowing compression for non-empty file
  - fix slab leakage of compressed block writes
 
 In addition, it includes code refactoring for better readability and minor
 bug fixes for compression and zoned device support.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl+I0nQACgkQQBSofoJI
 UNIJdw//Rj0YapYXSu1nlOQppzSAkCiL6wxrrG2qkisRE7uXSYGZfWBqrMT6Asnf
 i0f25i6ywZOZ00dgN/klZRBh4YYSgJqYx9BPkTxZsQZ/S/EmZJPpr8m4VUB69LKL
 VijwUdgcW9vNDJ2/DkDQDVBd/ZqRxXnltffWtP4pS96Gj089/dE2q8KXqQrt3LM1
 lLQjDfHj+0AyWRzKpErTO0W9DOgO7wmmelS0h6m2RYttkbb328JEZezg5bjWNNlk
 eTXxuAFFy8Ap9DngkC/sqvY2NRTv1YgOPfrT8XWwdDIiFTZ+LoYdFI5Ap/UW7QwG
 iz7B/0wPj3+9ncl536LRbFPiLisbYrArYGmZKF6t8w1cP6mTVlileMT4Q6s9+qhn
 GJS88tTBVlR9vbzDu2brjI6qRQVTBdsohIGoA1g6lz0ogbphhmTzujPbFQ6GTSBi
 3sKKp59urkBpVH3TVJU1oshLjIEG2yToMgYwZH9DU7zlzJS6XpetJrzReqItEThc
 VNixg2DxdIFQ+nrMt+LtWaOHs5qzxIIPksguGEhqSkLL5lI75n2MZxrhKXmUsaZa
 qItJE0ndJFfi6vggIkJID+a0bpTss7+AxF1AmSZDafMLkZy8j14DNQAnmBUYRX0J
 5QExZ+LyyaKjWQ/k9SsSzV1Y3dbguDyB+gkeMhr/6XEr9DSwZc8=
 =exgv
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added new features such as zone capacity for ZNS
  and a new GC policy, ATGC, along with in-memory segment management. In
  addition, we could improve the decompression speed significantly by
  changing virtual mapping method. Even though we've fixed lots of small
  bugs in compression support, I feel that it becomes more stable so
  that I could give it a try in production.

  Enhancements:
   - suport zone capacity in NVMe Zoned Namespace devices
   - introduce in-memory current segment management
   - add standart casefolding support
   - support age threshold based garbage collection
   - improve decompression speed by changing virtual mapping method

  Bug fixes:
   - fix condition checks in some ioctl() such as compression, move_range, etc
   - fix 32/64bits support in data structures
   - fix memory allocation in zstd decompress
   - add some boundary checks to avoid kernel panic on corrupted image
   - fix disallowing compression for non-empty file
   - fix slab leakage of compressed block writes

  In addition, it includes code refactoring for better readability and
  minor bug fixes for compression and zoned device support"

* tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
  f2fs: code cleanup by removing unnecessary check
  f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
  f2fs: fix writecount false positive in releasing compress blocks
  f2fs: introduce check_swap_activate_fast()
  f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case
  f2fs: handle errors of f2fs_get_meta_page_nofail
  f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode
  f2fs: reject CASEFOLD inode flag without casefold feature
  f2fs: fix memory alignment to support 32bit
  f2fs: fix slab leak of rpages pointer
  f2fs: compress: fix to disallow enabling compress on non-empty file
  f2fs: compress: introduce cic/dic slab cache
  f2fs: compress: introduce page array slab cache
  f2fs: fix to do sanity check on segment/section count
  f2fs: fix to check segment boundary during SIT page readahead
  f2fs: fix uninit-value in f2fs_lookup
  f2fs: remove unneeded parameter in find_in_block()
  f2fs: fix wrong total_sections check and fsmeta check
  f2fs: remove duplicated code in sanity_check_area_boundary
  f2fs: remove unused check on version_bitmap
  ...
2020-10-16 15:14:43 -07:00
Linus Torvalds
96685f8666 powerpc updates for 5.10
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting it for
    powerpc, as well as a related fix for sparc.
 
  - Remove support for PowerPC 601.
 
  - Some fixes for watchpoints & addition of a new ptrace flag for detecting ISA
    v3.1 (Power10) watchpoint features.
 
  - A fix for kernels using 4K pages and the hash MMU on bare metal Power9
    systems with > 16TB of RAM, or RAM on the 2nd node.
 
  - A basic idle driver for shallow stop states on Power10.
 
  - Tweaks to our sched domains code to better inform the scheduler about the
    hardware topology on Power9/10, where two SMT4 cores can be presented by
    firmware as an SMT8 core.
 
  - A series doing further reworks & cleanups of our EEH code.
 
  - Addition of a filter for RTAS (firmware) calls done via sys_rtas(), to
    prevent root from overwriting kernel memory.
 
  - Other smaller features, fixes & cleanups.
 
 Thanks to:
   Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Athira Rajeev, Biwen
   Li, Cameron Berkenpas, Cédric Le Goater, Christophe Leroy, Christoph Hellwig,
   Colin Ian King, Daniel Axtens, David Dai, Finn Thain, Frederic Barrat, Gautham
   R. Shenoy, Greg Kurz, Gustavo Romero, Ira Weiny, Jason Yan, Joel Stanley,
   Jordan Niethe, Kajol Jain, Konrad Rzeszutek Wilk, Laurent Dufour, Leonardo
   Bras, Liu Shixin, Luca Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar,
   Nathan Lynch, Nicholas Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver
   O'Halloran, Pedro Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai,
   Qinglang Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott
   Cheloha, Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
   Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
   Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
   Yingliang, zhengbin.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl+JBQoTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgJJAD/0e3tsFP+9rFlxKSJlDcMW3w7kXDRXE
 tG40F1ubYFLU8wtFVR0De3njTRsz5HyaNU6SI8CwPq48mCa7OFn1D1OeHonHXDX9
 w6v3GE2S1uXXQnjm+czcfdjWQut0IwWBLx007/S23WcPff3Abc2irupKLNu+Gx29
 b/yxJHZSRJVX59jSV94HkdJS75mDHQ3oUOlFGXtuGcUZDufpD1ynRcQOjr0V/8JU
 F4WAblFSe7hiczHGqIvfhFVJ+OikEhnj2aEMAL8U7vxzrAZ7RErKCN9s/0Tf0Ktx
 FzNEFNLHZGqh+qNDpKKmM+RnaeO2Lcoc9qVn7vMHOsXPzx9F5LJwkI/DgPjtgAq/
 mFvGnQB/FapATnQeMluViC/qhEe5bQXLUfPP5i2+QOjK0QqwyFlUMgaVNfsY8jRW
 0Q/sNA72Opzst4WUTveCd4SOInlUuat09e5nLooCRLW7u7/jIiXNRSFNvpOiwkfF
 EcIPJsi6FUQ4SNbqpRSNEO9fK5JZrrUtmr0pg8I7fZhHYGcxEjqPR6IWCs3DTsak
 4/KhjhhTnP/IWJRw6qKAyNhEyEwpWqYZ97SIQbvSb1g/bS47AIdQdJRb0eEoRjhx
 sbbnnYFwPFkG4c1yQSIFanT9wNDQ2hFx/c/mRfbd7J+ordx9JsoqXjqrGuhsU/pH
 GttJLmkJ5FH+pQ==
 =akeX
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
   it for powerpc, as well as a related fix for sparc.

 - Remove support for PowerPC 601.

 - Some fixes for watchpoints & addition of a new ptrace flag for
   detecting ISA v3.1 (Power10) watchpoint features.

 - A fix for kernels using 4K pages and the hash MMU on bare metal
   Power9 systems with > 16TB of RAM, or RAM on the 2nd node.

 - A basic idle driver for shallow stop states on Power10.

 - Tweaks to our sched domains code to better inform the scheduler about
   the hardware topology on Power9/10, where two SMT4 cores can be
   presented by firmware as an SMT8 core.

 - A series doing further reworks & cleanups of our EEH code.

 - Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
   to prevent root from overwriting kernel memory.

 - Other smaller features, fixes & cleanups.

Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.

* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
  Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
  selftests/powerpc: Fix eeh-basic.sh exit codes
  cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
  powerpc/time: Make get_tb() common to PPC32 and PPC64
  powerpc/time: Make get_tbl() common to PPC32 and PPC64
  powerpc/time: Remove get_tbu()
  powerpc/time: Avoid using get_tbl() and get_tbu() internally
  powerpc/time: Make mftb() common to PPC32 and PPC64
  powerpc/time: Rename mftbl() to mftb()
  powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
  powerpc/32s: Rename head_32.S to head_book3s_32.S
  powerpc/32s: Setup the early hash table at all time.
  powerpc/time: Remove ifdef in get_dec() and set_dec()
  powerpc: Remove get_tb_or_rtc()
  powerpc: Remove __USE_RTC()
  powerpc: Tidy up a bit after removal of PowerPC 601.
  powerpc: Remove support for PowerPC 601
  powerpc: Remove PowerPC 601
  powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
  powerpc: Remove CONFIG_PPC601_SYNC_FIX
  ...
2020-10-16 12:21:15 -07:00
Linus Torvalds
c4cf498dc0 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "155 patches.

  Subsystems affected by this patch series: mm (dax, debug, thp,
  readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
  core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
  binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
  romfs, and fault-injection"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
  lib, uaccess: add failure injection to usercopy functions
  lib, include/linux: add usercopy failure capability
  ROMFS: support inode blocks calculation
  ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
  sched.h: drop in_ubsan field when UBSAN is in trap mode
  scripts/gdb/tasks: add headers and improve spacing format
  scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
  kernel/relay.c: drop unneeded initialization
  panic: dump registers on panic_on_warn
  rapidio: fix the missed put_device() for rio_mport_add_riodev
  rapidio: fix error handling path
  nilfs2: fix some kernel-doc warnings for nilfs2
  autofs: harden ioctl table
  ramfs: fix nommu mmap with gaps in the page cache
  mm: remove the now-unnecessary mmget_still_valid() hack
  mm/gup: take mmap_lock in get_dump_page()
  binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
  coredump: rework elf/elf_fdpic vma_dump_size() into common helper
  coredump: refactor page range dumping into common helper
  coredump: let dump_emit() bail out on short writes
  ...
2020-10-16 11:31:55 -07:00
Libing Zhou
d9bc85de46 ROMFS: support inode blocks calculation
When use 'stat' tool to display file status, the 'Blocks' field always in
'0', this is not good for tool 'du'(e.g.: busybox 'du'), it always output
'0' size for the files under ROMFS since such tool calculates number of
512B Blocks.

This patch calculates approx.  number of 512B blocks based on inode size.

Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/20200811052606.4243-1-libing.zhou@nokia-sbell.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:22 -07:00
Wang Hai
64ead5201e nilfs2: fix some kernel-doc warnings for nilfs2
Fixes the following W=1 kernel build warning(s):

fs/nilfs2/bmap.c:378: warning: Excess function parameter 'bhp' description in 'nilfs_bmap_assign'
fs/nilfs2/cpfile.c:907: warning: Excess function parameter 'status' description in 'nilfs_cpfile_change_cpmode'
fs/nilfs2/cpfile.c:946: warning: Excess function parameter 'stat' description in 'nilfs_cpfile_get_stat'
fs/nilfs2/page.c:76: warning: Excess function parameter 'inode' description in 'nilfs_forget_buffer'
fs/nilfs2/sufile.c:563: warning: Excess function parameter 'stat' description in 'nilfs_sufile_get_stat'

Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1601386269-2423-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:22 -07:00
Matthew Wilcox
589f6b5268 autofs: harden ioctl table
The table of ioctl functions should be marked const in order to put them
in read-only memory, and we should use array_index_nospec() to avoid
speculation disclosing the contents of kernel memory to userspace.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ian Kent <raven@themaw.net>
Link: https://lkml.kernel.org/r/20200818122203.GO17456@casper.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:22 -07:00
Matthew Wilcox (Oracle)
50b7d85680 ramfs: fix nommu mmap with gaps in the page cache
ramfs needs to check that pages are both physically contiguous and
contiguous in the file.  If the page cache happens to have, eg, page A for
index 0 of the file, no page for index 1, and page A+1 for index 2, then
an mmap of the first two pages of the file will succeed when it should
fail.

Fixes: 642fb4d1f1 ("[PATCH] NOMMU: Provide shared-writable mmap support on ramfs")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Link: https://lkml.kernel.org/r/20200914122239.GO6583@casper.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:22 -07:00
Jann Horn
4d45e75a99 mm: remove the now-unnecessary mmget_still_valid() hack
The preceding patches have ensured that core dumping properly takes the
mmap_lock.  Thanks to that, we can now remove mmget_still_valid() and all
its users.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:22 -07:00
Jann Horn
a07279c9a8 binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_lock, and then use that
snapshot instead of walking the VMA list without locking.

An alternative approach would be to keep the mmap_lock held across the
entire core dumping operation; however, keeping the mmap_lock locked while
we may be blocked for an unbounded amount of time (e.g.  because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because
someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
something like that.

Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in
the future, they'd probably need it, too), implement this with a common
helper in fs/coredump.c.

A downside of this approach is that we now need a bigger amount of kernel
memory per userspace VMA in the normal ELF case, and that we need O(n)
kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
be terribly bad.

There currently is a data race between stack expansion and anything that
reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
mitigate that for core dumping, take the mmap_lock in write mode when
taking a snapshot of the VMA hierarchy.  (If we only took the mmap_lock in
read mode, we could end up with a corrupted core dump if someone does
get_user_pages_remote() concurrently.  Not really a major problem, but
taking the mmap_lock either way works here, so we might as well avoid the
issue.) (This doesn't do anything about the existing data races with stack
expansion in other mm code.)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Jann Horn
429a22e776 coredump: rework elf/elf_fdpic vma_dump_size() into common helper
At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
different code to figure out which VMAs should be dumped, and if so,
whether the dump should contain the entire VMA or just its first page.

Eliminate duplicate code by reworking the binfmt_elf version into a
generic core dumping helper in coredump.c.

As part of that, change the heuristic for detecting executable/library
header pages to check whether the inode is executable instead of looking
at the file mode.

This is less problematic in terms of locking because it lets us avoid
get_user() under the mmap_sem.  (And arguably it looks nicer and makes
more sense in generic code.)

Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
has been written to.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Jann Horn
afc63a97b7 coredump: refactor page range dumping into common helper
Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
pages into the coredump file.  Extract that logic into a common helper.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Jann Horn
df0c09c011 coredump: let dump_emit() bail out on short writes
dump_emit() has a retry loop, but there seems to be no way for that retry
logic to actually be used; and it was also buggy, writing the same data
repeatedly after a short write.

Let's just bail out on a short write.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-3-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Jann Horn
8f942eea12 binfmt_elf_fdpic: stop using dump_emit() on user pointers on !MMU
Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.

At the moment, we have that rather ugly mmget_still_valid() helper to work
around <https://crbug.com/project-zero/1790>: ELF core dumping doesn't
take the mmap_sem while traversing the task's VMAs, and if anything (like
userfaultfd) then remotely messes with the VMA tree, fireworks ensue.  So
at the moment we use mmget_still_valid() to bail out in any writers that
might be operating on a remote mm's VMAs.

With this series, I'm trying to get rid of the need for that as cleanly as
possible.  ("cleanly" meaning "avoid holding the mmap_lock across
unbounded sleeps".)

Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
dumping code.

Patches 5 and 6 implement the main change: Instead of repeatedly accessing
the VMA list with sleeps in between, we snapshot it at the start with
proper locking, and then later we just use our copy of the VMA list.  This
ensures that the kernel won't crash, that VMA metadata in the coredump is
consistent even in the presence of concurrent modifications, and that any
virtual addresses that aren't being concurrently modified have their
contents show up in the core dump properly.

The disadvantage of this approach is that we need a bit more memory during
core dumping for storing metadata about all VMAs.

At the end of the series, patch 7 removes the old workaround for this
issue (mmget_still_valid()).

I have tested:

 - Creating a simple core dump on X86-64 still works.
 - The created coredump on X86-64 opens in GDB and looks plausible.
 - X86-64 core dumps contain the first page for executable mappings at
   offset 0, and don't contain the first page for non-executable file
   mappings or executable mappings at offset !=0.
 - NOMMU 32-bit ARM can still generate plausible-looking core dumps
   through the FDPIC implementation. (I can't test this with GDB because
   GDB is missing some structure definition for nommu ARM, but I've
   poked around in the hexdump and it looked decent.)

This patch (of 7):

dump_emit() is for kernel pointers, and VMAs describe userspace memory.
Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
even if it probably doesn't matter much on !MMU systems - especially given
that it looks like we can just use the same get_dump_page() as on MMU if
we move it out of the CONFIG_MMU block.

One small change we have to make in get_dump_page() is to use
__get_user_pages_locked() instead of __get_user_pages(), since the latter
doesn't exist on nommu.  On mmu builds, __get_user_pages_locked() will
just call __get_user_pages() for us.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Chris Kennelly
ce81bb256a fs/binfmt_elf: use PT_LOAD p_align values for suitable start address
Patch series "Selecting Load Addresses According to p_align", v3.

The current ELF loading mechancism provides page-aligned mappings.  This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.

While specifying -z,max-page-size=0x200000 to the linker will generate
suitably aligned segments for huge pages on x86_64, the executable needs
to be loaded at a suitably aligned address as well.  This alignment
requires the binary's cooperation, as distinct segments need to be
appropriately paddded to be eligible for THP.

For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.

This patch (of 2):

The current ELF loading mechancism provides page-aligned mappings.  This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.

For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.

Tested by verifying program with -Wl,-z,max-page-size=0x200000 loading.

[akpm@linux-foundation.org: fix max() warning]
[ckennelly@google.com: augment comment]
  Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com

Signed-off-by: Chris Kennelly <ckennelly@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com
Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:21 -07:00
Randy Dunlap
ce9bebe683 fs: configfs: delete repeated words in comments
Drop duplicated words {the, that} in comments.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20200811021826.25032-1-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:19 -07:00
Matthew Wilcox (Oracle)
73bb49da50 mm/readahead: make page_cache_ra_unbounded take a readahead_control
Define it in the callers instead of in page_cache_ra_unbounded().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Link: https://lkml.kernel.org/r/20200903140844.14194-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Matthew Wilcox (Oracle)
01c7026705 fs: add a filesystem flag for THPs
The page cache needs to know whether the filesystem supports THPs so that
it doesn't send THPs to filesystems which can't handle them.  Dave Chinner
points out that getting from the page mapping to the filesystem type is
too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
information in the address space flags.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:15 -07:00