should've been done as soon as overlayfs stopped messing with fake
paths...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
always equal to ->dentry->d_inode of the path argument these
days.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmXwt3cUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXObOhAAqldn1nbYS/t1D/k/9ZN/PtSQetK4
S58D8+gB59Sg0daWFaRhCwwShIbXS/6XzhqaVb3iAPptJs0YDFMbWLAW2d+dd69K
/7C8diguHbuJdEnCJtFYQIVinavaYVRlyoQcO8uwTz8uvTgXPOhr2P9NcOApJXcR
xqttuADVo/9Zn0O9/+GUPCH0ROL0SMnuUjwdVP3bpPHj9zEk8F1/A6chzTeSLJru
Y4+cRrN/r0JTkvRqPdnF9LSvxK7mtAEaHkKGeLQbw0O5pv3r3w0EWMJvq+uonGU2
WX0eR5VMfevkFMUdw8FKOTa+OZ0HJ2KKIb4sB4wDMgeGyov7Z6SxgvFeQiSyD3aB
QnyfLDzeEuPfousxUd45dUDnsWNnSgFF+JAdi0LSzm5hMuLeQDozTsFmh0orQcX1
L5A6VtAbSPP0ffl+tuPi48q3P3LlSjMP0B8W20NXFYhXukKXCgXVMr/dEvpwpu1m
o1glviGIXeLQQSnX3lMWb7Ds2igmCtXPrqkdu2vpRhMp0od6n4R4jH73Aj5MeSQn
n3sP73dg5sAaMjtI2NOisMeFUp09MMlOumCCM+AIplPXremm1kwgKRTIp0rKsLW9
VoQPXa43LQc3hAgPrpGuE+4yBfaBUq7Z8I37IFER/2y4K8b9YkduW4kDh7OdRz+d
iQ4Nnu2lR/+CCH0=
=0mTM
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Promote IMA/EVM to a proper LSM
This is the bulk of the diffstat, and the source of all the changes
in the VFS code. Prior to the start of the LSM stacking work it was
important that IMA/EVM were separate from the rest of the LSMs,
complete with their own hooks, infrastructure, etc. as it was the
only way to enable IMA/EVM at the same time as a LSM.
However, now that the bulk of the LSM infrastructure supports
multiple simultaneous LSMs, we can simplify things greatly by
bringing IMA/EVM into the LSM infrastructure as proper LSMs. This is
something I've wanted to see happen for quite some time and Roberto
was kind enough to put in the work to make it happen.
- Use the LSM hook default values to simplify the call_int_hook() macro
Previously the call_int_hook() macro required callers to supply a
default return value, despite a default value being specified when
the LSM hook was defined.
This simplifies the macro by using the defined default return value
which makes life easier for callers and should also reduce the number
of return value bugs in the future (we've had a few pop up recently,
hence this work).
- Use the KMEM_CACHE() macro instead of kmem_cache_create()
The guidance appears to be to use the KMEM_CACHE() macro when
possible and there is no reason why we can't use the macro, so let's
use it.
- Fix a number of comment typos in the LSM hook comment blocks
Not much to say here, we fixed some questionable grammar decisions in
the LSM hook comment blocks.
* tag 'lsm-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (28 commits)
cred: Use KMEM_CACHE() instead of kmem_cache_create()
lsm: use default hook return value in call_int_hook()
lsm: fix typos in security/security.c comment headers
integrity: Remove LSM
ima: Make it independent from 'integrity' LSM
evm: Make it independent from 'integrity' LSM
evm: Move to LSM infrastructure
ima: Move IMA-Appraisal to LSM infrastructure
ima: Move to LSM infrastructure
integrity: Move integrity_kernel_module_request() to IMA
security: Introduce key_post_create_or_update hook
security: Introduce inode_post_remove_acl hook
security: Introduce inode_post_set_acl hook
security: Introduce inode_post_create_tmpfile hook
security: Introduce path_post_mknod hook
security: Introduce file_release hook
security: Introduce file_post_open hook
security: Introduce inode_post_removexattr hook
security: Introduce inode_post_setattr hook
security: Align inode_setattr hook definition with EVM
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmXuD/AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsojEACNlJKqsebZv24szCR5ViBGqoDi/A5v5vZv
1p7f0sVgpwFLuDu3CCb9IG1tuAiuhBa5yvBKKpyGuGglQd+7Sxqsgdc2Bv/76D7S
Ej/fc1x5dxuvAvAetYk4yH2idPhYIBVIx3g2oz44bO4Ur3jFZ/yXzp+JtuKEuTba
7kQmAXfN7c497XDsmSv1eJM/+D/LKjmvjqMX2gnXprw2qPgdAklXcUSnBYaS2JEt
o4HGWAImJOV416d7QkOWgKfk6ksJbO3lFzQ6R+JdQCl6KVqc0+5u0oT06ZGVpSUf
fQqfcV+cJw41dQB47Qr017ku0EdDI19L3YpL9/WOnNMBM421j1QER1cKiKfiHD2B
LCOn+tvunxcGMzYonAFfgSF4XXFJWSK33TpvmmVsU3w0+YSC9oIqFfCxOdHuAJqB
tHSuGHgzkufgqhNIQWHiWZEJJUW+MO4Dv2rUV6n+dfCz6JQG48Gs9clDv/tAEY4U
4NzErfYLCsWlNaMPQK1f/b9dWjBXAnpJA4yq8jPyYB3GqjnVuX3Ze14UfwOWgv0B
E++qgPsh30ShbP/NRHqS9tNQC2hIy27x/jzpTyKwxuoSs/nyeZg7lFXIPaQQo7wt
GZhGzsMasbhoylqblB171NFlxpRetY9aYvHZ3OfUP4xAt1THVOzR6hZrBurOKMv/
e8FBGBh/cg==
=Hy//
-----END PGP SIGNATURE-----
Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Make running of task_work internal loops more fair, and unify how the
different methods deal with them (me)
- Support for per-ring NAPI. The two minor networking patches are in a
shared branch with netdev (Stefan)
- Add support for truncate (Tony)
- Export SQPOLL utilization stats (Xiaobing)
- Multishot fixes (Pavel)
- Fix for a race in manipulating the request flags via poll (Pavel)
- Cleanup the multishot checking by making it generic, moving it out of
opcode handlers (Pavel)
- Various tweaks and cleanups (me, Kunwu, Alexander)
* tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits)
io_uring: Fix sqpoll utilization check racing with dying sqpoll
io_uring/net: dedup io_recv_finish req completion
io_uring: refactor DEFER_TASKRUN multishot checks
io_uring: fix mshot io-wq checks
io_uring/net: add io_req_msg_cleanup() helper
io_uring/net: simplify msghd->msg_inq checking
io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE
io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io
io_uring/net: correctly handle multishot recvmsg retry setup
io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler
io_uring: fix io_queue_proc modifying req->flags
io_uring: fix mshot read defer taskrun cqe posting
io_uring/net: fix overflow check in io_recvmsg_mshot_prep()
io_uring/net: correct the type of variable
io_uring/sqpoll: statistics of the true utilization of sq threads
io_uring/net: move recv/recvmsg flags out of retry loop
io_uring/kbuf: flag request if buffer pool is empty after buffer pick
io_uring/net: improve the usercopy for sendmsg/recvmsg
io_uring/net: move receive multishot out of the generic msghdr path
io_uring/net: unify how recvmsg and sendmsg copy in the msghdr
...
Move hardcoded IMA function calls (not appraisal-specific functions) from
various places in the kernel to the LSM infrastructure, by introducing a
new LSM named 'ima' (at the end of the LSM list and always enabled like
'integrity').
Having IMA before EVM in the Makefile is sufficient to preserve the
relative order of the new 'ima' LSM in respect to the upcoming 'evm' LSM,
and thus the order of IMA and EVM function calls as when they were
hardcoded.
Make moved functions as static (except ima_post_key_create_or_update(),
which is not in ima_main.c), and register them as implementation of the
respective hooks in the new function init_ima_lsm().
Select CONFIG_SECURITY_PATH, to ensure that the path-based LSM hook
path_post_mknod is always available and ima_post_path_mknod() is always
executed to mark files as new, as before the move.
A slight difference is that IMA and EVM functions registered for the
inode_post_setattr, inode_post_removexattr, path_post_mknod,
inode_post_create_tmpfile, inode_post_set_acl and inode_post_remove_acl
won't be executed for private inodes. Since those inodes are supposed to be
fs-internal, they should not be of interest to IMA or EVM. The S_PRIVATE
flag is used for anonymous inodes, hugetlbfs, reiserfs xattrs, XFS scrub
and kernel-internal tmpfs files.
Conditionally register ima_post_key_create_or_update() if
CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS is enabled. Also, conditionally register
ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS is enabled.
Finally, add the LSM_ID_IMA case in lsm_list_modules_test.c.
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
do_sys_ftruncate receives a file descriptor, fgets the struct file, and
finally actually truncates the file.
do_ftruncate allows for passing in a file directly, with the caller
already holding a reference to it.
Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20240202121724.17461-2-tony.solomonik@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In a future patch, we're going to split file leases into their own
structure. Since a lot of the underlying machinery uses the same fields
move those into a new file_lock_core, and embed that inside struct
file_lock.
For now, add some macros to ensure that we can continue to build while
the conversion is in progress.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20240131-flsplit-v3-17-c6129007ee8d@kernel.org
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUzXQAKCRCRxhvAZXjc
ogOtAQDpqUp1zY4dV/dZisCJ5xarZTsSZ1AvgmcxZBtS0NhbdgEAshWvYGA9ryS/
ChL5jjtjjZDLhRA//reoFHTQIrdp2w8=
=bF+R
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs rw updates from Christian Brauner:
"This contains updates from Amir for read-write backing file helpers
for stacking filesystems such as overlayfs:
- Fanotify is currently in the process of introducing pre content
events. Roughly, a new permission event will be added indicating
that it is safe to write to the file being accessed. These events
are used by hierarchical storage managers to e.g., fill the content
of files on first access.
During that work we noticed that our current permission checking is
inconsistent in rw_verify_area() and remap_verify_area().
Especially in the splice code permission checking is done multiple
times. For example, one time for the whole range and then again for
partial ranges inside the iterator.
In addition, we mostly do permission checking before we call
file_start_write() except for a few places where we call it after.
For pre-content events we need such permission checking to be done
before file_start_write(). So this is a nice reason to clean this
all up.
After this series, all permission checking is done before
file_start_write().
As part of this cleanup we also massaged the splice code a bit. We
got rid of a few helpers because we are alredy drowning in special
read-write helpers. We also cleaned up the return types for splice
helpers.
- Introduce generic read-write helpers for backing files. This lifts
some overlayfs code to common code so it can be used by the FUSE
passthrough work coming in over the next cycles. Make Amir and
Miklos the maintainers for this new subsystem of the vfs"
* tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits)
fs: fix __sb_write_started() kerneldoc formatting
fs: factor out backing_file_mmap() helper
fs: factor out backing_file_splice_{read,write}() helpers
fs: factor out backing_file_{read,write}_iter() helpers
fs: prepare for stackable filesystems backing file helpers
fsnotify: optionally pass access range in file permission hooks
fsnotify: assert that file_start_write() is not held in permission hooks
fsnotify: split fsnotify_perm() into two hooks
fs: use splice_copy_file_range() inline helper
splice: return type ssize_t from all helpers
fs: use do_splice_direct() for nfsd/ksmbd server-side-copy
fs: move file_start_write() into direct_splice_actor()
fs: fork splice_file_range() from do_splice_direct()
fs: create {sb,file}_write_not_started() helpers
fs: create file_write_started() helper
fs: create __sb_write_started() helper
fs: move kiocb_start_write() into vfs_iocb_iter_write()
fs: move permission hook out of do_iter_read()
fs: move permission hook out of do_iter_write()
fs: move file_start_write() into vfs_iter_write()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUxRQAKCRCRxhvAZXjc
ov/QAQDzvge3oQ9MEymmOiyzzcF+HhAXBr+9oEsYJjFc1p0TsgEA61gXjZo7F1jY
KBqd6znOZCR+Waj0kIVJRAo/ISRBqQc=
=0bRl
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Add Jan Kara as VFS reviewer
- Show correct device and inode numbers in proc/<pid>/maps for vma
files on stacked filesystems. This is now easily doable thanks to
the backing file work from the last cycles. This comes with
selftests
Cleanups:
- Remove a redundant might_sleep() from wait_on_inode()
- Initialize pointer with NULL, not 0
- Clarify comment on access_override_creds()
- Rework and simplify eventfd_signal() and eventfd_signal_mask()
helpers
- Process aio completions in batches to avoid needless wakeups
- Completely decouple struct mnt_idmap from namespaces. We now only
keep the actual idmapping around and don't stash references to
namespaces
- Reformat maintainer entries to indicate that a given subsystem
belongs to fs/
- Simplify fput() for files that were never opened
- Get rid of various pointless file helpers
- Rename various file helpers
- Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
last cycle
- Make relatime_need_update() return bool
- Use GFP_KERNEL instead of GFP_USER when allocating superblocks
- Replace deprecated ida_simple_*() calls with their current ida_*()
counterparts
Fixes:
- Fix comments on user namespace id mapping helpers. They aren't
kernel doc comments so they shouldn't be using /**
- s/Retuns/Returns/g in various places
- Add missing parameter documentation on can_move_mount_beneath()
- Rename i_mapping->private_data to i_mapping->i_private_data
- Fix a false-positive lockdep warning in pipe_write() for watch
queues
- Improve __fget_files_rcu() code generation to improve performance
- Only notify writer that pipe resizing has finished after setting
pipe->max_usage otherwise writers are never notified that the pipe
has been resized and hang
- Fix some kernel docs in hfsplus
- s/passs/pass/g in various places
- Fix kernel docs in ntfs
- Fix kcalloc() arguments order reported by gcc 14
- Fix uninitialized value in reiserfs"
* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
reiserfs: fix uninit-value in comp_keys
watch_queue: fix kcalloc() arguments order
ntfs: dir.c: fix kernel-doc function parameter warnings
fs: fix doc comment typo fs tree wide
selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
fs/proc: show correct device and inode numbers in /proc/pid/maps
eventfd: Remove usage of the deprecated ida_simple_xx() API
fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
fs/hfsplus: wrapper.c: fix kernel-doc warnings
fs: add Jan Kara as reviewer
fs/inode: Make relatime_need_update return bool
pipe: wakeup wr_wait after setting max_usage
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
...
In preparation for factoring out some backing file io helpers from
overlayfs, move backing_file_open() into a new file fs/backing-file.c
and header.
Add a MAINTAINERS entry for stackable filesystems and add a Kconfig
FS_STACK which stackable filesystems need to select.
For now, the backing_file struct, the backing_file alloc/free functions
and the backing_file_real_path() accessor remain internal to file_table.c.
We may change that in the future.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Bring in the changes to the file infrastructure for this cycle. Mostly
cleanups and some performance tweaks.
* file: remove __receive_fd()
* file: stop exposing receive_fd_user()
* fs: replace f_rcuhead with f_task_work
* file: remove pointless wrapper
* file: s/close_fd_get_file()/file_close_fd()/g
* Improve __fget_files_rcu() code generation (and thus __fget_light())
* file: massage cleanup of files that failed to open
Signed-off-by: Christian Brauner <brauner@kernel.org>
This code is rarely (never?) enabled by distros, and it hasn't caught
anything in decades. Let's kill off this legacy debug code.
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for pre-content permission events with file access range,
move fsnotify_file_perm() hook out of security_file_permission() and into
the callers.
Callers that have the access range information call the new hook
fsnotify_file_area_perm() with the access range.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231212094440.250945-6-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
That really shouldn't have "get" in there as that implies we're bumping
the reference count which we don't do at all. We used to but not anmore.
Now we're just closing the fd and pick that file from the fdtable
without bumping the reference count. Update the wrong documentation
while at it.
Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-1-e73ca6f4ea83@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
A backing file struct stores two path's, one "real" path that is referring
to f_inode and one "fake" path, which should be displayed to users in
/proc/<pid>/maps.
There is a lot more potential code that needs to know the "real" path, then
code that needs to know the "fake" path.
Instead of code having to request the "real" path with file_real_path(),
store the "real" path in f_path and require code that needs to know the
"fake" path request it with file_user_path().
Replace the file_real_path() helper with a simple const accessor f_path().
After this change, file_dentry() is not expected to observe any files
with overlayfs f_path and real f_inode, so the call to ->d_real() should
not be needed. Leave the ->d_real() call for now and add an assertion
in ovl_d_real() to catch if we made wrong assumptions.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-4-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
A writeable mapped backing file can perform writes to the real inode.
Therefore, the real path mount must be kept writable so long as the
writable map exists.
This may not be strictly needed for ovelrayfs private upper mount,
but it is correct to take the mnt_writers count in the vfs helper.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20231009153712.1566422-2-amir73il@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Before exporting these helpers to modules, make their names more
meaningful.
The names mnt_{get,put)_write_access*() were chosen, because they rhyme
with the inode {get,put)_write_access() helpers, which have a very close
meaning for the inode object.
Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230817-anfechtbar-ruhelosigkeit-8c6cca8443fc@brauner/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230908132900.2983519-2-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT7QAKCRCRxhvAZXjc
ort3AP0VIK/oJk5skgjpinQrCfvtVz0XOtawuBtn0f1weIfb6AD9Hg1rqOKnQD5z
dkvn3xaEr3gPOVzqU5SvFwVoCM0cMwA=
=24Ha
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fchmodat2 system call from Christian Brauner:
"This adds the fchmodat2() system call. It is a revised version of the
fchmodat() system call, adding a missing flag argument. Support for
both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included.
Adding this system call revision has been a longstanding request but
so far has always fallen through the cracks. While the kernel
implementation of fchmodat() does not have a flag argument the libc
provided POSIX-compliant fchmodat(3) version does. Both glibc and musl
have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW
(see [1] and [2]).
The workaround is brittle because it relies not just on O_PATH and
O_NOFOLLOW semantics and procfs magic links but also on our rather
inconsistent symlink semantics.
This gives userspace a proper fchmodat2() system call that libcs can
use to properly implement fchmodat(3) and allows them to get rid of
their hacks. In this case it will immediately benefit them as the
current workaround is already defunct because of aformentioned
inconsistencies.
In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use
AT_EMPTY_PATH with fchmodat2(). This is already possible with
fchownat() so there's no reason to not also support it for
fchmodat2().
The implementation is simple and comes with selftests. Implementation
of the system call and wiring up the system call are done as separate
patches even though they could arguably be one patch. But in case
there are merge conflicts from other system call additions it can be
beneficial to have separate patches"
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2]
* tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: fchmodat2: remove duplicate unneeded defines
fchmodat2: add support for AT_EMPTY_PATH
selftests: Add fchmodat2 selftest
arch: Register fchmodat2, usually as syscall 452
fs: Add fchmodat2()
Non-functional cleanup of a "__user * filename"
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTxQAKCRCRxhvAZXjc
okaVAP94WAlItvDRt/z2Wtzf0+RqPZeTXEdGTxua8+RxqCyYIQD+OO5nRfKQPHlV
AqqGJMKItQMSMIYgB5ftqVhNWZfnHgM=
=pSEW
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual filesystems.
Features:
- Block mode changes on symlinks and rectify our broken semantics
- Report file modifications via fsnotify() for splice
- Allow specifying an explicit timeout for the "rootwait" kernel
command line option. This allows to timeout and reboot instead of
always waiting indefinitely for the root device to show up
- Use synchronous fput for the close system call
Cleanups:
- Get rid of open-coded lockdep workarounds for async io submitters
and replace it all with a single consolidated helper
- Simplify epoll allocation helper
- Convert simple_write_begin and simple_write_end to use a folio
- Convert page_cache_pipe_buf_confirm() to use a folio
- Simplify __range_close to avoid pointless locking
- Disable per-cpu buffer head cache for isolated cpus
- Port ecryptfs to kmap_local_page() api
- Remove redundant initialization of pointer buf in pipe code
- Unexport the d_genocide() function which is only used within core
vfs
- Replace printk(KERN_ERR) and WARN_ON() with WARN()
Fixes:
- Fix various kernel-doc issues
- Fix refcount underflow for eventfds when used as EFD_SEMAPHORE
- Fix a mainly theoretical issue in devpts
- Check the return value of __getblk() in reiserfs
- Fix a racy assert in i_readcount_dec
- Fix integer conversion issues in various functions
- Fix LSM security context handling during automounts that prevented
NFS superblock sharing"
* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
cachefiles: use kiocb_{start,end}_write() helpers
ovl: use kiocb_{start,end}_write() helpers
aio: use kiocb_{start,end}_write() helpers
io_uring: use kiocb_{start,end}_write() helpers
fs: create kiocb_{start,end}_write() helpers
fs: add kerneldoc to file_{start,end}_write() helpers
io_uring: rename kiocb_end_write() local helper
splice: Convert page_cache_pipe_buf_confirm() to use a folio
libfs: Convert simple_write_begin and simple_write_end to use a folio
fs/dcache: Replace printk and WARN_ON by WARN
fs/pipe: remove redundant initialization of pointer buf
fs: Fix kernel-doc warnings
devpts: Fix kernel-doc warnings
doc: idmappings: fix an error and rephrase a paragraph
init: Add support for rootwait timeout parameter
vfs: fix up the assert in i_readcount_dec
fs: Fix one kernel-doc comment
docs: filesystems: idmappings: clarify from where idmappings are taken
fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
...
These have a variety of causes and a corresponding variety of solutions.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Message-Id: <20230818200824.2720007-1-willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.
Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).
Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on the Sapphire
Rapid CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)
Patched result is the same regardless of rseq as the codepath is avoided.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
O_TMPFILE is actually __O_TMPFILE|O_DIRECTORY. This means that the old
fast-path check for RESOLVE_CACHED would reject all users passing
O_DIRECTORY with -EAGAIN, when in fact the intended test was to check
for __O_TMPFILE.
Cc: stable@vger.kernel.org # v5.12+
Fixes: 99668f6180 ("fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Message-Id: <20230806-resolve_cached-o_tmpfile-v1-1-7ba16308465e@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
This allows userspace to avoid going through /proc/self/fd when dealing
with all types of file descriptors for chmod(), and makes fchmodat2() a
proper superset of all other chmod syscalls.
The primary difference between fchmodat2(AT_EMPTY_PATH) and fchmod() is
that fchmod() doesn't operate on O_PATH file descriptors by design. To
quote open(2):
> O_PATH (since Linux 2.6.39)
> [...]
> The file itself is not opened, and other file operations (e.g.,
> read(2), write(2), fchmod(2), fchown(2), fgetxattr(2), ioctl(2),
> mmap(2)) fail with the error EBADF.
However, procfs has allowed userspace to do this operation ever since
the introduction of O_PATH through magic-links, so adding this feature
is only an improvement for programs that have to mess around with
/proc/self/fd/$n today to get this behaviour. In addition,
fchownat(AT_EMPTY_PATH) has existed since the introduction of O_PATH and
allows chown() operations directly on O_PATH descriptors.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
Message-Id: <20230728-fchmodat2-at_empty_path-v1-1-f3add31d3516@cyphar.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
On the userspace side fchmodat(3) is implemented as a wrapper
function which implements the POSIX-specified interface. This
interface differs from the underlying kernel system call, which does not
have a flags argument. Most implementations require procfs [1][2].
There doesn't appear to be a good userspace workaround for this issue
but the implementation in the kernel is pretty straight-forward.
The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag,
unlike existing fchmodat.
[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
[2] https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28
Co-developed-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Message-Id: <f2a846ef495943c5d101011eebcf01179d0c7b61.1689092120.git.legion@kernel.org>
[brauner: pre reviews, do flag conversion in do_fchmodat() directly]
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmScTY8ACgkQnJ2qBz9k
QNnfwwgAhZtow1klDLH6qCnWMufB6AwT7VfAHyA3fvyTYMjUKb0sGHkpuh8hqVOb
Lzb4YB+jSWV8XnMFn/4gFJQU/nAv8bMPavghMGpr5VNjQi7WkxYF/GB6O1I5NOHK
EnJjDExgdxXDJZORaaXLVJWrtzJuDFgdiSeIwJECFa0MdTHNgPy3XOl+PPxnYQ/V
xyHyP5ImGgd5O4iy3PFDQBGgOXIMrBX8IMce+qLQNYIvjSIUgmdnIkoUCvsQiisp
LyKI2LxqAqnpA4h4Ow6hOZDw2VlPT0vDwFVUfFIZMIqs5YgaSbWa1Z6cs37MigAn
fgUyRVx2y8A2Lwla7rwLaUEToRVADw==
=ZdcG
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
- Support for fanotify events returning file handles for filesystems
not exportable via NFS
- Improved error handling exportfs functions
- Add missing FS_OPEN events when unusual open helpers are used
* tag 'fsnotify_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: move fsnotify_open() hook into do_dentry_open()
exportfs: check for error return value from exportfs_encode_*()
fanotify: support reporting non-decodeable file handles
exportfs: allow exporting non-decodeable file handles to userspace
exportfs: add explicit flag to request non-decodeable file handles
exportfs: change connectable argument to bit flags
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4WgAKCRCRxhvAZXjc
oofvAQDs9RJwQUyWHJmQA+tWz5cUE5DviVWCwwul5dQRRCqgaQEA2OIO0gPFaVoq
1OYOeLyUjl/cpS8e3u4uJtw34jttdQA=
=AwcR
-----END PGP SIGNATURE-----
Merge tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file handling updates from Christian Brauner:
"This contains Amir's work to fix a long-standing problem where an
unprivileged overlayfs mount can be used to avoid fanotify permission
events that were requested for an inode or superblock on the
underlying filesystem.
Some background about files opened in overlayfs. If a file is opened
in overlayfs @file->f_path will refer to a "fake" path. What this
means is that while @file->f_inode will refer to inode of the
underlying layer, @file->f_path refers to an overlayfs
{dentry,vfsmount} pair. The reasons for doing this are out of scope
here but it is the reason why the vfs has been providing the
open_with_fake_path() helper for overlayfs for very long time now. So
nothing new here.
This is for sure not very elegant and everyone including the overlayfs
maintainers agree. Improving this significantly would involve more
fragile and potentially rather invasive changes.
In various codepaths access to the path of the underlying filesystem
is needed for such hybrid file. The best example is fsnotify where
this becomes security relevant. Passing the overlayfs
@file->f_path->dentry will cause fsnotify to skip generating fsnotify
events registered on the underlying inode or superblock.
To fix this we extend the vfs provided open_with_fake_path() concept
for overlayfs to create a backing file container that holds the real
path and to expose a helper that can be used by relevant callers to
get access to the path of the underlying filesystem through the new
file_real_path() helper. This pattern is similar to what we do in
d_real() and d_real_inode().
The first beneficiary is fsnotify and fixes the security sensitive
problem mentioned above.
There's a couple of nice cleanups included as well.
Over time, the old open_with_fake_path() helper added specifically for
overlayfs a long time ago started to get used in other places such as
cachefiles. Even though cachefiles have nothing to do with hybrid
files.
The only reason cachefiles used that concept was that files opened
with open_with_fake_path() aren't charged against the caller's open
file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that
both overlayfs and cachefiles need to ensure to not overcharge the
caller for their internal open calls.
So this work disentangles FMODE_NOACCOUNT use cases and backing file
use-cases by adding the FMODE_BACKING flag which indicates that the
file can be used to retrieve the backing file of another filesystem.
(Fyi, Jens will be sending you a really nice cleanup from Christoph
that gets rid of 3 FMODE_* flags otherwise this would be the last
fmode_t bit we'd be using.)
So now overlayfs becomes the sole user of the renamed
open_with_fake_path() helper which is now named backing_file_open().
For internal kernel users such as cachefiles that are only interested
in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new
kernel_file_open() helper which opens a file without being charged
against the caller's open file limit. All new helpers are properly
documented and clearly annotated to mention their special uses.
We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly
distinguish it from vfs_tmpfile() and align it the other kernel_*()
internal helpers"
* tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ovl: enable fsnotify events on underlying real files
fs: use backing_file container for internal files with "fake" f_path
fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
fs: use a helper for opening kernel internal files
fs: rename {vfs,kernel}_tmpfile_open()
Overlayfs uses open_with_fake_path() to allocate internal kernel files,
with a "fake" path - whose f_path is not on the same fs as f_inode.
Allocate a container struct backing_file for those internal files, that
is used to hold the "fake" ovl path along with the real path.
backing_file_real_path() can be used to access the stored real path.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230615112229.2143178-5-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile
without accounting for nr_files.
cachefiles uses open_with_fake_path() for the same reason without the
need for a fake path.
Fork open_with_fake_path() to kernel_file_open() which only does the
noaccount part and use it in cachefiles.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230615112229.2143178-3-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
fsnotify_open() hook is called only from high level system calls
context and not called for the very many helpers to open files.
This may makes sense for many of the special file open cases, but it is
inconsistent with fsnotify_close() hook that is called for every last
fput() of on a file object with FMODE_OPENED.
As a result, it is possible to observe ACCESS, MODIFY and CLOSE events
without ever observing an OPEN event.
Fix this inconsistency by replacing all the fsnotify_open() hooks with
a single hook inside do_dentry_open().
If there are special cases that would like to opt-out of the possible
overhead of fsnotify() call in fsnotify_open(), they would probably also
want to avoid the overhead of fsnotify() call in the rest of the fsnotify
hooks, so they should be opening that file with the __FMODE_NONOTIFY flag.
However, in the majority of those cases, the s_fsnotify_connectors
optimization in fsnotify_parent() would be sufficient to avoid the
overhead of fsnotify() call anyway.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>
Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY to fixes
the following sparce warnings:
fs/overlayfs/file.c:48:37: sparse: warning: restricted fmode_t degrades to integer
fs/overlayfs/file.c:128:13: sparse: warning: restricted fmode_t degrades to integer
fs/open.c:1159:21: sparse: warning: restricted fmode_t degrades to integer
Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Message-Id: <20230502232210.119063-1-minhuadotchen@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
fs/open.c: In functions 'setattr_vfsuid' and 'setattr_vfsgid':
warning: Function parameter or member 'attr' not described
- Fix warning by removing kernel-doc for these as they are static
inline functions and not required to be exposed via kernel-doc.
fs/open.c:
warning: Excess function parameter 'opened' description in 'finish_open'
warning: Excess function parameter 'cred' description in 'vfs_open'
- Fix by removing the parameters from the kernel-doc as they are no
longer required by the function.
Signed-off-by: Anuradha Weeraman <anuradha@debian.org>
Message-Id: <20230506182928.384105-1-anuradha@debian.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
After a couple of years and multiple LTS releases we received a report
that the behavior of O_DIRECTORY | O_CREAT changed starting with v5.7.
On kernels prior to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
had the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: create regular file
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: create regular file
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
On kernels since to v5.7 combinations of O_DIRECTORY, O_CREAT, O_EXCL
have the following semantics:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: EISDIR
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: ENOTDIR (create regular file)
* d exists and is a regular file: EEXIST
* d exists and is a directory: EEXIST
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
This is a fairly substantial semantic change that userspace didn't
notice until Pedro took the time to deliberately figure out corner
cases. Since no one noticed this breakage we can somewhat safely assume
that O_DIRECTORY | O_CREAT combinations are likely unused.
The v5.7 breakage is especially weird because while ENOTDIR is returned
indicating failure a regular file is actually created. This doesn't make
a lot of sense.
Time was spent finding potential users of this combination. Searching on
codesearch.debian.net showed that codebases often express semantical
expectations about O_DIRECTORY | O_CREAT which are completely contrary
to what our code has done and currently does.
The expectation often is that this particular combination would create
and open a directory. This suggests users who tried to use that
combination would stumble upon the counterintuitive behavior no matter
if pre-v5.7 or post v5.7 and quickly realize neither semantics give them
what they want. For some examples see the code examples in [1] to [3]
and the discussion in [4].
There are various ways to address this issue. The lazy/simple option
would be to restore the pre-v5.7 behavior and to just live with that bug
forever. But since there's a real chance that the O_DIRECTORY | O_CREAT
quirk isn't relied upon we should try to get away with murder(ing bad
semantics) first. If we need to Frankenstein pre-v5.7 behavior later so
be it.
So let's simply return EINVAL categorically for O_DIRECTORY | O_CREAT
combinations. In addition to cleaning up the old bug this also opens up
the possiblity to make that flag combination do something more intuitive
in the future.
Starting with this commit the following semantics apply:
(1) open("/tmp/d", O_DIRECTORY | O_CREAT)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(2) open("/tmp/d", O_DIRECTORY | O_CREAT | O_EXCL)
* d doesn't exist: EINVAL
* d exists and is a regular file: EINVAL
* d exists and is a directory: EINVAL
(3) open("/tmp/d", O_DIRECTORY | O_EXCL)
* d doesn't exist: ENOENT
* d exists and is a regular file: ENOTDIR
* d exists and is a directory: open directory
One additional note, O_TMPFILE is implemented as:
#define __O_TMPFILE 020000000
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
For older kernels it was important to return an explicit error when
O_TMPFILE wasn't supported. So O_TMPFILE requires that O_DIRECTORY is
raised alongside __O_TMPFILE. It also enforced that O_CREAT wasn't
specified. Since O_DIRECTORY | O_CREAT could be used to create a regular
allowing that combination together with __O_TMPFILE would've meant that
false positives were possible, i.e., that a regular file was created
instead of a O_TMPFILE. This could've been used to trick userspace into
thinking it operated on a O_TMPFILE when it wasn't.
Now that we block O_DIRECTORY | O_CREAT completely the check for O_CREAT
in the __O_TMPFILE branch via if ((flags & O_TMPFILE_MASK) != O_TMPFILE)
can be dropped. Instead we can simply check verify that O_DIRECTORY is
raised via if (!(flags & O_DIRECTORY)) and explain this in two comments.
As Aleksa pointed out O_PATH is unaffected by this change since it
always returned EINVAL if O_CREAT was specified - with or without
O_DIRECTORY.
Link: https://lore.kernel.org/lkml/20230320071442.172228-1-pedro.falcato@gmail.com
Link: https://sources.debian.org/src/flatpak/1.14.4-1/subprojects/libglnx/glnx-dirfd.c/?hl=324#L324 [1]
Link: https://sources.debian.org/src/flatpak-builder/1.2.3-1/subprojects/libglnx/glnx-shutil.c/?hl=251#L251 [2]
Link: https://sources.debian.org/src/ostree/2022.7-2/libglnx/glnx-dirfd.c/?hl=324#L324 [3]
Link: https://www.openwall.com/lists/oss-security/2014/11/26/14 [4]
Reported-by: Pedro Falcato <pedro.falcato@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
access(2) remains commonly used, for example on exec:
access("/etc/ld.so.preload", R_OK)
or when running gcc: strace -c gcc empty.c
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 42 26 access
It falls down to do_faccessat without the AT_EACCESS flag, which in turn
results in allocation of new creds in order to modify fsuid/fsgid and
caps. This is a very expensive process single-threaded and most notably
multi-threaded, with numerous structures getting refed and unrefed on
imminent new cred destruction.
Turns out for typical consumers the resulting creds would be identical
and this can be checked upfront, avoiding the hard work.
An access benchmark plugged into will-it-scale running on Cascade Lake
shows:
test proc before after
access1 1 1310582 2908735 (+121%) # distinct files
access1 24 4716491 63822173 (+1353%) # distinct files
access2 24 2378041 5370335 (+125%) # same file
The above benchmarks are not integrated into will-it-scale, but can be
found in a pull request:
https://github.com/antonblanchard/will-it-scale/pull/36/files
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5SogAKCRCRxhvAZXjc
orVwAP4jJ1dPZYx1xHip9TfB5fv5xHz3euhvWns6qGJdVYoHzwEAhVxgYUpqWdXX
L/+VKRFFujYxsSXP4BbS3xDPUJeQFAI=
=ccK2
-----END PGP SIGNATURE-----
Merge tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs hardening update from Christian Brauner:
"Jan pointed out that during shutdown both filp_close() and super block
destruction will use basic printk logging when bugs are detected. This
causes issues in a few scenarios:
- Tools like syzkaller cannot figure out that the logged message
indicates a bug.
- Users that explicitly opt in to have the kernel bug on data
corruption by selecting CONFIG_BUG_ON_DATA_CORRUPTION should see
the kernel crash when they did actually select that option.
- When there are busy inodes after the superblock is shut down later
access to such a busy inodes walks through freed memory. It would
be better to cleanly crash instead.
All of this can be addressed by using the already existing
CHECK_DATA_CORRUPTION() macro in these places when kernel bugs are
detected. Its logging improvement is useful for all users.
Otherwise this only has a meaningful behavioral effect when users do
select CONFIG_BUG_ON_DATA_CORRUPTION which means this is backward
compatible for regular users"
* tag 'fs.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detected
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
=+BG5
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs idmapping updates from Christian Brauner:
- Last cycle we introduced the dedicated struct mnt_idmap type for
mount idmapping and the required infrastucture in 256c8aed2b ("fs:
introduce dedicated idmap type for mounts"). As promised in last
cycle's pull request message this converts everything to rely on
struct mnt_idmap.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem with
namespaces that are relevant on the mount level. Especially for
non-vfs developers without detailed knowledge in this area this was a
potential source for bugs.
This finishes the conversion. Instead of passing the plain namespace
around this updates all places that currently take a pointer to a
mnt_userns with a pointer to struct mnt_idmap.
Now that the conversion is done all helpers down to the really
low-level helpers only accept a struct mnt_idmap argument instead of
two namespace arguments.
Conflating mount and other idmappings will now cause the compiler to
complain loudly thus eliminating the possibility of any bugs. This
makes it impossible for filesystem developers to mix up mount and
filesystem idmappings as they are two distinct types and require
distinct helpers that cannot be used interchangeably.
Everything associated with struct mnt_idmap is moved into a single
separate file. With that change no code can poke around in struct
mnt_idmap. It can only be interacted with through dedicated helpers.
That means all filesystems are and all of the vfs is completely
oblivious to the actual implementation of idmappings.
We are now also able to extend struct mnt_idmap as we see fit. For
example, we can decouple it completely from namespaces for users that
don't require or don't want to use them at all. We can also extend
the concept of idmappings so we can cover filesystem specific
requirements.
In combination with the vfs{g,u}id_t work we finished in v6.2 this
makes this feature substantially more robust and thus difficult to
implement wrong by a given filesystem and also protects the vfs.
- Enable idmapped mounts for tmpfs and fulfill a longstanding request.
A long-standing request from users had been to make it possible to
create idmapped mounts for tmpfs. For example, to share the host's
tmpfs mount between multiple sandboxes. This is a prerequisite for
some advanced Kubernetes cases. Systemd also has a range of use-cases
to increase service isolation. And there are more users of this.
However, with all of the other work going on this was way down on the
priority list but luckily someone other than ourselves picked this
up.
As usual the patch is tiny as all the infrastructure work had been
done multiple kernel releases ago. In addition to all the tests that
we already have I requested that Rodrigo add a dedicated tmpfs
testsuite for idmapped mounts to xfstests. It is to be included into
xfstests during the v6.3 development cycle. This should add a slew of
additional tests.
* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
shmem: support idmapped mounts for tmpfs
fs: move mnt_idmap
fs: port vfs{g,u}id helpers to mnt_idmap
fs: port fs{g,u}id helpers to mnt_idmap
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
fs: port i_{g,u}id_{needs_}update() to mnt_idmap
quota: port to mnt_idmap
fs: port privilege checking helpers to mnt_idmap
fs: port inode_owner_or_capable() to mnt_idmap
fs: port inode_init_owner() to mnt_idmap
fs: port acl to mnt_idmap
fs: port xattr to mnt_idmap
fs: port ->permission() to pass mnt_idmap
fs: port ->fileattr_set() to pass mnt_idmap
fs: port ->set_acl() to pass mnt_idmap
fs: port ->get_acl() to pass mnt_idmap
fs: port ->tmpfile() to pass mnt_idmap
fs: port ->rename() to pass mnt_idmap
fs: port ->mknod() to pass mnt_idmap
fs: port ->mkdir() to pass mnt_idmap
...
Currently, filp_close() and generic_shutdown_super() use printk() to log
messages when bugs are detected. This is problematic because infrastructure
like syzkaller has no idea that this message indicates a bug.
In addition, some people explicitly want their kernels to BUG() when kernel
data corruption has been detected (CONFIG_BUG_ON_DATA_CORRUPTION).
And finally, when generic_shutdown_super() detects remaining inodes on a
system without CONFIG_BUG_ON_DATA_CORRUPTION, it would be nice if later
accesses to a busy inode would at least crash somewhat cleanly rather than
walking through freed memory.
To address all three, use CHECK_DATA_CORRUPTION() when kernel bugs are
detected.
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
256c8aed2b ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
locks_inode was turned into a wrapper around file_inode in de2a4a501e
(Partially revert "locks: fix file locking on overlayfs"). Finish
replacing locks_inode invocations everywhere with file_inode.
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.
Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
-----BEGIN PGP SIGNATURE-----
iIYEABYIAC4WIQSVyBthFV4iTW/VU1/l49DojIL20gUCY5b27RAcbWljQGRpZ2lr
b2QubmV0AAoJEOXj0OiMgvbSg9YA/0K10H+VsGt1+qqR4+w9SM7SFzbgszrV3Yw9
rwiPgaPVAP9rxXPr2bD2hAk7/Lv9LeJ2kfM9RzMErP1A6UsC5YVbDA==
=mAG7
-----END PGP SIGNATURE-----
Merge tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux
Pull landlock updates from Mickaël Salaün:
"This adds file truncation support to Landlock, contributed by Günther
Noack. As described by Günther [1], the goal of these patches is to
work towards a more complete coverage of file system operations that
are restrictable with Landlock.
The known set of currently unsupported file system operations in
Landlock is described at [2]. Out of the operations listed there,
truncate is the only one that modifies file contents, so these patches
should make it possible to prevent the direct modification of file
contents with Landlock.
The new LANDLOCK_ACCESS_FS_TRUNCATE access right covers both the
truncate(2) and ftruncate(2) families of syscalls, as well as open(2)
with the O_TRUNC flag. This includes usages of creat() in the case
where existing regular files are overwritten.
Additionally, this introduces a new Landlock security blob associated
with opened files, to track the available Landlock access rights at
the time of opening the file. This is in line with Unix's general
approach of checking the read and write permissions during open(), and
associating this previously checked authorization with the opened
file. An ongoing patch documents this use case [3].
In order to treat truncate(2) and ftruncate(2) calls differently in an
LSM hook, we split apart the existing security_path_truncate hook into
security_path_truncate (for truncation by path) and
security_file_truncate (for truncation of previously opened files)"
Link: https://lore.kernel.org/r/20221018182216.301684-1-gnoack3000@gmail.com [1]
Link: https://www.kernel.org/doc/html/v6.1/userspace-api/landlock.html#filesystem-flags [2]
Link: https://lore.kernel.org/r/20221209193813.972012-1-mic@digikod.net [3]
* tag 'landlock-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
samples/landlock: Document best-effort approach for LANDLOCK_ACCESS_FS_REFER
landlock: Document Landlock's file truncation support
samples/landlock: Extend sample tool to support LANDLOCK_ACCESS_FS_TRUNCATE
selftests/landlock: Test ftruncate on FDs created by memfd_create(2)
selftests/landlock: Test FD passing from restricted to unrestricted processes
selftests/landlock: Locally define __maybe_unused
selftests/landlock: Test open() and ftruncate() in multiple scenarios
selftests/landlock: Test file truncation support
landlock: Support file truncation
landlock: Document init_layer_masks() helper
landlock: Refactor check_access_path_dual() into is_access_to_paths_allowed()
security: Create file_truncate hook from path_truncate hook
Like path_truncate, the file_truncate hook also restricts file
truncation, but is called in the cases where truncation is attempted
on an already-opened file.
This is required in a subsequent commit to handle ftruncate()
operations differently to truncate() operations.
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20221018182216.301684-2-gnoack3000@gmail.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Currently setgid stripping in file_remove_privs()'s should_remove_suid()
helper is inconsistent with other parts of the vfs. Specifically, it only
raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
inode isn't in the caller's groups and the caller isn't privileged over the
inode although we require this already in setattr_prepare() and
setattr_copy() and so all filesystem implement this requirement implicitly
because they have to use setattr_{prepare,copy}() anyway.
But the inconsistency shows up in setgid stripping bugs for overlayfs in
xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping works
correctly when performing various write-like operations as an unprivileged
user (fallocate, reflink, write, etc.):
echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k
The test basically creates a file with 6666 permissions. While the file has
the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
regular filesystem like xfs what will happen is:
sys_fallocate()
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> setattr_copy()
In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.
But we also see that ATTR_KILL_SGID won't be set because while the file
is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.
So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
ATTR_KILL_SUID is set and does:
ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);
which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.
Now we call into the filesystem's ->setattr() inode operation which will
end up calling setattr_copy(). Since ATTR_MODE is set we will hit:
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
if (!vfsgid_in_group_p(vfsgid) &&
!capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
mode &= ~S_ISGID;
inode->i_mode = mode;
}
and since the caller in the test is neither capable nor in the group of the
inode the S_ISGID bit is stripped.
But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
has the consequence that neither the setgid nor the suid bits are stripped
even though it should be stripped because the inode isn't in the caller's
groups and the caller isn't privileged over the inode.
If overlayfs is in the mix things become a bit more complicated and the bug
shows up more clearly. When e.g., ovl_setattr() is hit from
ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
ATTR_KILL_SGID might be raised but because the check in notify_change() is
questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
stripped the S_ISGID bit isn't removed even though it should be stripped:
sys_fallocate()
-> vfs_fallocate()
-> ovl_fallocate()
-> file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> ovl_setattr()
// TAKE ON MOUNTER'S CREDS
-> ovl_do_notify_change()
-> notify_change()
// GIVE UP MOUNTER'S CREDS
// TAKE ON MOUNTER'S CREDS
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = attr_force | kill;
-> notify_change()
The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper to perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have notify_change()
not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
first place because the caller must calculate the flags via
should_remove_suid() anyway which would raise ATTR_KILL_SGID.
While we're at it we move should_remove_suid() from inode.c to attr.c
where it belongs with the rest of the iattr helpers. Especially since it
returns ATTR_KILL_S{G,U}ID flags. We also rename it to
setattr_should_drop_suidgid() to better reflect that it indicates both
setuid and setgid bit removal and also that it returns attr flags.
Running xfstests with this doesn't report any regressions. We should really
try and use consistent checks.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Beginning of the merge window we introduced the vfs{g,u}id_t types in
b27c82e129 ("attr: port attribute changes to new types") and changed
various codepaths over including chown_common().
During that change we forgot to account for the case were the passed
ownership value is -1. In this case the ownership fields in struct iattr
aren't initialized but we rely on them being initialized by the time we
generate the ownership to pass down to the LSMs. All the major LSMs
don't care about the ownership values at all. Only Tomoyo uses them and
so it took a while for syzbot to unearth this issue.
Fix this by initializing the ownership fields and do it within the
retry_deleg block. While notify_change() doesn't alter the ownership
fields currently we shouldn't rely on it.
Since no kernel has been released with these changes this does not
needed to be backported to any stable kernels.
[Christian Brauner (Microsoft) <brauner@kernel.org>]
* rewrote commit message
* use INVALID_VFS{G,U}ID macros
Fixes: b27c82e129 ("attr: port attribute changes to new types") # mainline only
Reported-and-tested-by: syzbot+541e21dcc32c4046cba9@syzkaller.appspotmail.com
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Thread A trying to acquire a write lease checks the value of i_readcount
and i_writecount in check_conflicting_open() to verify that its own fd
is the only fd referencing the file.
Thread B trying to open the file for read will call break_lease() in
do_dentry_open() before incrementing i_readcount, which leaves a small
window where thread A can acquire the write lease and then thread B
completes the open of the file for read without breaking the write lease
that was acquired by thread A.
Fix this race by incrementing i_readcount before checking for existing
leases, same as the case with i_writecount.
Use a helper put_file_access() to decrement i_readcount or i_writecount
in do_dentry_open() and __fput().
Fixes: 387e3746d0 ("locks: eliminate false positive conflicts for write lease")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One of the goals is to reduce the overhead of using ->read_iter()
and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
has a surprising amount of overhead, in particular inside iocb_flags().
That's why the beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
=9UCV
-----END PGP SIGNATURE-----
Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs iov_iter updates from Al Viro:
"Part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter() and
->write_iter() instead of ->read()/->write().
new_sync_{read,write}() has a surprising amount of overhead, in
particular inside iocb_flags(). That's the explanation for the
beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work..."
* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
first_iovec_segment(): just return address
iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
iov_iter: first_{iovec,bvec}_segment() - simplify a bit
iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
iov_iter_bvec_advance(): don't bother with bvec_iter
copy_page_{to,from}_iter(): switch iovec variants to generic
keep iocb_flags() result cached in struct file
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
struct file: use anonymous union member for rcuhead and llist
btrfs: use IOMAP_DIO_NOSYNC
teach iomap_dio_rw() to suppress dsync
No need of likely/unlikely on calls of check_copy_size()