- New Features:
- Add NFSv4.2 xattr tracepoints
- Replace xprtiod WQ in rpcrdma
- Flexfiles cancels I/O on layout recall or revoke
- Bugfixes and Cleanups:
- Directly use ida_alloc() / ida_free()
- Don't open-code max_t()
- Prefer using strscpy over strlcpy
- Remove unused forward declarations
- Always return layout states on flexfiles layout return
- Have LISTXATTR treat NFS4ERR_NOXATTR as an empty reply instead of error
- Allow more xprtrdma memory allocations to fail without triggering a reclaim
- Various other xprtrdma clean ups
- Fix rpc_killall_tasks() races
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmNHJToACgkQ18tUv7Cl
QOtTbA//QiresBzf7cnZOAwiZbe9LXiWfR2p5IkBLJPYJ8xtTliRLwnwYgQib9OI
+4DzBiEqujah9BDac5OeatYW1UDLQ9lMIoCyvPjSw8Yxa8JEHDb/1ODDUOMS+ZIo
dk1AKV2Wi2stxn85Sy+VGriE3JKiaeJxAlsWgiT/BLP0hAyZw1L3Tg017EgxVIVz
8cfPBciu/Bc2/pZp9f5+GBjAlcUX0u/JFKiLPDHDZkvFTr4RgREZOyStDWncgsxK
iHAIfSr6TxlynHabNAnFNVuYq7gkBe3jg1TkABdQ+SilAgdLpugAW8MFdig0AZQO
UIsVJHjRHLpz6cJurnDcu9tGB6jLVTZfyz8PZQl5H9CqnbSHUxdOCTuve7fGhVas
+wSXq1U98gStzoqtw5pMwsB2YSSOsUR8QEZpLEkvQgzHwoszNa7FrELqaZUJyJHR
qmRH2nKCzsSBbQn5AhnzHBxzeOv6r0r3YjvKd5utwsRtq3g9GX14KAOmqvDTKk2q
9KmrGlDVtVmOww2QnPTXH6mSthHLuqcKg1H2H7Xymmskq9n8PC6M+EiQd8XsKNJa
MfBkOVFdxrJq6Htpx4IMLJP6jvYVKEbef2eRFt8hNnla8pMPlsDqoIysJulaWpiB
HqdoPHR9Y26Qxuw7G91ba5Q5qqu9+ZOLB9jeSRjcXtsUDxq9f/A=
=p47k
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Add NFSv4.2 xattr tracepoints
- Replace xprtiod WQ in rpcrdma
- Flexfiles cancels I/O on layout recall or revoke
Bugfixes and Cleanups:
- Directly use ida_alloc() / ida_free()
- Don't open-code max_t()
- Prefer using strscpy over strlcpy
- Remove unused forward declarations
- Always return layout states on flexfiles layout return
- Have LISTXATTR treat NFS4ERR_NOXATTR as an empty reply instead of
error
- Allow more xprtrdma memory allocations to fail without triggering a
reclaim
- Various other xprtrdma clean ups
- Fix rpc_killall_tasks() races"
* tag 'nfs-for-6.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (27 commits)
NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked
SUNRPC: Add API to force the client to disconnect
SUNRPC: Add a helper to allow pNFS drivers to selectively cancel RPC calls
SUNRPC: Fix races with rpc_killall_tasks()
xprtrdma: Fix uninitialized variable
xprtrdma: Prevent memory allocations from driving a reclaim
xprtrdma: Memory allocation should be allowed to fail during connect
xprtrdma: MR-related memory allocation should be allowed to fail
xprtrdma: Clean up synopsis of rpcrdma_regbuf_alloc()
xprtrdma: Clean up synopsis of rpcrdma_req_create()
svcrdma: Clean up RPCRDMA_DEF_GFP
SUNRPC: Replace the use of the xprtiod WQ in rpcrdma
NFSv4.2: Add a tracepoint for listxattr
NFSv4.2: Add tracepoints for getxattr, setxattr, and removexattr
NFSv4.2: Move TRACE_DEFINE_ENUM(NFS4_CONTENT_*) under CONFIG_NFS_V4_2
NFSv4.2: Add special handling for LISTXATTR receiving NFS4ERR_NOXATTR
nfs: remove nfs_wait_atomic_killable() and nfs_write_prepare() declaration
NFSv4: remove nfs4_renewd_prepare_shutdown() declaration
fs/nfs/pnfs_nfs.c: fix spelling typo and syntax error in comment
NFSv4/pNFS: Always return layout stats on layout return for flexfiles
...
- Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
- Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
- Freezer:
- Rewrite the core freezer to behave better wrt thawing and be simpler
in general, by replacing PF_FROZEN with TASK_FROZEN & fixing/adjusting
all the fallout.
- Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() & replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
- Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/01cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1geZA/+PB4KC1T9aVxzaTHI36R03YgJYZmIdtxw
wTf02MixePmz+gQCbepJbempGOh5ST28aOcI0xhdYOql5B63MaUBBMlB0HvGUyDG
IU3zETqLMRtAbnSTdQFv8m++ECUtZYp8/x1FCel4WO7ya4ETkRu1NRfCoUepEhpZ
aVAlae9LH3NBaF9t7s0PT2lTjf3pIzMFRkddJ0ywJhbFR3VnWat05fAK+J6fGY8+
LS54coefNlJD4oDh5TY8uniL1j5SmWmmwbk9Cdj7bLU5P3dFSS0/+5FJNHJPVGDE
srGT7wstRUcDrN0CnZo48VIUBiApJCCDqTfJYi9wNYd0NAHvwY6MIJJgEIY8mKsI
L/qH26H81Wt+ezSZ/5JIlGlZ/LIeNaa6OO/fbWEYABBQogvvx3nxsRNUYKSQzumH
CnSBasBjLnjWyLlK4qARM9cI7NFSEK6NUigrEx/7h8JFu/8T4DlSy6LsF1HUyKgq
4+FJLAqG6cL0tcwB/fHYd0oRESN8dStnQhGxSojgufwLc7dlFULvCYF5JM/dX+/V
IKwbOfIOeOn6ViMtSOXAEGdII+IQ2/ZFPwr+8Z5JC7NzvTVL6xlu/3JXkLZR3L7o
yaXTSaz06h1vil7Z+GRf7RHc+wUeGkEpXh5vnarGZKXivhFdWsBdROIJANK+xR0i
TeSLCxQxXlU=
=KjMD
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Debuggability:
- Change most occurances of BUG_ON() to WARN_ON_ONCE()
- Reorganize & fix TASK_ state comparisons, turn it into a bitmap
- Update/fix misc scheduler debugging facilities
Load-balancing & regular scheduling:
- Improve the behavior of the scheduler in presence of lot of
SCHED_IDLE tasks - in particular they should not impact other
scheduling classes.
- Optimize task load tracking, cleanups & fixes
- Clean up & simplify misc load-balancing code
Freezer:
- Rewrite the core freezer to behave better wrt thawing and be
simpler in general, by replacing PF_FROZEN with TASK_FROZEN &
fixing/adjusting all the fallout.
Deadline scheduler:
- Fix the DL capacity-aware code
- Factor out dl_task_is_earliest_deadline() &
replenish_dl_new_period()
- Relax/optimize locking in task_non_contending()
Cleanups:
- Factor out the update_current_exec_runtime() helper
- Various cleanups, simplifications"
* tag 'sched-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
sched: Fix more TASK_state comparisons
sched: Fix TASK_state comparisons
sched/fair: Move call to list_last_entry() in detach_tasks
sched/fair: Cleanup loop_max and loop_break
sched/fair: Make sure to try to detach at least one movable task
sched: Show PF_flag holes
freezer,sched: Rewrite core freezer logic
sched: Widen TAKS_state literals
sched/wait: Add wait_event_state()
sched/completion: Add wait_for_completion_state()
sched: Add TASK_ANY for wait_task_inactive()
sched: Change wait_task_inactive()s match_state
freezer,umh: Clean up freezer/initrd interaction
freezer: Have {,un}lock_system_sleep() save/restore flags
sched: Rename task_running() to task_on_cpu()
sched/fair: Cleanup for SIS_PROP
sched/fair: Default to false in test_idle_cores()
sched/fair: Remove useless check in select_idle_core()
sched/fair: Avoid double search on same cpu
sched/fair: Remove redundant check in select_idle_smt()
...
Rewrite the core freezer to behave better wrt thawing and be simpler
in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is
ensured frozen tasks stay frozen until thawed and don't randomly wake
up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up
two PF_flags (yay!).
Specifically; the current scheme works a little like:
freezer_do_not_count();
schedule();
freezer_count();
And either the task is blocked, or it lands in try_to_freezer()
through freezer_count(). Now, when it is blocked, the freezer
considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count()
stops working, and any random/spurious wakeup will let a task run
before its time.
That is, thawing tries to thaw things in explicit order; kernel
threads and workqueues before doing bringing SMP back before userspace
etc.. However due to the above mentioned races it is entirely possible
for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem in asymmetric ISA architectures (eg ARMv9)
where the userspace task requires a special CPU to run.
As said; replace this with a special task state TASK_FROZEN and add
the following state transitions:
TASK_FREEZABLE -> TASK_FROZEN
__TASK_STOPPED -> TASK_FROZEN
__TASK_TRACED -> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL
(IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state
is already required to deal with spurious wakeups and the freezer
causes one such when thawing the task (since the original state is
lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since
their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are
free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
Currently, when the writeback code detects a server reboot, it redirties
any pages that were not committed to disk, and it sets the flag
NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
that dirtied the file. While this allows the file descriptor in question
to redrive its own writes, it violates the fsync() requirement that we
should be synchronising all writes to disk.
While the problem is infrequent, we do see corner cases where an
untimely server reboot causes the fsync() call to abandon its attempt to
sync data to disk and causing data corruption issues due to missed error
conditions or similar.
In order to tighted up the client's ability to deal with this situation
without introducing livelocks, add a counter that records the number of
times pages are redirtied due to a server reboot-like condition, and use
that in fsync() to redrive the sync to disk.
Fixes: 2197e9b06c ("NFS: Fix up fsync() when the server rebooted")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Highlights include:
Stable fixes:
- SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
Bugfixes:
- Fix an Oopsable condition due to SLAB_ACCOUNT setting in the NFSv4.2
xattr code.
- Fix for open() using an file open mode of '3' in NFSv4
- Replace readdir's use of xxhash() with hash_64()
- Several patches to handle malloc() failure in SUNRPC
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJQb5IACgkQZwvnipYK
APLVeRAAheQX422gIrjrWsQrZxJWmMDBBj9qwiV14XMs6g5KF/DtNbIJm1BGjA9Y
aLgNVknSDcGHMmvdh5MCVF7R61sqkCNbQWP3Bx6OBP6ei3FEgQBPyujcFXAaZ7UK
T4fBzfPoEYaxVkEG/L5BsAUnh6TvyCfNm1oqZZv2bv9M0U6UlXnRLWZN9I6cAtcw
GI9HgnufWOxJWIqPd9dY45aF44Ru+aJ953eecPh0G83Reoc99EAU+PvJJD18Wl0H
BZqM6ar5pNush50yVIjnPFXg+sM97dKGlLvYL11Uy7f8valSaBXPZLZQ/jG4Vx/a
m/l8FVwBEV89BG7z6jKNju7ERbO+xbPgXP8lSwkj69fXOpuvzo/G6VAxS6ZJww12
p6TJnhCMSEF7qrQc5mejA803dT4MiJWo4i2th482Ws/tRN5H+y9pDYLsvBPM8iHo
sVkJBco04tBHEH3qKgDTu8EY7/shH+GVO3Wmtcjz47T2rbmQB7gJtfipHcLmV2qd
Jy1tiz1T7rhGs7KNtWpu8210MY9iFhIt3rAvdGdeVmgnNjrmRex0Sxqx8vgKUAFE
aQjXkkpwHA3rHWnuGPMKu5i48BPFQ8m4QgdVC9F6ylNT6RIbImqplQiwIYNIxYC+
cBohUG1yYZRjFEMkXBdfK5cImkTLW7RkLCatRh5v8OU8xqpRVL8=
=edfs
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Stable fixes:
- SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
Bugfixes:
- Fix an Oopsable condition due to SLAB_ACCOUNT setting in the
NFSv4.2 xattr code.
- Fix for open() using an file open mode of '3' in NFSv4
- Replace readdir's use of xxhash() with hash_64()
- Several patches to handle malloc() failure in SUNRPC"
* tag 'nfs-for-5.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg()
SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
SUNRPC: Handle allocation failure in rpc_new_task()
NFS: Ensure rpc_run_task() cannot fail in nfs_async_rename()
NFSv4/pnfs: Handle RPC allocation errors in nfs4_proc_layoutget
SUNRPC: Handle low memory situations in call_status()
SUNRPC: Handle ENOMEM in call_transmit_status()
NFSv4.2: Fix missing removal of SLAB_ACCOUNT on kmem_cache allocation
SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
NFS: Replace readdir's use of xxhash() with hash_64()
SUNRPC: handle malloc failure in ->request_prepare
NFSv4: fix open failure with O_ACCMODE flag
Revert "NFSv4: Handle the special Linux file open access mode"
Highlights include:
Features:
- Switch NFS to use readahead instead of the obsolete readpages.
- Readdir fixes to improve cacheability of large directories when there
are multiple readers and writers.
- Readdir performance improvements when doing a seekdir() immediately
after opening the directory (common when re-exporting NFS).
- NFS swap improvements from Neil Brown.
- Loosen up memory allocation to permit direct reclaim and write back
in cases where there is no danger of deadlocking the writeback code or
NFS swap.
- Avoid sillyrename when the NFSv4 server claims to support the
necessary features to recover the unlinked but open file after reboot.
Bugfixes:
- Patch from Olga to add a mount option to control NFSv4.1 session
trunking discovery, and default it to being off.
- Fix a lockup in nfs_do_recoalesce().
- Two fixes for list iterator variables being used when pointing to the
list head.
- Fix a kernel memory scribble when reading from a non-socket transport
in /sys/kernel/sunrpc.
- Fix a race where reconnecting to a server could leave the TCP socket
stuck forever in the connecting state.
- Patch from Neil to fix a shutdown race which can leave the SUNRPC
transport timer primed after we free the struct xprt itself.
- Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy
offload.
- Sunrpc patch from Olga to avoid resending a task on an offlined
transport.
Cleanups:
- Patches from Dave Wysochanski to clean up the fscache code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmJDLLUACgkQZwvnipYK
APJZ5Q/9FH5S38RcUOxpFS+XaVdmxltJTMQv9Bz5mcyljq6PwRjTn73Qs7MfBQfD
k90O9Pn/EXgQpFWNW4o/saQGLAO1MbVaesdNNF92Le9NziMz7lr95fsFXOqpg9EQ
rEVOImWinv/N0+P0yv5tgjx6Fheu70L4dnwsxepnmtv2R1VjXOVchK7WjsV+9qey
Xvn15OzhtVCGhRvvKfmGuGW+2I0D29tRNn2X0Pi+LjxAEQ0We/rcsJeGDMGYtbeC
2U4ADk7KB0kJq04WYCH8k8tTrljL3PUONzO6QP1lltmAZ0FOwXf/AjyaD3SML2Xu
A1yc2bEn3/S6wCQ3BiMxoE7UCSewg/ZyQD7B9GjZ/nWYNXREhLKvtVZt+ULTCA6o
UKAI2vlEHljzQI2YfbVVZFyL8KCZqi+BFcJ/H/dtFKNazIkiefkaF+703Pflazis
HjhA5G6IT0Pk6qZ6OBp+Z/4kCDYZYXXi9ysiNuMPWCiYEfZzt7ocUjm44uRAi+vR
Qpf0V01Rg4iplTj3YFXqb9TwxrudzC7AcQ7hbLJia0kD/8djG+aGZs3SJWeO+VuE
Xa41oxAx9jqOSRvSKPvj7nbexrwn8PtQI2NtDRqjPXlg6+Jbk23u8lIEVhsCH877
xoMUHtTKLyqECskzRN4nh7qoBD3QfiLuI74aDcPtGEuXzU1+Ox8=
=jeGt
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Switch NFS to use readahead instead of the obsolete readpages.
- Readdir fixes to improve cacheability of large directories when
there are multiple readers and writers.
- Readdir performance improvements when doing a seekdir() immediately
after opening the directory (common when re-exporting NFS).
- NFS swap improvements from Neil Brown.
- Loosen up memory allocation to permit direct reclaim and write back
in cases where there is no danger of deadlocking the writeback code
or NFS swap.
- Avoid sillyrename when the NFSv4 server claims to support the
necessary features to recover the unlinked but open file after
reboot.
Bugfixes:
- Patch from Olga to add a mount option to control NFSv4.1 session
trunking discovery, and default it to being off.
- Fix a lockup in nfs_do_recoalesce().
- Two fixes for list iterator variables being used when pointing to
the list head.
- Fix a kernel memory scribble when reading from a non-socket
transport in /sys/kernel/sunrpc.
- Fix a race where reconnecting to a server could leave the TCP
socket stuck forever in the connecting state.
- Patch from Neil to fix a shutdown race which can leave the SUNRPC
transport timer primed after we free the struct xprt itself.
- Patch from Xin Xiong to fix reference count leaks in the NFSv4.2
copy offload.
- Sunrpc patch from Olga to avoid resending a task on an offlined
transport.
Cleanups:
- Patches from Dave Wysochanski to clean up the fscache code"
* tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits)
NFSv4/pNFS: Fix another issue with a list iterator pointing to the head
NFS: Don't loop forever in nfs_do_recoalesce()
SUNRPC: Don't return error values in sysfs read of closed files
SUNRPC: Do not dereference non-socket transports in sysfs
NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error
SUNRPC don't resend a task on an offlined transport
NFS: replace usage of found with dedicated list iterator variable
SUNRPC: avoid race between mod_timer() and del_timer_sync()
pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
NFS: Avoid writeback threads getting stuck in mempool_alloc()
NFS: nfsiod should not block forever in mempool_alloc()
SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
SUNRPC: Fix unx_lookup_cred() allocation
NFS: Fix memory allocation in rpc_alloc_task()
NFS: Fix memory allocation in rpc_malloc()
SUNRPC: Improve accuracy of socket ENOBUFS determination
SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
SUNRPC: Fix socket waits for write buffer space
...
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If attribute caching is turned off, then use of readdirplus is not going
to help stat() performance.
Readdirplus also doesn't help if a file is being written to, since we
will have to flush those writes in order to sync the mtime/ctime.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The heuristic for readdirplus is designed to try to detect 'ls -l' and
similar patterns. It does so by looking for cache hit/miss patterns in
both the attribute cache and in the dcache of the files in a given
directory, and then sets a flag for the readdirplus code to interpret.
The problem with this approach is that a single attribute or dcache miss
can cause the NFS code to force a refresh of the attributes for the
entire set of files contained in the directory.
To be able to make a more nuanced decision, let's sample the number of
hits and misses in the set of open directory descriptors. That allows us
to set thresholds at which we start preferring READDIRPLUS over regular
READDIR, or at which we start to force a re-read of the remaining
readdir cache using READDIRPLUS.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We should never expect the 'xattr_cache' to be non-null in that case,
hence nfs_set_cache_invalid() is just going to optimise it away.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Now that we have more fine grained attribute revalidation, let's just
get rid of NFS_INO_REVAL_PAGECACHE.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Assume that sections that should not re-enter the filesystem are already
protected with memalloc_nofs_save/restore call, so relax those GFP_NOFS
instances which might be used by other contexts.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Allow kmemcg to limit the number of open/lock file contexts, in the same
way that it limits the parent file descriptors.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The result of the writeback, whether it is an ENOSPC or an EIO, or
anything else, does not inhibit the NFS client from reporting the
correct file timestamps.
Fixes: 79566ef018 ("NFS: Getattr doesn't require data sync semantics")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Change the nfs filesystem to support fscache's indexing rewrite and
reenable caching in nfs.
The following changes have been made:
(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole.
(2) The session cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). That takes three parameters: a string
representing the "volume" in the index, a string naming the cache to
use (or NULL) and a u64 that conveys coherency metadata for the
volume.
For nfs, I've made it render the volume name string as:
"nfs,<ver>,<family>,<address>,<port>,<fsidH>,<fsidL>*<,param>[,<uniq>]"
(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.
fscache_acquire_cookie() is passed the same keying and coherency
information as before.
(4) fscache_enable/disable_cookie() have been removed.
Call fscache_use_cookie() and fscache_unuse_cookie() when a file is
opened or closed to prevent a cache file from being culled and to keep
resources to hand that are needed to do I/O.
If a file is opened for writing, we invalidate it with
FSCACHE_INVAL_DIO_WRITE in lieu of doing writeback to the cache,
thereby making it cease caching until all currently open files are
closed. This should give the same behaviour as the uptream code.
Making the cache store local modifications isn't straightforward for
NFS, so that's left for future patches.
(5) fscache_invalidate() now needs to be given uptodate auxiliary data and
a file size. It also takes a flag to indicate if this was due to a
DIO write.
(6) Call nfs_fscache_invalidate() with FSCACHE_INVAL_DIO_WRITE on a file
to which a DIO write is made.
(7) Call fscache_note_page_release() from nfs_release_page().
(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.
(9) The functions to read and write data to/from the cache are stubbed out
pending a conversion to use netfslib.
Changes
=======
ver #3:
- Added missing =n fallback for nfs_fscache_release_file()[1][2].
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.
- Remove NFS_INO_FSCACHE as it's no longer used.
- Need to unuse a cookie on file-release, not inode-clear.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna.schumaker@netapp.com>
cc: linux-nfs@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/202112100804.nksO8K4u-lkp@intel.com/ [1]
Link: https://lore.kernel.org/r/202112100957.2oEDT20W-lkp@intel.com/ [2]
Link: https://lore.kernel.org/r/163819668938.215744.14448852181937731615.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906979003.143852.2601189243864854724.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967182112.1823006.7791504655391213379.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021575950.640689.12069642327533368467.stgit@warthog.procyon.org.uk/ # v4
This provides some insight into the client's invalidation behavior to show
both when the client uses the helper, and the results of calling the
helper which can vary depending on how the helper is called.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
For creating fattrs with the label field already allocated for us. I
also update nfs_free_fattr() to free the label in the end.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We're about to add a check in nfs_free_fattr() for whether or not the
label is non-zero.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Save some space in the nfs_inode by setting up an anonymous union with
the fields that are peculiar to a specific type of filesystem object.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Generate a trace event whenever the NFS client modifies the size of
a file. These new events aid troubleshooting workloads that trigger
races around size updates.
There are four new trace points, all named nfs_size_something so
they are easy to grep for or enable as a group with a single glob.
Size updated on the server:
kworker/u24:10-194 [010] 369.939174: nfs_size_update: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899344277980615 cursize=250471 newsize=172083
Server-side size update reported via NFSv3 WCC attributes:
fsx-1387 [006] 380.760686: nfs_size_wcc: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899355909932456 cursize=146792 newsize=171216
File has been truncated locally:
fsx-1387 [007] 369.437421: nfs_size_truncate: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899231200117272 cursize=215244 newsize=0
File has been extended locally:
fsx-1387 [007] 369.439213: nfs_size_grow: fileid=00:28:2 fhandle=0x36fbbe51 version=1752899343704248410 cursize=258048 newsize=262144
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
NFS_INO_DATA_INVAL_DEFER and NFS_INO_INVALID_DATA should be considered
mutually exclusive.
Fixes: 1c341b7775 ("NFS: Add deferred cache invalidation for close-to-open consistency violations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Both NFSv3 and NFSv2 generate their change attribute from the ctime
value that was supplied by the server. However the problem is that there
are plenty of servers out there with ctime resolutions of 1ms or worse.
In a modern performance system, this is insufficient when trying to
decide which is the most recent set of attributes when, for instance, a
READ or GETATTR call races with a WRITE or SETATTR.
For this reason, let's revert to labelling the NFSv2/v3 change
attributes as NFS4_CHANGE_TYPE_IS_UNDEFINED. This will ensure we protect
against such races.
Fixes: 7b24dacf08 ("NFS: Another inode revalidation improvement")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
If the cached credential exists but doesn't have any expiration callback
then exit early.
Fix up atomicity issues when replacing the credential with a new one
since the existing code could lead to refcount leaks.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If a file has already been closed, then it should not be selected to
support further I/O.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[Trond: Fix an invalid pointer deref reported by Colin Ian King]
We know that the attributes changed on the server if and only if the
change attribute is different. Otherwise, we're just refreshing our
cache with values that were already known to be stale.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The inode is considered revalidated when we've checked the value of the
change attribute against our cached value since that suffices to
establish whether or not the other cached values are valid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when the
server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the tracking
of invalid cached file metadata.
- Support the NFSv4.2 "change_attr_type" attribute and use it to
optimise handling of change attribute updates.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmCVLooACgkQZwvnipYK
APJB5BAAtIJyhx40ooMBzcucDmXd1qovlKsb8ZlvnSI6c7wvHhFPNk9z4zwThnjL
FpVYzJzK6XzAQY/PtgbrPwnSUmW925ngPWYR/hiYe+OGPBnYV+tXP8izCyEkNgMg
45goDOxojGWl7AGTuAJiKcDSdH9PyIrbvt28iwcNSGjslasGSbAoL/836l4OIGr1
Ymxs/NDML11dPco8GIKLGtHd8leFGleDx089VeNsgud8MdaFErp16O5Iz8DdzRKd
W1l2zDMb05j8eDZIfy3w3FyrLkDXA+KgLSADiC8TcpxoadPaQJMeCvoIq8oqVndn
bZBoxduXdLgf54Aec0WnNKFAOyc7pGvZoSNmFouT7EGV73g+g1LQ+ZbEE1bb8fCQ
XHqCVaBt2+47NiTUgdxjXlZRfcn8fYKx0tVxfG3mQVMXUAWfsjmMyQMNgijDRJI2
8Wz3lZMRGMILbR9j4QpP1biVy/2zGNWG/TB5ZZyZMSY4uT+aOpzlqdknb4UsRaSp
f7MfmB7xEWpS4DJr9RIBrJ/hIdnMu1mNInxDPFo5Kl5HNp4TaPm2dPir2ZD2wMZI
daURTX7giUhpE15ZebQDBqWD+mTR0bVDqLLeo131JRmMfMEHugNrr49xe+NkBu/R
QWnFzgkGdQsOeiKRRwEUuhsi74JspqfwzdZzHqcRM5WuXVvBLcA=
=h01b
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Add validation of the UDP retrans parameter to prevent shift
out-of-bounds
- Don't discard pNFS layout segments that are marked for return
Bugfixes:
- Fix a NULL dereference crash in xprt_complete_bc_request() when the
NFSv4.1 server misbehaves.
- Fix the handling of NFS READDIR cookie verifiers
- Sundry fixes to ensure attribute revalidation works correctly when
the server does not return post-op attributes.
- nfs4_bitmask_adjust() must not change the server global bitmasks
- Fix major timeout handling in the RPC code.
- NFSv4.2 fallocate() fixes.
- Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling
- Copy offload attribute revalidation fixes
- Fix an incorrect filehandle size check in the pNFS flexfiles driver
- Fix several RDMA transport setup/teardown races
- Fix several RDMA queue wrapping issues
- Fix a misplaced memory read barrier in sunrpc's call_decode()
Features:
- Micro optimisation of the TCP transmission queue using TCP_CORK
- statx() performance improvements by further splitting up the
tracking of invalid cached file metadata.
- Support the NFSv4.2 'change_attr_type' attribute and use it to
optimise handling of change attribute updates"
* tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (85 commits)
xprtrdma: Fix a NULL dereference in frwr_unmap_sync()
sunrpc: Fix misplaced barrier in call_decode
NFSv4.2: Remove ifdef CONFIG_NFSD from NFSv4.2 client SSC code.
xprtrdma: Move fr_mr field to struct rpcrdma_mr
xprtrdma: Move the Work Request union to struct rpcrdma_mr
xprtrdma: Move fr_linv_done field to struct rpcrdma_mr
xprtrdma: Move cqe to struct rpcrdma_mr
xprtrdma: Move fr_cid to struct rpcrdma_mr
xprtrdma: Remove the RPC/RDMA QP event handler
xprtrdma: Don't display r_xprt memory addresses in tracepoints
xprtrdma: Add an rpcrdma_mr_completion_class
xprtrdma: Add tracepoints showing FastReg WRs and remote invalidation
xprtrdma: Avoid Send Queue wrapping
xprtrdma: Do not wake RPC consumer on a failed LocalInv
xprtrdma: Do not recycle MR after FastReg/LocalInv flushes
xprtrdma: Clarify use of barrier in frwr_wc_localinv_done()
xprtrdma: Rename frwr_release_mr()
xprtrdma: rpcrdma_mr_pop() already does list_del_init()
xprtrdma: Delete rpcrdma_recv_buffer_put()
xprtrdma: Fix cwnd update ordering
...
Pull vfs inode type handling updates from Al Viro:
"We should never change the type bits of ->i_mode or the method tables
(->i_op and ->i_fop) of a live inode.
Unfortunately, not all filesystems took care to prevent that"
* 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
spufs: fix bogosity in S_ISGID handling
9p: missing chunk of "fs/9p: Don't update file type when updating file attributes"
openpromfs: don't do unlock_new_inode() until the new inode is set up
hostfs_mknod(): don't bother with init_special_inode()
cifs: have cifs_fattr_to_inode() refuse to change type on live inode
cifs: have ->mkdir() handle race with another client sanely
do_cifs_create(): don't set ->i_mode of something we had not created
gfs2: be careful with inode refresh
ocfs2_inode_lock_update(): make sure we don't change the type bits of i_mode
orangefs_inode_is_stale(): i_mode type bits do *not* form a bitmap...
vboxsf: don't allow to change the inode type
afs: Fix updating of i_mode due to 3rd party change
ceph: don't allow type or device number to change on non-I_NEW inodes
ceph: fix up error handling with snapdirs
new helper: inode_wrong_type()
NFS_INO_REVAL_FORCED is intended to tell us that the cache needs
revalidation despite the fact that we hold a delegation. We shouldn't
need to store it anymore, though.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If we're trying to update the inode because a previous update left the
cache in a partially unrevalidated state, then allow the update if the
change attrs match.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the NFSv4.2 server supports the 'change_attr_type' attribute, then
allow the client to optimise its attribute cache update strategy.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
We should not be invalidating the access or acl caches in
nfs_check_inode_attributes(), since the point is we're unsure about
whether the contents of the struct nfs_fattr are fully up to date.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commit 0b467264d0 ("NFS: Fix attribute revalidation") changed the way
we populate the 'invalid' attribute, and made the line that strips away
the NFS_INO_INVALID_ATTR bits redundant.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If there is an outstanding layoutcommit, then the list of attributes
whose values are expected to change is not the full set. So let's
be explicit about the full list.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
chown()/chgrp() and chmod() are separate operations, and in addition,
there are mode operations that are performed automatically by the
server. So let's track mode validity separately from the file ownership
validity.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Rename can cause us to revalidate the access cache, so lets track the
nlinks separately from the mode/uid/gid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When checking cache validity, be more specific than just 'we want to
check the page cache validity'. In almost all cases, we want to check
that change attribute, and possibly also the size.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Add an argument to nfs_revalidate_inode() to allow callers to specify
which attributes they need to check for validity.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>