Add a number of helper functions to manage access to a cookie, pinning the
cache object in place for the duration to prevent cache withdrawal from
removing it:
(1) void fscache_init_access_gate(struct fscache_cookie *cookie);
This function initialises the access count when a cache binds to a
cookie. An extra ref is taken on the access count to prevent wakeups
while the cache is active. We're only interested in the wakeup when a
cookie is being withdrawn and we're waiting for it to quiesce - at
which point the counter will be decremented before the wait.
The FSCACHE_COOKIE_NACC_ELEVATED flag is set on the cookie to keep
track of the extra ref in order to handle a race between
relinquishment and withdrawal both trying to drop the extra ref.
(2) bool fscache_begin_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function attempts to begin access upon a cookie, pinning it in
place if it's cached. If successful, it returns true and leaves a the
access count incremented.
(3) void fscache_end_cookie_access(struct fscache_cookie *cookie,
enum fscache_access_trace why);
This function drops the access count obtained by (2), permitting
object withdrawal to take place when it reaches zero.
A tracepoint is provided to track changes to the access counter on a
cookie.
Changes
=======
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819595085.215744.1706073049250505427.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906895313.143852.10141619544149102193.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095980.1823006.1133648159424418877.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021503063.640689.8870918985269528670.stgit@warthog.procyon.org.uk/ # v4
Add a pair of helper functions to manage access to a volume, pinning the
volume in place for the duration to prevent cache withdrawal from removing
it:
bool fscache_begin_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
void fscache_end_volume_access(struct fscache_volume *volume,
enum fscache_access_trace why);
The way the access gate on the volume works/will work is:
(1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE),
then we return false to indicate access was not permitted.
(2) If the cache tests as live, then we increment the volume's n_accesses
count and then recheck the cache liveness, ending the access if it
ceased to be live.
(3) When we end the access, we decrement the volume's n_accesses and wake
up the any waiters if it reaches 0.
(4) Whilst the cache is caching, the volume's n_accesses is kept
artificially incremented to prevent wakeups from happening.
(5) When the cache is taken offline, the state is changed to prevent new
accesses, the volume's n_accesses is decremented and we wait for it to
become 0.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819594158.215744.8285859817391683254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906894315.143852.5454793807544710479.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095028.1823006.9173132503876627466.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021501546.640689.9631510472149608443.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow data file cookies to be acquired
and relinquished by the network filesystem. It is intended that the
filesystem will create such cookies per-inode under a volume.
To request a cookie, the filesystem should call:
struct fscache_cookie *
fscache_acquire_cookie(struct fscache_volume *volume,
u8 advice,
const void *index_key,
size_t index_key_len,
const void *aux_data,
size_t aux_data_len,
loff_t object_size)
The filesystem must first have created a volume cookie, which is passed in
here. If it passes in NULL then the function will just return a NULL
cookie.
A binary key should be passed in index_key and is of size index_key_len.
This is saved in the cookie and is used to locate the associated data in
the cache.
A coherency data buffer of size aux_data_len will be allocated and
initialised from the buffer pointed to by aux_data. This is used to
validate cache objects when they're opened and is stored on disk with them
when they're committed. The data is stored in the cookie and will be
updateable by various functions in later patches.
The object_size must also be given. This is also used to perform a
coherency check and to size the backing storage appropriately.
This function disallows a cookie from being acquired twice in parallel,
though it will cause the second user to wait if the first is busy
relinquishing its cookie.
When a network filesystem has finished with a cookie, it should call:
void
fscache_relinquish_cookie(struct fscache_volume *volume,
bool retire)
If retire is true, any backing data will be discarded immediately.
Changes
=======
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[1].
- Add a check to see if the cookie is still hashed at the point of
freeing.
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
- Remove the unused cookie pointer field from the fscache_acquire
tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/163819590658.215744.14934902514281054323.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906891983.143852.6219772337558577395.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967088507.1823006.12659006350221417165.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021498432.640689.12743483856927722772.stgit@warthog.procyon.org.uk/ # v4
Add functions to the fscache API to allow volumes to be acquired and
relinquished by the network filesystem. A volume is an index of data
storage cache objects. A volume is represented by a volume cookie in the
API. A filesystem would typically create a volume for a superblock and
then create per-inode cookies within it.
To request a volume, the filesystem calls:
struct fscache_volume *
fscache_acquire_volume(const char *volume_key,
const char *cache_name,
const void *coherency_data,
size_t coherency_len)
The volume_key is a printable string used to match the volume in the cache.
It should not contain any '/' characters. For AFS, for example, this would
be "afs,<cellname>,<volume_id>", e.g. "afs,example.com,523001".
The cache_name can be NULL, but if not it should be a string indicating the
name of the cache to use if there's more than one available.
The coherency data, if given, is an arbitrarily-sized blob that's attached
to the volume and is compared when the volume is looked up. If it doesn't
match, the old volume is judged to be out of date and it and everything
within it is discarded.
Acquiring a volume twice concurrently is disallowed, though the function
will wait if an old volume cookie is being relinquishing.
When a network filesystem has finished with a volume, it should return the
volume cookie by calling:
void
fscache_relinquish_volume(struct fscache_volume *volume,
const void *coherency_data,
bool invalidate)
If invalidate is true, the entire volume will be discarded; if false, the
volume will be synced and the coherency data will be updated.
Changes
=======
ver #4:
- Removed an extraneous param from kdoc on fscache_relinquish_volume()[3].
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[2].
- Make the coherency data an arbitrary blob rather than a u64, but don't
store it for the moment.
ver #2:
- Fix error check[1].
- Make a fscache_acquire_volume() return errors, including EBUSY if a
conflicting volume cookie already exists. No error is printed now -
that's left to the netfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203095608.GC2480@kili/ [1]
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [2]
Link: https://lore.kernel.org/r/20211220224646.30e8205c@canb.auug.org.au/ [3]
Link: https://lore.kernel.org/r/163819588944.215744.1629085755564865996.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906890630.143852.13972180614535611154.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967086836.1823006.8191672796841981763.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021495816.640689.4403156093668590217.stgit@warthog.procyon.org.uk/ # v4
Both fallocate and clone can end up updating the blocks used attribute.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When doing a non-pNFS write, allow the writeback code to specify that it
also needs to update 'blocks used'.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the NFS code to use default_groups field which has been the
preferred way since aa30f47cf6 ("kobject: Add support for default
attribute groups to kobj_type") so that we can soon get rid of the
obsolete default_attrs field.
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When dealing with case insensitive names, the client has no idea how the
server performs the mapping, so cannot collapse the dentries into a
single representative. So both rename and unlink need to deal with the
fact that there could be several dentries representing the file, and
have to somehow force them to be revalidated. Use d_prune_aliases() as a
big hammer approach.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we create a file, rename it, or hardlink it, then we need to assume
that cached negative dentries need to be revalidated.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the directory contents change, we cannot rely on the negative dentry
being cacheable.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add capabilities to allow the NFS client to recognise when it is dealing
with case insensitive and case preserving filesystems.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When decode_devicenotify_args() exits with no entries, we need to
ensure that the struct cb_devicenotifyargs is initialised to
{ 0, NULL } in order to avoid problems in
nfs4_callback_devicenotify().
Reported-by: <rtm@csail.mit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
kstrdup() returns NULL when some internal memory errors happen, it is
better to check the return value of it so to catch the memory error in
time.
Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the bitmask of the attributes doesn't include the security label,
don't bother printing it. Since the label might not be null terminated,
adjust the printing format accordingly.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].
Refactor the code a bit according to the use of a flexible-array member
in struct nfs4_file_layout_dsaddr instead of a one-element array, and
use the struct_size() helper.
This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().
This issue was found with the help of Coccinelle and audited and fixed,
manually.
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Renaming a file is required by POSIX to update the file ctime, so
ensure that the file data is synced to disk so that we don't clobber the
updated ctime by writing back after creating the hard link.
Fixes: f2c2c552f1 ("NFS: Move delegation recall into the NFSv4 callback for rename_setup()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Creating a hard link is required by POSIX to update the file ctime, so
ensure that the file data is synced to disk so that we don't clobber the
updated ctime by writing back after creating the hard link.
Fixes: 9f76827287 ("NFS: Move the delegation return down into nfs4_proc_link()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Storing the 'struct cred *' in nfs_access_entry is problematic.
An active 'cred' can keep a 'struct key *' active, and a quota is
imposed on the number of such keys that a user can maintain.
Cached 'nfs_access_entry' structs have indefinite lifetime, and having
these keep 'struct key's alive imposes on that quota.
So remove the 'struct cred *' and replace it with the fields we need:
kuid_t, kgid_t, and struct group_info *
This makes the 'struct nfs_access_entry' 64 bits larger.
New function "access_cmp" is introduced which is identical to
cred_fscmp() except that the second arg is an 'nfs_access_entry', rather
than a 'cred'
Fixes: b68572e07c ("NFS: change access cache to use 'struct cred'.")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Storing the 'struct cred *' in nfs_access_entry is problematic.
An active 'cred' can keep a 'struct key *' active, and a quota is
imposed on the number of such keys that a user can maintain.
Cached 'nfs_access_entry' structs have indefinite lifetime, and having
these keep 'struct key's alive imposes on that quota.
So a future patch will remove the ->cred ref from nfs_access_entry.
To prepare, change various functions to not assume there is a 'cred' in
the nfs_access_entry, but to pass the cred around explicitly.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently the nfs_access_get_cached family of functions report a
'struct nfs_access_entry' as the result, with both .mask and .cred set.
However the .cred is never used. This is probably good and there is no
guarantee that it won't be freed before use.
Change to only report the 'mask' - as this is all that is used or needed.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but
the kernel hasn't always validated FSSETXATTR correctly. Flag this as
something for the sysadmin to check out.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Online fsck depends on callers holding ILOCK_EXCL from the time they
decide to update a block mapping until after they've updated the reverse
mapping records to guarantee the stability of both mapping records.
Unfortunately, the quota code drops ILOCK_EXCL at the first transaction
roll in the dquot allocation process, which breaks that assertion. This
leads to sporadic failures in the online rmap repair code if the repair
code grabs the AGF after bmapi_write maps a new block into the quota
file's data fork but before it can finish the deferred rmap update.
Fix this by rewriting the function to hold the ILOCK until after the
transaction commit like all other bmap updates do, and get rid of the
dqread wrapper that does nothing but complicate the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
mp is being initialized to log->l_mp but this is never read
as record is overwritten later on. Remove the redundant
assignment.
Cleans up the following clang-analyzer warning:
fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Oh, let me count the ways that the kvmalloc API sucks dog eggs.
The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:
- 49.73% xlog_cil_commit
- 31.62% kvmalloc_node
- 29.96% __kmalloc_node
- 29.38% kmalloc_large_node
- 29.33% __alloc_pages
- 24.33% __alloc_pages_slowpath.constprop.0
- 18.35% __alloc_pages_direct_compact
- 17.39% try_to_compact_pages
- compact_zone_order
- 15.26% compact_zone
5.29% __pageblock_pfn_to_page
3.71% PageHuge
- 1.44% isolate_migratepages_block
0.71% set_pfnblock_flags_mask
1.11% get_pfnblock_flags_mask
- 0.81% get_page_from_freelist
- 0.59% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 3.24% try_to_free_pages
- 3.14% shrink_node
- 2.94% shrink_slab.constprop.0
- 0.89% super_cache_count
- 0.66% xfs_fs_nr_cached_objects
- 0.65% xfs_reclaim_inodes_count
0.55% xfs_perag_get_tag
0.58% kfree_rcu_shrink_count
- 2.09% get_page_from_freelist
- 1.03% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 4.88% get_page_from_freelist
- 3.66% _raw_spin_lock_irqsave
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
- 1.63% __vmalloc_node
- __vmalloc_node_range
- 1.10% __alloc_pages_bulk
- 0.93% __alloc_pages
- 0.92% get_page_from_freelist
- 0.89% rmqueue_bulk
- 0.69% _raw_spin_lock
- do_raw_spin_lock
__pv_queued_spin_lock_slowpath
13.73% memcpy_erms
- 2.22% kvfree
On this workload, that's almost a dozen CPUs all trying to compact
and reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page
and shrinker reclaim that direct reclaim is doing. Copying all the
metadata is taking far less CPU time than allocating the storage!
Direct reclaim should be considered extremely harmful.
This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using
kvmalloc to allow memory allocation to avoid doing lots of work to
try to do contiguous allocations.
Except it still does *lots of costly work* that is unnecessary.
Worse: the only way to avoid the slowpath page allocation trying to
do compaction on costly allocations is to turn off direct reclaim
(i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).
Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This
cuts off the vmalloc fallback, and this leads to almost instant OOM
problems which ends up in filesystems deadlocks, shutdowns and/or
kernel crashes.
I want some basic kvmalloc behaviour:
- kmalloc for a contiguous range with fail fast semantics - no
compaction direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
The really, really stupid part about this is these kvmalloc() calls
are run under memalloc_nofs task context, so all the allocations are
always reduced to GFP_NOFS regardless of the fact that kvmalloc
requires GFP_KERNEL to be passed in. IOWs, we're already telling
kvmalloc to behave differently to the gfp flags we pass in, but it
still won't allow vmalloc to be run with anything other than
GFP_KERNEL.
So, this patch open codes the kvmalloc() in the commit path to have
the above described behaviour. The result is we more than halve the
CPU time spend doing kvmalloc() in this path and transaction commits
with 64kB objects in them more than doubles. i.e. we get ~5x
reduction in CPU usage per costly-sized kvmalloc() invocation and
the profile looks like this:
- 37.60% xlog_cil_commit
16.01% memcpy_erms
- 8.45% __kmalloc
- 8.04% kmalloc_order_trace
- 8.03% kmalloc_order
- 7.93% alloc_pages
- 7.90% __alloc_pages
- 4.05% __alloc_pages_slowpath.constprop.0
- 2.18% get_page_from_freelist
- 1.77% wake_all_kswapds
....
- __wake_up_common_lock
- 0.94% _raw_spin_lock_irqsave
- 3.72% get_page_from_freelist
- 2.43% _raw_spin_lock_irqsave
- 5.72% vmalloc
- 5.72% __vmalloc_node_range
- 4.81% __get_vm_area_node.constprop.0
- 3.26% alloc_vmap_area
- 2.52% _raw_spin_lock
- 1.46% _raw_spin_lock
0.56% __alloc_pages_bulk
- 4.66% kvfree
- 3.25% vfree
- __vfree
- 3.23% __vunmap
- 1.95% remove_vm_area
- 1.06% free_vmap_area_noflush
- 0.82% _raw_spin_lock
- 0.68% _raw_spin_lock
- 0.92% _raw_spin_lock
- 1.40% kfree
- 1.36% __free_pages
- 1.35% __free_pages_ok
- 1.02% _raw_spin_lock_irqsave
It's worth noting that over 50% of the CPU time spent allocating
these shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct
reclaim from kmalloc, and could probably be made even less costly if
vmalloc() didn't use global spinlocks to protect it's structures.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the xfs sysfs code to use default_groups field which has
been the preferred way since aa30f47cf6 ("kobject: Add support for
default attribute groups to kobj_type") so that we can soon get rid of
the obsolete default_attrs field.
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When the kernel is locked down the kernel allows reading only debugfs
files with mode 444. Mode 400 is also valid but is not allowed.
Make the 444 into a mask.
Fixes: 5496197f9b ("debugfs: Restrict debugfs when the kernel is locked down")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Link: https://lore.kernel.org/r/20220104170505.10248-1-msuchanek@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Move the request list macros to the header file that defines that struct
they operate on.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-2-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Compress page will invalidate in truncate block process too, so remove
redunant invalidate compress pages in f2fs_evict_inode.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix the following coccicheck warning:
./fs/f2fs/sysfs.c:491:41-46: WARNING: conversion to bool not needed here
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For compressed inode, in .{invalidate,release}page, we will call
f2fs_invalidate_compress_pages() to drop all compressed page cache of
current inode.
But we don't need to drop compressed page cache synchronously in
.invalidatepage, because, all trancation paths of compressed physical
block has been covered with f2fs_invalidate_compress_page().
And also we don't need to drop compressed page cache synchronously
in .releasepage, because, if there is out-of-memory, we can count
on page cache reclaim on sbi->compress_inode.
BTW, this patch may fix the issue reported below:
https://lore.kernel.org/linux-f2fs-devel/20211202092812.197647-1-changfengnan@vivo.com/T/#u
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
https://bugzilla.kernel.org/show_bug.cgi?id=204137
With below script, we will hit panic during new segment allocation:
DISK=bingo.img
MOUNT_DIR=/mnt/f2fs
dd if=/dev/zero of=$DISK bs=1M count=105
mkfs.f2fe -a 1 -o 19 -t 1 -z 1 -f -q $DISK
mount -t f2fs $DISK $MOUNT_DIR -o "noinline_dentry,flush_merge,noextent_cache,mode=lfs,io_bits=7,fsync_mode=strict"
for (( i = 0; i < 4096; i++ )); do
name=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 10`
mkdir $MOUNT_DIR/$name
done
umount $MOUNT_DIR
rm $DISK
--- Core dump ---
Call Trace:
allocate_segment_by_default+0x9d/0x100 [f2fs]
f2fs_allocate_data_block+0x3c0/0x5c0 [f2fs]
do_write_page+0x62/0x110 [f2fs]
f2fs_outplace_write_data+0x43/0xc0 [f2fs]
f2fs_do_write_data_page+0x386/0x560 [f2fs]
__write_data_page+0x706/0x850 [f2fs]
f2fs_write_cache_pages+0x267/0x6a0 [f2fs]
f2fs_write_data_pages+0x19c/0x2e0 [f2fs]
do_writepages+0x1c/0x70
__filemap_fdatawrite_range+0xaa/0xe0
filemap_fdatawrite+0x1f/0x30
f2fs_sync_dirty_inodes+0x74/0x1f0 [f2fs]
block_operations+0xdc/0x350 [f2fs]
f2fs_write_checkpoint+0x104/0x1150 [f2fs]
f2fs_sync_fs+0xa2/0x120 [f2fs]
f2fs_balance_fs_bg+0x33c/0x390 [f2fs]
f2fs_write_node_pages+0x4c/0x1f0 [f2fs]
do_writepages+0x1c/0x70
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x273/0x5c0
wb_writeback+0xff/0x2e0
wb_workfn+0xa1/0x370
process_one_work+0x138/0x350
worker_thread+0x4d/0x3d0
kthread+0x109/0x140
ret_from_fork+0x25/0x30
The root cause here is, with IO alignment feature enables, in worst
case, we need F2FS_IO_SIZE() free blocks space for single one 4k write
due to IO alignment feature will fill dummy pages to make IO being
aligned.
So we will easily run out of free segments during non-inline directory's
data writeback, even in process of foreground GC.
In order to fix this issue, I just propose to reserve additional free
space for IO alignment feature to handle worst case of free space usage
ratio during FGGC.
Fixes: 0a595ebaaa ("f2fs: support IO alignment for DATA and NODE writes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, nat_bit area may be persisted across boundary of CP area during
nat_bit rebuilding.
Fixes: 94c821fb28 ("f2fs: rebuild nat_bits during umount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs: support fault injection for f2fs_trylock_op()
This patch supports to inject fault into f2fs_trylock_op().
Usage:
a) echo 65536 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=65536 <dev> <mountpoint>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Wenqing Liu reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215235
- Overview
page fault in f2fs_setxattr() when mount and operate on corrupted image
- Reproduce
tested on kernel 5.16-rc3, 5.15.X under root
1. unzip tmp7.zip
2. ./single.sh f2fs 7
Sometimes need to run the script several times
- Kernel dump
loop0: detected capacity change from 0 to 131072
F2FS-fs (loop0): Found nat_bits in checkpoint
F2FS-fs (loop0): Mounted with checkpoint version = 7548c2ee
BUG: unable to handle page fault for address: ffffe47bc7123f48
RIP: 0010:kfree+0x66/0x320
Call Trace:
__f2fs_setxattr+0x2aa/0xc00 [f2fs]
f2fs_setxattr+0xfa/0x480 [f2fs]
__f2fs_set_acl+0x19b/0x330 [f2fs]
__vfs_removexattr+0x52/0x70
__vfs_removexattr_locked+0xb1/0x140
vfs_removexattr+0x56/0x100
removexattr+0x57/0x80
path_removexattr+0xa3/0xc0
__x64_sys_removexattr+0x17/0x20
do_syscall_64+0x37/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
The root cause is in __f2fs_setxattr(), we missed to do sanity check on
last xattr entry, result in out-of-bound memory access during updating
inconsistent xattr data of target inode.
After the fix, it can detect such xattr inconsistency as below:
F2FS-fs (loop11): inode (7) has invalid last xattr entry, entry_size: 60676
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has invalid last xattr entry, entry_size: 47736
Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to mitigate lock contention between f2fs_write_checkpoint and
f2fs_get_node_info along with nat_tree_lock.
The idea is, if checkpoint is currently running, other threads that try to grab
nat_tree_lock would be better to wait for checkpoint.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Get rid of old erofs_get_meta_page() within zmap operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Finally, erofs_get_meta_page() is useless. Get rid of it!
Link: https://lore.kernel.org/r/20220102040017.51352-6-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within xattr operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102040017.51352-5-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within super operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102081317.109797-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Get rid of old erofs_get_meta_page() within inode operations by
using on-stack meta buffers in order to prepare subpage and folio
features.
Link: https://lore.kernel.org/r/20220102040017.51352-3-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
In order to support subpage and folio for all uncompressed files,
introduce meta buffer descriptors, which can be effectively stored
on stack, in place of meta page operations.
This converts the uncompressed data path to meta buffers.
Link: https://lore.kernel.org/r/20220102040017.51352-2-hsiangkao@linux.alibaba.com
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This patch prints the cluster node address if a non-cluster node
(according to the dlm config setting) tries to connect. The current
hexdump call will print in a different loglevel and only available if
dynamic debug is enabled. Additional we using the ip address format
strings to print an IETF ip4/6 string represenation.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
btrfs_free_space_ctl::private is either unset or it always points to
struct btrfs_block_group when it is set. So there's no point in keeping
the unhelpful 'private' name and keeping it an untyped pointer. Change
both the type and name to be self-describing. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no point in the function taking an fs_info and a
btrfs_free_space because the ctl passed always belongs to the block
group. Furthermore fs_info can be referenced from the block group. No
functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only difference between the two is whether btrfs_free_space::bytes
is adjusted. Instead of having 2 separate functions control this
behavior via an additional parameter and make them one function instead.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only difference is the former adjusts btrfs_free_space::bytes
member. Consolidate the two function into 1 and add a bool parameter
which controls whether the adjustment is made or not. No functional
changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we are going to have multiple copies of these trees. To
facilitate this we need a way to lookup the different roots we are
looking for. Handle this by adding a global root rb tree that is
indexed on the root->root_key. Then instead of loading the roots at
mount time with individually targeted keys, simply search the tree_root
for anything with the specific objectid we want. This will make it
straightforward to support both old style and new style file systems.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't set SHAREABLE on the extent root, we don't need to have this
safety check here.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to have multiple free space roots in the future, so adjust
all the users of the free space root to use a helper to access the root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiple csum roots in the future, so convert all
users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
->_csum_root so we can easily find remaining users in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places where we skip doing csums if we mounted with one of
the rescue options that ignores bad csum roots. In the future when
there are multiple csum roots it'll be costly to check and see if there
are any missing csum roots, so simply add a flag to indicate the fs
should skip loading csums in case of errors.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we may have multiple csum roots, so simply check the
objectid is for a csum root instead of checking against ->csum_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we start having multiple extent roots we'll need to use a helper to
get to the correct extent_root. Rename fs_info->extent_root to
_extent_root and convert all of the users of the extent root to using
the btrfs_extent_root() helper. This will allow us to easily clean up
the remaining direct accesses in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we're going to have multiple csum and extent root trees,
so init the roots block_rsv at setup_root time based on their root key
objectid.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only need the root to start a transaction, and since it's a global
root we can pick anything, change to the tree_root as we'll have a lot
of extent roots in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have many extent_roots soon, and we don't need a root
here necessarily as we're not modifying anything, we're just getting the
trans handle so we can have an accurate view of references, so use the
tree_root here.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're just using the extent_root to set the chunk owner to
root_key->objectid, which is BTRFS_EXTENT_TREE_OBJECTID, so use that
directly instead of using the root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only defrag leaves on roots that have SHAREABLE set, so we don't need
to check if we're the extent root as it doesn't have SHAREABLE set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a leftover from when we used to independently swap the extent
root's commit root and the fs tree commit roots. At the time I simply
changed the helper to a list_add. There's actually no reason to not add
the extent root to the switch commit root at this point, we don't care
about the order we do the switching since it's all done under the
commit_root_sem.
If we re-mark the extent root dirty after adding it to the
switch_commits list we'll see that BTRFS_ROOT_DIRTY isn't set and then
list_move it back onto the dirty list, and then we'll redo the tree
update and everything will be ok.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're only using this to start the transaction with to possibly allocate
a chunk. It doesn't really matter which root to use, but with extent
tree v2 we'll need a bytenr to look up a extent root which makes the
usage of the extent_root awkward here. Simply change it to the
chunk_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we'll have a different extent root based on where
the bytenr is located, so adjust the remove_extent_backref() helper and
it's helpers to pass the extent_root around.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With extent tree v2 we will have a separate root to hold the block group
items. Add a btrfs_block_group_root() that will return the appropriate
root given the flags of the fs, and convert all functions that need to
modify block group items to use the helper.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're looking for leafs that point to a data extent we want to record
the extent items that point at our bytenr. At this point we have the
reference and we know for a fact that this leaf should have a reference
to our bytenr. However if there's some sort of corruption we may not
find any references to our leaf, and thus could end up with eie == NULL.
Replace this BUG_ON() with an ASSERT() and then return -EUCLEAN for the
mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We search for an extent entry with .offset = -1, which shouldn't be a
thing, but corruption happens. Add an ASSERT() for the developers,
return -EUCLEAN for mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We define __TRANS_DUMMY always, so this extra ifdef stuff is not needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This comment was much closer to the related code when it was originally
added, but has slowly migrated north far from its ancestral lands. Move
it back down with its people.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass in the path, but use btrfs_next_item() using the root we
searched with. Pass the root down to add_keyed_refs() instead of the
fs_info so we can continue to use the same root we searched with.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nobody is using this anymore, remove it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root on the trans->root can be anything, and generally we're
committing from the transaction kthread so it's usually the tree_root.
Change this to just take an fs_info, and to maintain compatibility
simply put the ROOT_TREE_OBJECTID as the root objectid for the
tracepoint. This will allow use to remove trans->root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we do this awful thing where we get another ref on a trans
handle, async off that handle and commit the transaction from that work.
Because we do this we have to mess with current->journal_info and the
freeze counting stuff.
We already have an async thing to kick for the transaction commit, the
transaction kthread. Replace this work struct with a flag on the
fs_info to tell the kthread to go ahead and commit even if it's before
our timeout. Then we can drastically simplify the async transaction
commit path.
Note: this can be simplified and functionality based on the pending
operation COMMIT.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add note ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is no longer used, the -o nobarrier is handled by
BTRFS_MOUNT_NOBARRIER. Remove the flag.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reshuffle the code inside the first loop of tree_search_offset so that
one if() is eliminated and the becomes more linear.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When debugging calc_bio_boundaries(), I found that even for RAID1
metadata, we're following stripe length to calculate stripe boundary.
# mkfs.btrfs -m raid1 -d raid1 /dev/test/scratch[12]
# mount /dev/test/scratch /mnt/btrfs
# xfs_io -f -c "pwrite 0 64K" /mnt/btrfs/file
# umount
Above very basic operations will make calc_bio_boundaries() to report
the following result:
submit_extent_page: r/i=1/1 file_offset=22036480 len_to_stripe_boundary=49152
submit_extent_page: r/i=1/1 file_offset=30474240 len_to_stripe_boundary=65536
...
submit_extent_page: r/i=1/1 file_offset=30523392 len_to_stripe_boundary=16384
submit_extent_page: r/i=1/1 file_offset=30457856 len_to_stripe_boundary=16384
submit_extent_page: r/i=5/257 file_offset=0 len_to_stripe_boundary=65536
submit_extent_page: r/i=5/257 file_offset=65536 len_to_stripe_boundary=65536
submit_extent_page: r/i=1/1 file_offset=30490624 len_to_stripe_boundary=49152
submit_extent_page: r/i=1/1 file_offset=30507008 len_to_stripe_boundary=32768
Where "r/i" is the rootid and inode, 1/1 means they metadata.
The remaining names match the member used in kernel.
Even all data/metadata are using RAID1, we're still following stripe
length.
[CAUSE]
This behavior is caused by a wrong condition in btrfs_get_io_geometry():
if (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
/* Fill using stripe_len */
len = min_t(u64, em->len - offset, max_len);
} else {
len = em->len - offset;
}
This means, only for SINGLE we will not follow stripe_len.
However for profiles like RAID1*, DUP, they don't need to bother
stripe_len.
This can lead to unnecessary bio split for RAID1*/DUP profiles, and can
even be a blockage for future zoned RAID support.
[FIX]
Introduce one single-use macro, BTRFS_BLOCK_GROUP_STRIPE_MASK, and
change the condition to only calculate the length using stripe length
for stripe based profiles.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a small optimisation since the currently 'entry' is already
checked in the if () {} else if {} construct above the loop. In essence
the first iteration of the final while loop is redundant. To eliminate
this extra check simply get the next entry at the beginning of the loop.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I noticed a few corner cases when looking at my bytes_index patch for
obvious bugs, so add a bunch of tests to validate proper behavior of the
bytes_index tree. A couple of basic tests to make sure it puts things
in the correct order, and then more complicated tests to make sure it
re-arranges bitmap entries properly and does the right thing when we try
to make allocations.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we index free space on offset only, because usually we have a
hint from the allocator that we want to honor for locality reasons.
However if we fail to use this hint we have to go back to a brute force
search through the free space entries to find a large enough extent.
With sufficiently fragmented free space this becomes quite expensive, as
we have to linearly search all of the free space entries to find if we
have a part that's long enough.
To fix this add a cached rb tree to index based on free space entry
bytes. This will allow us to quickly look up the largest chunk in the
free space tree for this block group, and stop searching once we've
found an entry that is too small to satisfy our allocation. We simply
choose to use this tree if we're searching from the beginning of the
block group, as we know we do not care about locality at that point.
I wrote an allocator test that creates a 10TiB ram backed null block
device and then fallocates random files until the file system is full.
I think go through and delete all of the odd files. Then I spawn 8
threads that fallocate 64MiB files (1/2 our extent size cap) until the
file system is full again. I use bcc's funclatency to measure the
latency of find_free_extent. The baseline results are
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 10356 |**** |
512 -> 1023 : 58242 |************************* |
1024 -> 2047 : 74418 |******************************** |
2048 -> 4095 : 90393 |****************************************|
4096 -> 8191 : 79119 |*********************************** |
8192 -> 16383 : 35614 |*************** |
16384 -> 32767 : 13418 |***** |
32768 -> 65535 : 12811 |***** |
65536 -> 131071 : 17090 |******* |
131072 -> 262143 : 26465 |*********** |
262144 -> 524287 : 40179 |***************** |
524288 -> 1048575 : 55469 |************************ |
1048576 -> 2097151 : 48807 |********************* |
2097152 -> 4194303 : 26744 |*********** |
4194304 -> 8388607 : 35351 |*************** |
8388608 -> 16777215 : 13918 |****** |
16777216 -> 33554431 : 21 | |
avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690
And the patch results are
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 6883 |** |
512 -> 1023 : 54346 |********************* |
1024 -> 2047 : 79170 |******************************** |
2048 -> 4095 : 98890 |****************************************|
4096 -> 8191 : 81911 |********************************* |
8192 -> 16383 : 27075 |********** |
16384 -> 32767 : 14668 |***** |
32768 -> 65535 : 13251 |***** |
65536 -> 131071 : 15340 |****** |
131072 -> 262143 : 26715 |********** |
262144 -> 524287 : 43274 |***************** |
524288 -> 1048575 : 53870 |********************* |
1048576 -> 2097151 : 55368 |********************** |
2097152 -> 4194303 : 41036 |**************** |
4194304 -> 8388607 : 24927 |********** |
8388608 -> 16777215 : 33 | |
16777216 -> 33554431 : 9 | |
avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042
There's a little variation in the amount of calls done because of timing
of the threads with metadata requirements, but the avg, total, and
count's are relatively consistent between runs (usually within 2-5% of
each other). As you can see here we have around a 30% decrease in
average latency with a 30% decrease in overall time spent in
find_free_extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While adding self tests for my space index change I was hitting a
problem where the space indexed tree wasn't returning the expected
->max_extent_size. This is because we will skip searching any entry
that doesn't have ->bytes >= the amount of bytes we want. However we'll
still set the max_extent_size based on that entry. The problem is if we
don't search the bitmap we won't have ->max_extent_size set properly, so
we can't really trust it.
This doesn't really result in a problem per-se, it can just result in us
not finding contiguous area that may exist. Fix the max_extent_size
helper to return ->bytes if ->max_extent_size isn't set, and add a big
comment explaining why we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We use @nr_written to record how many pages have been started by
btrfs_run_delalloc_range().
Currently there are only two cases that would populate @nr_written:
- Inline extent creation
- Compressed write
But both cases will also set @page_started to one.
In fact, in writepage_delalloc() we have the following code, showing
that @nr_written is really only utilized for above two cases:
/* did the fill delalloc function already unlock and start
* the IO?
*/
if (page_started) {
/*
* we've unlocked the page, so we can't update
* the mapping's writeback index, just update
* nr_to_write.
*/
wbc->nr_to_write -= nr_written;
return 1;
}
But for such cases, writepage_delalloc() will return 1, and exit
__extent_writepage() without going through __extent_writepage_io().
Thus this means, inside __extent_writepage_io(), we always get
@nr_written as 0.
So this patch is going to remove the unnecessary parameter from the
following functions:
- writepage_delalloc()
As @nr_written passed in is always the initial value 0.
Although inside that function, we still need a local @nr_written
to update wbc->nr_to_write.
- __extent_writepage_io()
As explained above, @nr_written passed in can only be 0.
This also means we can remove one update_nr_written() call.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to need the root for btrfs_reserve_metadata_bytes to check the
orphan cleanup state, but we no longer need that, we simply need the
fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and
change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the
same as they simply call btrfs_reserve_metadata_bytes() and then
manipulate the block_rsv that is being used.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we don't care about the stage of the orphan_cleanup_state,
simply replace it with a bit on ->state to make sure we don't call the
orphan cleanup every time we wander into this root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is very old code before we were stealing from the global reserve
during evict. We have proper ways to steal from the global reserve
while we're evicting, so rip out this code as it's no longer necessary.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I forgot to convert this over when I introduced the global reserve
stealing code to the space flushing code. Evict was simply trying to
make its reservation and then if it failed it would steal from the
global rsv, which is racey because it's outside of the normal ticketing
code.
Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT,
and then make the priority flushing path do the steal for us.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're going to use this helper in the priority flushing loop, move this
check into the helper to simplify the logic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we're dropping locks before we enter the priority flushing loops
we could have had our ticket granted before we got the space_info->lock.
So add this check to avoid doing some extra flushing in the priority
flushing cases.
The case in priority_reclaim_metadata_space is an optimization. Think
we came in to reserve, we didn't have the space, we added our ticket to
the list. But at the same time somebody was waiting on the space_info
lock to add space and do btrfs_try_granting_ticket(), so we drop the
lock, get satisfied, come in to do our loop, and we have been
satisfied.
This is the priority reclaim path, so to_reclaim could be !0 still
because we may have only satisfied the priority tickets and still left
non priority tickets on the list. We would then have to_reclaim but
->bytes == 0.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add note about the optimization ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the error case for the priority tickets is handled where we
deal with all of the tickets, priority and non-priority. This is OK in
general, but it makes for some awkward locking. We take and drop the
space_info->lock back to back because of these different types of
tickets.
Rework the code to handle priority ticket failures in their respective
helpers. This allows us to be less wonky with our space_info->lock
usage, and means that the main handler simply has to check
ticket->error, as the ticket is guaranteed to be off any list and
completely handled by the time it exits one of the handlers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When mounting a device, we are reporting the zones twice: once for
checking the zone attributes in btrfs_get_dev_zone_info and once for
loading block groups' zone info in
btrfs_load_block_group_zone_info(). With a lot of block groups, that
leads to a lot of REPORT ZONE commands and slows down the mount
process.
This patch introduces a zone info cache in struct
btrfs_zoned_device_info. The cache is populated while in
btrfs_get_dev_zone_info() and used for
btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
commands. The zone cache is then released after loading the block
groups, as it will not be much effective during the run time.
Benchmark: Mount an HDD with 57,007 block groups
Before patch: 171.368 seconds
After patch: 64.064 seconds
While it still takes a minute due to the slowness of loading all the
block groups, the patch reduces the mount time by 1/3.
Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/
Fixes: 5b31646898 ("btrfs: get zone information of zoned block devices")
CC: stable@vger.kernel.org
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit ba8a9d0795 ("Btrfs: delete the entire async bio submission
framework") removed submit workqueues, the parameter fs_devices is not used
anymore.
Remove it, no functional changes.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the transaction commit path we are acquiring the tree log mutex too
early and we have a stale comment because:
1) It mentions a function named btrfs_commit_tree_roots(), which does not
exists anymore, it was the old name of commit_cowonly_roots(), renamed
a very long time ago by commit 5d4f98a28c ("Btrfs: Mixed back
reference (FORWARD ROLLING FORMAT CHANGE)"));
2) It mentions that we need to acquire the tree log mutex at that point
to ensure we have no running log writers. That is not correct anymore,
for many years at least, since we are guaranteed that we do not have
any log writers at that point simply because we have set the state of
the transaction to TRANS_STATE_COMMIT_DOING and have waited for all
writers to complete - meaning no one can log until we change the state
of the transaction to TRANS_STATE_UNBLOCKED. Any attempts to join the
transaction or start a new one will block until we do that state
transition;
3) The comment mentions a "trans mutex" which doesn't exists since 2011,
commit a4abeea41a ("Btrfs: kill trans_mutex") removed it;
4) The current use of the tree log mutex is to ensure proper serialization
of super block writes - if someone started a new transaction and uses it
for logging, it will wait for the previous transaction to write its
super block before writing the super block when attempting to sync the
log.
So acquire the tree log mutex only when it's absolutely needed, before
setting the transaction state to TRANS_STATE_UNBLOCKED, fix and move the
stale comment, add some assertions and new comments where appropriate.
Also, this has no effect on concurrency or performance, since the new
start of the critical section is still when the transaction is in the
state TRANS_STATE_COMMIT_DOING.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_prepare_sprout() splices seed devices into its own struct fs_devices,
so that its parent function btrfs_init_new_device() can add the new sprout
device to fs_info->fs_devices.
Both btrfs_prepare_sprout() and btrfs_init_new_device() need
device_list_mutex. But they are holding it separately, thus create a
small race window. Close it and hold device_list_mutex across both
functions btrfs_init_new_device() and btrfs_prepare_sprout().
Split btrfs_prepare_sprout() into btrfs_init_sprout() and
btrfs_setup_sprout(). This split is essential because device_list_mutex
must not be held for allocations in btrfs_init_sprout() but must be held
for btrfs_setup_sprout(). So now a common device_list_mutex can be used
between btrfs_init_new_device() and btrfs_setup_sprout().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Declare int seeding_dev as a bool. Also, move its declaration a line
below to adjust packing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Again, I don't think this was ever used since iterate_dir_item() is only
used for xattrs. No functional change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As far as I can tell, this was never used. No functional change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually
the offset of the end of the data the item points to. In fact all of
the helpers that we use btrfs_item_end_nr() use data in their name, like
BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end()
to make it clear what this helper is giving us.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can
be collapsed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), and then rework the _nr()
helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
rename all of the callers to the new helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last remaining place where we have the pattern of
item = btrfs_item_nr(slot)
<do something with the item>
are the token helpers. Handle this by introducing token helpers that
will do the btrfs_item_nr() work inside of the helper itself, and then
convert all users of the btrfs_item token helpers to the new _nr()
variants.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of getting the btrfs_item for this, simply pass in the slot of
the item and then use the btrfs_item_size_nr() helper inside of
btrfs_file_extent_inline_item_len().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the pattern of
item = btrfs_item_nr(slot);
btrfs_set_item_*(leaf, item);
in a bunch of places in our code. Fix this by adding
btrfs_set_item_*_nr() helpers which will do the appropriate work, and
replace those calls with
btrfs_set_item_*_nr(leaf, slot);
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this pattern in a lot of places
item = btrfs_item_nr(slot);
btrfs_item_size(leaf, item);
when we could simply use
btrfs_item_size(leaf, slot);
Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the
_nr variation of the helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we log only dir index keys when logging a directory, we no longer
need to deal with dir item keys in the log replay code for replaying
directory deletes. This is also true for the case when we replay a log
tree created by a kernel that still logs dir items.
So remove the remaining code of the replay of directory deletes algorithm
that deals with dir item keys.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, when logging a directory, we copy both dir items and dir index
items from the fs/subvolume tree to the log tree. Both items have exactly
the same data (same struct btrfs_dir_item), the difference lies in the key
values, where a dir index key contains the index number of a directory
entry while the dir item key does not, as it's used for doing fast lookups
of an entry by name, while the former is used for sorting entries when
listing a directory.
We can exploit that and log only the dir index items, since they contain
all the information needed to correctly add, replace and delete directory
entries when replaying a log tree. Logging only the dir index items is
also backward and forward compatible: an unpatched kernel (without this
change) can correctly replay a log tree generated by a patched kernel
(with this patch), and a patched kernel can correctly replay a log tree
generated by an unpatched kernel.
The backward compatibility is ensured because:
1) For inserting a new dentry: a dentry is only inserted when we find a
new dir index key - we can only insert if we know the dir index offset,
which is encoded in the dir index key's offset;
2) For deleting dentries: during log replay, before adding or replacing
dentries, we first replay dentry deletions. Whenever we find a dir item
key or a dir index key in the subvolume/fs tree that is not logged in
a range for which the log tree is authoritative, we do the unlink of
the dentry, which removes both the existing dir item key and the dir
index key. Therefore logging just dir index keys is enough to ensure
dentry deletions are correctly replayed;
3) For dentry replacements: they work when we log only dir index keys
and this is mostly due to a combination of 1) and 2). If we replace a
dentry with name "foobar" to point from inode A to inode B, then we
know the dir index key for the new dentry is different from the old
one, as it has an index number (key offset) larger than the old one.
This results in replaying a deletion, through replay_dir_deletes(),
that causes the old dentry to be removed, both the dir item key and
the dir index key, as mentioned at 2). Then when processing the new
dir index key, we add the new dentry, adding both a new dir item key
and a new index key pointing to inode B, as stated in 1).
The forward compatibility, the ability for a patched kernel to replay a
log created by an older, unpatched kernel, comes from the changes required
for making sure we are able to replay a log that only contains dir index
keys - we simply ignore every dir item key we find.
So modify directory logging to log only dir index items, and modify the
log replay process to ignore dir item keys, from log trees created by an
unpatched kernel, and process only with dir index keys. This reduces the
amount of logged metadata by about half, and therefore the time spent
logging or fsyncing large directories (less CPU time and less IO).
The following test script was used to measure this change:
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_NEW_FILES=1000000
NUM_FILE_DELETES=10000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
# sync to force transaction commit and wipeout the log.
sync
del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The tests were run on a physical machine, with a non-debug kernel (Debian's
default kernel config), for different values of $NUM_NEW_FILES and
$NUM_FILE_DELETES, and the results were the following:
** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
dir fsync took 8412 ms after adding 1000000 files
dir fsync took 500 ms after deleting 10000 files
** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
dir fsync took 4252 ms after adding 1000000 files (-49.5%)
dir fsync took 269 ms after deleting 10000 files (-46.2%)
** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 745 ms after adding 100000 files
dir fsync took 59 ms after deleting 1000 files
** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 404 ms after adding 100000 files (-45.8%)
dir fsync took 31 ms after deleting 1000 files (-47.5%)
** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 67 ms after adding 10000 files
dir fsync took 9 ms after deleting 1000 files
** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
dir fsync took 36 ms after adding 10000 files (-46.3%)
dir fsync took 5 ms after deleting 1000 files (-44.4%)
** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
dir fsync took 9 ms after adding 1000 files
dir fsync took 4 ms after deleting 100 files
** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
dir fsync took 7 ms after adding 1000 files (-22.2%)
dir fsync took 3 ms after deleting 100 files (-25.0%)
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since both unused block groups and reclaim bgs lists are protected by
unused_bgs_lock then free them in the same critical section without
doing an extra unlock/lock pair.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When enabling quotas, we attempt to commit a transaction while holding the
mutex fs_info->qgroup_ioctl_lock. This can result on a deadlock with other
quota operations such as:
- qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE;
- adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN.
This is because these operations join a transaction and after that they
attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex
after joining or starting a transaction is a pattern followed everywhere
in qgroups, so the quota enablement operation is the one at fault here,
and should not commit a transaction while holding that mutex.
Fix this by making the transaction commit while not holding the mutex.
We are safe from two concurrent tasks trying to enable quotas because
we are serialized by the rw semaphore fs_info->subvol_sem at
btrfs_ioctl_quota_ctl(), which is the only call site for enabling
quotas.
When this deadlock happens, it produces a trace like the following:
INFO: task syz-executor:25604 blocked for more than 143 seconds.
Not tainted 5.15.0-rc6 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201
btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120
btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline]
btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f86920b2c4d
RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d
RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008
RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0
R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0
INFO: task syz-executor:25628 blocked for more than 143 seconds.
Not tainted 5.15.0-rc6 #4
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:4940 [inline]
__schedule+0xcd9/0x2530 kernel/sched/core.c:6287
schedule+0xd3/0x270 kernel/sched/core.c:6366
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
__mutex_lock_common kernel/locking/mutex.c:669 [inline]
__mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548
btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline]
btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:874 [inline]
__se_sys_ioctl fs/ioctl.c:860 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/
Fixes: 340f1aa27f ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
CC: stable@vger.kernel.org # 4.19
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a direct IO write against a file range that either has
preallocated extents in that range or has regular extents and the file
has the NOCOW attribute set, the write fails with -ENOSPC when all of
the following conditions are met:
1) There are no data blocks groups with enough free space matching
the size of the write;
2) There's not enough unallocated space for allocating a new data block
group;
3) The extents in the target file range are not shared, neither through
snapshots nor through reflinks.
This is wrong because a NOCOW write can be done in such case, and in fact
it's possible to do it using a buffered IO write, since when failing to
allocate data space, the buffered IO path checks if a NOCOW write is
possible.
The failure in direct IO write path comes from the fact that early on,
at btrfs_dio_iomap_begin(), we try to allocate data space for the write
and if it that fails we return the error and stop - we never check if we
can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
if we can do a NOCOW write into the range, or a subset of the range, and
then release the previously reserved data space.
Fix this by doing the data reservation only if needed, when we must COW,
at btrfs_get_blocks_direct_write() instead of doing it at
btrfs_dio_iomap_begin(). This also simplifies a bit the logic and removes
the inneficiency of doing unnecessary data reservations.
The following example test script reproduces the problem:
$ cat dio-nocow-enospc.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
# Use a small fixed size (1G) filesystem so that it's quick to fill
# it up.
# Make sure the mixed block groups feature is not enabled because we
# later want to not have more space available for allocating data
# extents but still have enough metadata space free for the file writes.
mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
mount $DEV $MNT
# Create our test file with the NOCOW attribute set.
touch $MNT/foobar
chattr +C $MNT/foobar
# Now fill in all unallocated space with data for our test file.
# This will allocate a data block group that will be full and leave
# no (or a very small amount of) unallocated space in the device, so
# that it will not be possible to allocate a new block group later.
echo
echo "Creating test file with initial data..."
xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
# Now try a direct IO write against file range [0, 10M[.
# This should succeed since this is a NOCOW file and an extent for the
# range was previously allocated.
echo
echo "Trying direct IO write over allocated space..."
xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
umount $MNT
When running the test:
$ ./dio-nocow-enospc.sh
(...)
Creating test file with initial data...
wrote 943718400/943718400 bytes at offset 0
900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
Trying direct IO write over allocated space...
pwrite: No space left on device
A test case for fstests will follow, testing both this direct IO write
scenario as well as the buffered IO write scenario to make it less likely
to get future regressions on the buffered IO case.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the new per-channel bitmask for reconnect, we have an option to
reconnect the tcp session associated with the channel without reconnecting
the smb session. i.e. if there are still channels to operate on, we can
continue to use the smb session and tcon.
However, there are cases where it makes sense to reconnect the smb session
even when there are active channels underneath. For example for
SMB session expiry.
With this patch, we'll have an option to do either, and use the correct
option for specific cases.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
chan_count keeps track of the total number of channels.
Since at least the primary channel will always be connected,
this value can never go below 1. Warn if that happens.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use ses->chans_need_reconnect bitmask to print the connection
status of each channel under an SMB session.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We use the concept of "binding" when one of the secondary channel
is in the process of connecting/reconnecting to the server. Till this
binding process completes, and the channel is bound to an existing session,
we redirect traffic from other established channels on the binding channel,
effectively blocking all traffic till individual channels get reconnected.
With my last set of commits, we can get rid of this binding serialization.
We now have a bitmap of connection states for each channel. We will use
this bitmap instead for tracking channel status.
Having a bitmap also now enables us to keep the session alive, as long
as even a single channel underneath is alive.
Unfortunately, this also meant that we need to supply the tcp connection
info for the channel during all negotiate and session setup functions.
These changes have resulted in a slightly bigger code churn.
However, I expect perf and robustness improvements in the mchan scenario
after this change.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We needed a way to identify the channels under the smb session
which are in reconnect, so that the traffic to other channels
can continue. So I replaced the bool need_reconnect with
a bitmask identifying all the channels that need reconnection
(named chans_need_reconnect). When a channel needs reconnection,
the bit corresponding to the index of the server in ses->chans
is used to set this bitmask. Checking if no channels or all
the channels need reconnect then becomes very easy.
Also wrote some helper macros for checking and setting the bits.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The pointer p is being assigned with a value that is never read. The
pointer is being re-assigned a different value inside the do-while
loop. The assignment is redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This gets the statistics correct for large folios by modifying the
counters by the number of pages in the folio instead of by 1.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
'buffer_index_array' really looks like a bitmap. So it should be allocated
as such.
When kzalloc is called, a number of bytes is expected, but a number of
longs is passed instead.
In get(), if not enough memory is allocated, un-allocated memory may be
read or written.
So use bitmap_zalloc() to safely allocate the correct memory size and
avoid un-expected behavior.
While at it, change the corresponding kfree() into bitmap_free() to keep
the semantic.
Fixes: ea2c9c9f65 ("orangefs: bufmap rewrite")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the orangfs code to use default_groups field which has been
the preferred way since aa30f47cf6 ("kobject: Add support for default
attribute groups to kobj_type") so that we can soon get rid of the
obsolete default_attrs field.
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-12-30
The following pull-request contains BPF updates for your *net-next* tree.
We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).
The main changes are:
1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.
2) Beautify and de-verbose verifier logs, from Christy.
3) Composable verifier types, from Hao.
4) bpf_strncmp helper, from Hou.
5) bpf.h header dependency cleanup, from Jakub.
6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.
7) Sleepable local storage, from KP.
8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Make sure that finish_mount_kattr() is called after mount_kattr was
succesfully built in both the success and failure case to prevent
leaking any references we took when we built it. We returned early if
path lookup failed thereby risking to leak an additional reference we
took when building mount_kattr when an idmapped mount was requested.
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 9caccd4154 ("fs: introduce MOUNT_ATTR_IDMAP")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
commit 077cdda764 ("net/mlx5e: TC, Fix memory leak with rules with internal port")
commit 31108d142f ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
commit 4390c6edc0 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/
net/smc/smc_wr.c
commit 49dc9013e3 ("net/smc: Use the bitmap API when applicable")
commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
bitmap_zero()/memset() is removed by the fix
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduces erofs compressed tail-packing inline support.
This approach adds a new field called `h_idata_size' in the
per-file compression header to indicate the encoded size of
each tail-packing pcluster.
At runtime, it will find the start logical offset of the tail
pcluster when initializing per-inode zmap and record such
extent (headlcn, idataoff) information to the in-memory inode.
Therefore, follow-on requests can directly recognize if one
pcluster is a tail-packing inline pcluster or not.
Link: https://lore.kernel.org/r/20211228054604.114518-6-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Currently, we have already support tail-packing inline for
uncompressed file, let's also implement this for compressed
files to save I/Os and storage space.
Different from normal pclusters, compressed data is available
in advance because of other metadata I/Os. Therefore, they
directly move into the bypass queue without extra I/O submission.
It's the last compression feature before folio/subpage support.
Link: https://lore.kernel.org/r/20211228232919.21413-1-xiang@kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Previously, compressed data was assumed as block-aligned. This
should be changed due to in-block tail-packing inline data.
Link: https://lore.kernel.org/r/20211228054604.114518-4-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the nilfs2 code to use default_groups field which has been
the preferred way since aa30f47cf6 ("kobject: Add support for default
attribute groups to kobj_type") so that we can soon get rid of the
obsolete default_attrs field.
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Link: https://lore.kernel.org/r/20211228144252.390554-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove some warnings found by running scripts/kernel-doc,
which is caused by using 'make W=1'.
fs/ksmbd/smb2pdu.c:623: warning: Function parameter or member
'local_nls' not described in 'smb2_get_name'
fs/ksmbd/smb2pdu.c:623: warning: Excess function parameter 'nls_table'
description in 'smb2_get_name'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
A warning is reported because an invalid argument description, it is found
by running scripts/kernel-doc, which is caused by using 'make W=1'.
fs/ksmbd/smb2pdu.c:3406: warning: Excess function parameter 'user_ns'
description in 'smb2_populate_readdir_entry'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 475d6f9880 ("ksmbd: fix translation in smb2_populate_readdir_entry()")
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix argument list that the kdoc format and script verified in
smb2_set_info_file().
The warnings were found by running scripts/kernel-doc, which is
caused by using 'make W=1'.
fs/ksmbd/smb2pdu.c:5862: warning: Function parameter or member 'req' not
described in 'smb2_set_info_file'
fs/ksmbd/smb2pdu.c:5862: warning: Excess function parameter 'info_class'
description in 'smb2_set_info_file'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: 9496e268e3 ("ksmbd: add request buffer validation in smb2_set_info")
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add the description of @rsp_org in buffer_check_err() kernel-doc comment
to remove a warning found by running scripts/kernel-doc, which is caused
by using 'make W=1'.
fs/ksmbd/smb2pdu.c:4028: warning: Function parameter or member 'rsp_org'
not described in 'buffer_check_err'
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Fixes: cb4517201b ("ksmbd: remove smb2_buf_length in smb2_hdr")
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When RSS mode is enable, windows client do simultaneously send several
session requests to server. There is racy issue using
sess->ntlmssp.cryptkey on N connection : 1 session. So authetication
failed using wrong cryptkey on some session. This patch move cryptkey
to ksmbd_conn structure to use each cryptkey on connection.
Tested-by: Ziwei Xie <zw.xie@high-flyer.cn>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Set ipv4 and ipv6 address in FSCTL_QUERY_NETWORK_INTERFACE_INFO.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Set RSS capable in FSCTL_QUERY_NETWORK_INTERFACE_INFO if netdev has
multi tx queues. And add ksmbd_compare_user() to avoid racy condition
issue in ksmbd_free_user(). because windows client is simultaneously used
to send session setup requests for multichannel connection.
Tested-by: Ziwei Xie <zw.xie@high-flyer.cn>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
These fields are remnants of the not upstreamed SMB1 code.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
The 'share' parameter is no longer used by smb2_get_name() since
commit 265fd1991c ("ksmbd: use LOOKUP_BENEATH to prevent the out of
share access").
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use look_up_OID to decode OIDs rather than
implementing functions.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
To prepare for the upcoming ztailpacking feature, introduce
z_erofs_fixup_insize() and pageofs_in to wrap up the process
to get the exact compressed size via zero padding.
Link: https://lore.kernel.org/r/20211228054604.114518-3-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
To prepare for the upcoming ztailpacking feature and further
cleanups, introduce a unique z_erofs_lz4_decompress_ctx to keep
the context, including inpages, outpages and oend, which are
frequently used by the lz4 decompressor.
No logic changes.
Link: https://lore.kernel.org/r/20211228054604.114518-2-hsiangkao@linux.alibaba.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
We don't need to poll oneshot request if we've got a desired mask in
io_poll_wake(), task_work will clean it up correctly, but as we already
hold a wq spinlock, we can remove ourselves and save on additional
spinlocking in io_poll_remove_entries().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee170a344a18c9ef36b554d806c64caadfd61c31.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's not possible to go forward with the current state of io_uring
polling, we need a more straightforward and easier synchronisation.
There are a lot of problems with how it is at the moment, including
missing events on rewait.
The main idea here is to introduce a notion of request ownership while
polling, no one but the owner can modify any part but ->poll_refs of
struct io_kiocb, that grants us protection against all sorts of races.
Main users of such exclusivity are poll task_work handler, so before
queueing a tw one should have/acquire ownership, which will be handed
off to the tw handler.
The other user is __io_arm_poll_handler() do initial poll arming. It
starts taking the ownership, so tw handlers won't be run until it's
released later in the function after vfs_poll. note: also prevents
races in __io_queue_proc().
Poll wake/etc. may not be able to get ownership, then they need to
increase the poll refcount and the task_work should notice it and retry
if necessary, see io_poll_check_events().
There is also IO_POLL_CANCEL_FLAG flag to notify that we want to kill
request.
It makes cancellations more reliable, enables double multishot polling,
fixes double poll rewait, fixes missing poll events and fixes another
bunch of races.
Even though it adds some overhead for new refcounting, and there are a
couple of nice performance wins:
- no req->refs refcounting for poll requests anymore
- if the data is already there (once measured for some test to be 1-2%
of all apoll requests), it removes it doesn't add atomics and removes
spin_lock/unlock pair.
- works well with multishots, we don't do remove from queue / add to
queue for each new poll event.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b652927c77ed9580ea4330ac5612f0e0848c946.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll
requests doesn't make much sense, and in any case re-adding it
shouldn't be a problem considering batching in tctx_task_work(). We can
remove it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/15699682bf81610ec901d4e79d6da64baa9f70be.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no need to pass the pointer to the kset in the struct
kset_uevent_ops callbacks as no one uses it, so just remove that pointer
entirely.
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Wedson Almeida Filho <wedsonaf@google.com>
Link: https://lore.kernel.org/r/20211227163924.3970661-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmG+uVcACgkQiiy9cAdy
T1EbsQv+NRGXy7WoI4V0QQSY6Q3uWVZwKXwe8MO4+wHex7t5ZDaGDTUboExZ9gnK
W63t/pq+hsb4Z8JI//vWdcVdVgnUexrcbHt3ANKGPPwZz1/uttIismyqLGkw+Ntc
c99NKve+Q6LNuU1TEF+RRDd3wgrT3ZD3odlRW+xN0w6ZQ7BPMz38OlemSmUaYqUc
fh2S3W0Vp3/gxLNfjp1GMrOqjQ8qtTZIzHtLLXHzJKGUYxtq3d77+jP36BAA2Wvn
ufBNFO/Y94gRekKyw+KiW8+E5ceDHNits4SCI/C97txyBnBCbo7gu4TKi+ChDcek
auDqiLrFp/GuuU2D5TuOw8GeFPNQo08O6IAkcR3ftnxcTM8rvqYPdvRNnEkgkHKH
Fu/FWk8wXJLGlymW7bMjeA4bDs0e8z6HjqFWd8d0xeYOismcdO7OZuY4iaC73lCy
Nv/2shlzocO68nNd0kxU7gkSPkv1LynVE5IV55DQaJpryUWe71gClldiNS/HTm4c
OTnxbAfN
=tOhf
-----END PGP SIGNATURE-----
Merge tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd fixes from Steve French:
"Three ksmbd fixes, all for stable as well.
Two fix potential unitialized memory and one fixes a security problem
where encryption is unitentionally disabled from some clients"
* tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1
ksmbd: fix uninitialized symbol 'pntsd_size'
ksmbd: fix error code in ndr_read_int32()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHEmDcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpu1EEACnW4afkADzsu5L0DOYV0k2171cBQFaIeNw
FJb3ebp6iHbWS0BBnseKKqdpFOQoGF6q8lcDdyx/cmUH7217cZTRDt1jxxXPwmGP
i3Zy2HMh6Dq/RX2t0Be7oMEFZeRyB+FIQySvsR4JSfhOeCF+x0kg1Rvwljd4u/t/
sEM4BEPl1ssm1YO5ZXvhnxCoX7eEhS3pbnJj5ZV/y46YTLr9crJQBYfeOHOflEvp
iyTfK6NGy/CDR8a9DTXrYaGRozd7J8ArO+5K4qdLtNP7Tc6+bMrV/CE5l7gwWdlU
06eeRxGt/4NQwfWD2aphRaRM5Qa/VcssMy5ikKf98UKvalzwbaAl7Hk5ywbxdQPV
cgiHTQa1gwpY7FAw+KQwE2m0iAqaLs4eXfFNFt68yJ2NKc3tVWsMXtFskuBdz/E7
TeoXQTEb5PwDsjHmmFbFlW2OuJajEeIJfRG0pXellOgqm8VDruvjemNDFlg80QmC
TpF5y/dbhIht9BHTt+K5mniRZXlVfjePok04XWQgYaIvkYATbGzC4vXMAQdwfW6y
kO6Ez7zaDjQJShDKZvni5VlXhWwWQEjm01f1tLmljPiLmZH+bwGQVin1jHgYpskW
4mUZD2UpwNY5N5QbEfKeQ/xCxl7EQAN8WPY4HoCssK139WFHczoGTECl6/DFC4sB
TrTMi9rRaA==
=CQfA
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Single fix for not clearing kiocb->ki_pos back to 0 for a stream,
destined for stable as well"
* tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-block:
io_uring: zero iocb->ki_pos for stream file types
This series takes care of a couple of TODOs and adds new ones. Update
the TODOs section to reflect current state and future work that needs
to happen.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-5-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Move fast commit stats updating logic to a separate function from
ext4_fc_commit(). This significantly improves readability of
ext4_fc_commit().
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-4-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch drops ext4_fc_start_ineligible() and
ext4_fc_stop_ineligible() APIs. Fast commit ineligible transactions
should simply call ext4_fc_mark_ineligible() after starting the
trasaction.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-3-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch drops all calls to ext4_fc_start_update() and
ext4_fc_stop_update(). To ensure that there are no ongoing journal
updates during fast commit, we also make jbd2_fc_begin_commit() lock
journal for updates. This way we don't have to maintain two different
transaction start stop APIs for fast commit and full commit. This
patch doesn't remove the functions altogether since in future we want
to have inode level locking for fast commits.
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20211223202140.2061101-2-harshads@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
iomap_write_end() does not return a negative errno to indicate an
error, but the number of bytes successfully copied. It cannot return
an error today, so include a debugging assertion like the one in
iomap_unshare_iter().
Fixes: c6f4046865 ("fsdax: decouple zeroing from the iomap buffered I/O code")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211221044450.517558-1-willy@infradead.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
GC task can deadlock in read_cache_page() because it may attempt
to release a page that is actually allocated by another task in
jffs2_write_begin().
The reason is that in jffs2_write_begin() there is a small window
a cache page is allocated for use but not set Uptodate yet.
This ends up with a deadlock between two tasks:
1) A task (e.g. file copy)
- jffs2_write_begin() locks a cache page
- jffs2_write_end() tries to lock "alloc_sem" from
jffs2_reserve_space() <-- STUCK
2) GC task (jffs2_gcd_mtd3)
- jffs2_garbage_collect_pass() locks "alloc_sem"
- try to lock the same cache page in read_cache_page() <-- STUCK
So to avoid this deadlock, hold "alloc_sem" in jffs2_write_begin()
while reading data in a cache page.
Signed-off-by: Kyeong Yoo <kyeong.yoo@alliedtelesis.co.nz>
Signed-off-by: Richard Weinberger <richard@nod.at>
If ubifs_garbage_collect_leb() returns -EAGAIN and ubifs_return_leb
returns error, a LEB will always has a "taken" flag. In this case,
set the ubifs to read-only to prevent a worse situation.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
If ubifs_garbage_collect_leb() returns -EAGAIN and enters the "out"
branch, ubifs_return_leb will execute twice on the same lnum. This
can cause data loss in concurrency situations.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Hulk Robot reported a KASAN report about slab-out-of-bounds:
==================================================================
BUG: KASAN: slab-out-of-bounds in ubifs_change_lp+0x3a9/0x1390 [ubifs]
Read of size 8 at addr ffff888101c961f8 by task fsstress/1068
[...]
Call Trace:
check_memory_region+0x1c1/0x1e0
ubifs_change_lp+0x3a9/0x1390 [ubifs]
ubifs_change_one_lp+0x170/0x220 [ubifs]
ubifs_garbage_collect+0x7f9/0xda0 [ubifs]
ubifs_budget_space+0xfe4/0x1bd0 [ubifs]
ubifs_write_begin+0x528/0x10c0 [ubifs]
Allocated by task 1068:
kmemdup+0x25/0x50
ubifs_lpt_lookup_dirty+0x372/0xb00 [ubifs]
ubifs_update_one_lp+0x46/0x260 [ubifs]
ubifs_tnc_end_commit+0x98b/0x1720 [ubifs]
do_commit+0x6cb/0x1950 [ubifs]
ubifs_run_commit+0x15a/0x2b0 [ubifs]
ubifs_budget_space+0x1061/0x1bd0 [ubifs]
ubifs_write_begin+0x528/0x10c0 [ubifs]
[...]
==================================================================
In ubifs_garbage_collect(), if ubifs_find_dirty_leb returns an error,
lp is an uninitialized variable. But lp.num might be used in the out
branch, which is a random value. If the value is -1 or another value
that can pass the check, soob may occur in the ubifs_change_lp() in
the following procedure.
To solve this problem, we initialize lp.lnum to -1, and then initialize
it correctly in ubifs_find_dirty_leb, which is not equal to -1, and
ubifs_return_leb is executed only when lp.lnum != -1.
if find a retained or indexing LEB and continue to next loop, but break
before find another LEB, the "taken" flag of this LEB will be cleaned
in ubi_return_lebi(). This bug has also been fixed in this patch.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
The snprintf() function returns the number of bytes (not including the
NUL terminator) which would have been printed if there were enough
space. So it can be greater than UBIFS_DFS_DIR_LEN. And actually if
it equals UBIFS_DFS_DIR_LEN then that's okay so this check is too
strict.
Fixes: 9a620291fc01 ("ubifs: Export filesystem error counters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Not all ubifs filesystem errors are propagated to userspace.
Export bad magic, bad node and crc errors via sysfs. This allows userspace
to notice filesystem errors:
/sys/fs/ubifs/ubiX_Y/errors_magic
/sys/fs/ubifs/ubiX_Y/errors_node
/sys/fs/ubifs/ubiX_Y/errors_crc
The counters are reset to 0 with a remount.
Signed-off-by: Stefan Schaeckeler <sschaeck@cisco.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Repalce kthread_create/wake_up_process() with kthread_run()
to simplify the code.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Found with `codespell -i 3 -w fs/ubifs/**` and proof reading that parts.
Signed-off-by: Alexander Dahl <ada@thorsis.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
i_version mount option is getting lost on remount. This is because the
'i_version' mount option differs from the util-linux mount option
'iversion', but it has exactly the same functionality. We have to
specifically notify the vfs that this is what we want by setting
appropriate flag in fc->sb_flags. Fix it and as a result we can remove
*flags argument from __ext4_remount(); do the same for
__ext4_fill_super().
In addition set out to deprecate ext4 specific 'i_version' mount option
in favor or 'iversion' by kernel version 5.20.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Fixes: cebe85d570 ("ext4: switch to the new mount api")
Link: https://lore.kernel.org/r/20211222104517.11187-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The lazytime and nolazytime mount options were added temporarily back in
2015 with commit a26f49926d ("ext4: add optimization for the lazytime
mount option"). It think it has been enough time for the util-linux with
lazytime support to get widely used. Remove the mount options.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20211222104517.11187-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Switching to the new mount api introduced inconsistency in how the
journalling mode mount option (data=) is handled during a remount.
Ext4 always prevented changing the journalling mode during the remount,
however the new code always fails the remount when the journalling mode
is specified, even if it remains unchanged. Fix it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Fixes: cebe85d570 ("ext4: switch to the new mount api")
Link: https://lore.kernel.org/r/20211220152657.101599-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
io_uring supports using offset == -1 for using the current file position,
and we read that in as part of read/write command setup. For the non-iter
read/write types we pass in NULL for the position pointer, but for the
iter types we should not be passing any anything but 0 for the position
for a stream.
Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we
do, then the request will error with -ESPIPE.
Fixes: ba04291eb6 ("io_uring: allow use of offset == -1 to mean file position")
Link: https://github.com/axboe/liburing/discussions/501
Reported-by: Samuel Williams <samuel.williams@oriontransfer.co.nz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at
the end of files, just like fallocate and RESVSP. Make the behavior
consistent with the other ioctls.
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
While I was running with KASAN and lockdep enabled, I stumbled upon an
KASAN report about a UAF to a freed CIL checkpoint. Looking at the
comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
that the original patch to xfs_defer_finish_noroll should have done
something to lock the CIL to prevent it from switching the CIL contexts
while the predicate runs.
For upper level code that needs to know if a given log item is new
enough not to need relogging, add a new wrapper that takes the CIL
context lock long enough to sample the current CIL context. This is
kind of racy in that the CIL can switch the contexts immediately after
sampling, but that's ok because the consequence is that the defer ops
code is a little slow to relog items.
==================================================================
BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999
CPU: 1 PID: 527999 Comm: fsstress Tainted: G D 5.16.0-rc4-xfsx #rc4
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x1f/0x140
kasan_report.cold+0x83/0xdf
xfs_log_item_in_current_chkpt+0x139/0x160
xfs_defer_finish_noroll+0x3bb/0x1e30
__xfs_trans_commit+0x6c8/0xcf0
xfs_reflink_remap_extent+0x66f/0x10e0
xfs_reflink_remap_blocks+0x2dd/0xa90
xfs_file_remap_range+0x27b/0xc30
vfs_dedupe_file_range_one+0x368/0x420
vfs_dedupe_file_range+0x37c/0x5d0
do_vfs_ioctl+0x308/0x1260
__x64_sys_ioctl+0xa1/0x170
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2c71a2950b
Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
</TASK>
Allocated by task 464064:
kasan_save_stack+0x1e/0x50
__kasan_kmalloc+0x81/0xa0
kmem_alloc+0xcd/0x2c0 [xfs]
xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
xlog_cil_push_work+0x141/0x13d0 [xfs]
process_one_work+0x7f6/0x1380
worker_thread+0x59d/0x1040
kthread+0x3b0/0x490
ret_from_fork+0x1f/0x30
Freed by task 51:
kasan_save_stack+0x1e/0x50
kasan_set_track+0x21/0x30
kasan_set_free_info+0x20/0x30
__kasan_slab_free+0xed/0x130
slab_free_freelist_hook+0x7f/0x160
kfree+0xde/0x340
xlog_cil_committed+0xbfd/0xfe0 [xfs]
xlog_cil_process_committed+0x103/0x1c0 [xfs]
xlog_state_do_callback+0x45d/0xbd0 [xfs]
xlog_ioend_work+0x116/0x1c0 [xfs]
process_one_work+0x7f6/0x1380
worker_thread+0x59d/0x1040
kthread+0x3b0/0x490
ret_from_fork+0x1f/0x30
Last potentially related work creation:
kasan_save_stack+0x1e/0x50
__kasan_record_aux_stack+0xb7/0xc0
insert_work+0x48/0x2e0
__queue_work+0x4e7/0xda0
queue_work_on+0x69/0x80
xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
xlog_cil_force_seq+0x1b7/0x850 [xfs]
xfs_log_force_seq+0x1c7/0x670 [xfs]
xfs_file_fsync+0x7c1/0xa60 [xfs]
__x64_sys_fsync+0x52/0x80
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88804ea5f600
which belongs to the cache kmalloc-256 of size 256
The buggy address is located 8 bytes inside of
256-byte region [ffff88804ea5f600, ffff88804ea5f700)
The buggy address belongs to the page:
page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
head:ffffea00013a9780 order:1 compound_mapcount:0
flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Fixes: 4e919af782 ("xfs: periodically relog deferred intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Hostfs was not setting up the backing device information, which means it
uses the noop bdi. The noop bdi does not have the writeback capability
enabled, which in turns means dirty pages never got written back to
storage.
In other words programs using mmap to write to files on hostfs never
actually got their data written out...
Fix this by simply setting up the bdi with default settings as all the
required code for writeback is already in place.
Signed-off-by: Sjoerd Simons <sjoerd@collabora.com>
Reviewed-by: Christopher Obbard <chris.obbard@collabora.com>
Tested-by: Ritesh Raj Sarraf <ritesh@collabora.com>
Acked-By: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
To make the merge easier, replicate the inlining of __iomap_zero_iter()
into iomap_zero_iter() that is currently in the nvdimm tree.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
The "bufsize" comes from the root user. If "bufsize" is negative then,
because of type promotion, neither of the validation checks at the start
of the function are able to catch it:
if (bufsize < sizeof(struct xfs_attrlist) ||
bufsize > XFS_XATTR_LIST_MAX)
return -EINVAL;
This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in
kvmalloc_node(). Fix this by changing the type from int to size_t.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Since kernel commit 1abcf26101 ("xfs: move on-disk inode allocation out of xfs_ialloc()"),
xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments.
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Got a report that a repeated crash test of a container host would
eventually fail with a log recovery error preventing the system from
mounting the root filesystem. It manifested as a directory leaf node
corruption on writeback like so:
XFS (loop0): Mounting V5 Filesystem
XFS (loop0): Starting recovery (logdev: internal)
XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b ........=.......
00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc .......X...)....
00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23 ..x..~J}.S...G.#
00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00 .........C......
00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a ................
00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50 .5y....0.......P
00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4 .@.......A......
00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c .b.......P!A....
XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed
Tracing indicated that we were recovering changes from a transaction
at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
That is, log recovery was overwriting a buffer with newer changes on
disk than was in the transaction. Tracing indicated that we were
hitting the "recovery immediately" case in
xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
buffer.
The code was extracting the LSN correctly, then ignoring it because
the UUID in the buffer did not match the superblock UUID. The
problem arises because the UUID check uses the wrong UUID - it
should be checking the sb_meta_uuid, not sb_uuid. This filesystem
has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the
correct matching sb_meta_uuid in it, it's just the code checked it
against the wrong superblock uuid.
The is no corruption in the filesystem, and failing to recover the
buffer due to a write verifier failure means the recovery bug did
not propagate the corruption to disk. Hence there is no corruption
before or after this bug has manifested, the impact is limited
simply to an unmountable filesystem....
This was missed back in 2015 during an audit of incorrect sb_uuid
usage that resulted in commit fcfbe2c4ef ("xfs: log recovery needs
to validate against sb_meta_uuid") that fixed the magic32 buffers to
validate against sb_meta_uuid instead of sb_uuid. It missed the
magicda buffers....
Fixes: ce748eaa65 ("xfs: create new metadata UUID field and incompat flag")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When xfs_scrub encounters a directory with a leaf1 block, it tries to
validate that the leaf1 block's bestcount (aka the best free count of
each directory data block) is the correct size. Previously, this author
believed that comparing bestcount to the directory isize (since
directory data blocks are under isize, and leaf/bestfree blocks are
above it) was sufficient.
Unfortunately during testing of online repair, it was discovered that it
is possible to create a directory with a hole between the last directory
block and isize. The directory code seems to handle this situation just
fine and xfs_repair doesn't complain, which effectively makes this quirk
part of the disk format.
Fix the check to work properly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose. Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:
mount <filesystem> # (0)
fsstress <run only readonly ops> & # (1)
while true; do
fsstress <run all ops>
mount -o remount,ro # (2)
fsstress <run only readonly ops>
mount -o remount,rw # (3)
done
When (2) happens, notice that (1) is still running. xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).
When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata. If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.
The results are catastrophic -- file (B) and the refcount btree are now
corrupt. In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork. In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.
As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them. This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging events that may have been left behind
during an unclean shutdown. As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.
Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time. While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.
Fixes: 174edb0e46 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Ian Kent reported that for inline symlinks, it's possible for
vfs_readlink to hang on to the target buffer returned by
_vn_get_link_inline long after it's been freed by xfs inode reclaim.
This is a layering violation -- we should never expose XFS internals to
the VFS.
When the symlink has a remote target, we allocate a separate buffer,
copy the internal information, and let the VFS manage the new buffer's
lifetime. Let's adapt the inline code paths to do this too. It's
less efficient, but fixes the layering violation and avoids the need to
adapt the if_data lifetime to rcu rules. Clearly I don't care about
readlink benchmarks.
As a side note, this fixes the minor locking violation where we can
access the inode data fork without taking any locks; proper locking (and
eliminating the possibility of having to switch inode_operations on a
live inode) is essential to online repair coordinating repairs
correctly.
Reported-by: Ian Kent <raven@themaw.net>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Prior to commit 40b52225e5 ("xfs: remove support for disabling quota
accounting on a mounted file system"), we used the quotaoff mutex to
protect dquot operations against quotaoff trying to pull down dquots as
part of disabling quota.
Now that we only support turning off quota enforcement, the quotaoff
mutex only protects changes in m_qflags/sb_qflags. We don't need it to
protect dquots, which means we can remove it from setqlimits and the
dquot scrub code. While we're at it, fix the function that forces
quotacheck, since it should have been taking the quotaoff mutex.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
While debugging some very strange rmap corruption reports in connection
with the online directory repair code. I root-caused the error to the
following incorrect sequence:
<start repair transaction>
<expand directory, causing a deferred rmap to be queued>
<roll transaction>
<cancel transaction>
Obviously, we should have committed the transaction instead of
cancelling it. Thinking more broadly, however, xfs_trans_cancel should
have warned us that we were throwing away work item that we already
committed to performing. This is not correct, and we need to shut down
the filesystem.
Change xfs_trans_cancel to complain in the loudest manner if we're
cancelling any transaction with deferred work items attached.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmG9V/0ACgkQiiy9cAdy
T1GbQAv+N/2p7mMzhxcB91xlU6DSo2kZhC1FcmiXld3BEiMzkef6Rtb/lzW1AtBP
kNgqjVU7MMmdFQNKLhrHUV2BNCtLWKOC2I0xTZqr+nBlGMJkiNUBzUWW8DcnLd1e
zGky7Wck8HHu7LdnZA3TVlPhNV9KjdlHMEYBBJFlOc779a/Q0KyHLdwhP9mpSa2d
QrVRHkBMylNUqF4WiLKQ0I8eJFhKtd7pJNAW1qGeKe2YDV6kZQiXwHOLIZb7SHET
9Qtj83IwsQ13cLTmb7tGibj0Fn+GEEqgBv40PjSMuTCoiCpfhcLrx0lWT5hxlxao
9w3Yw1pPPaJXbxCmL9b3Bqm3vnNqMaN7HqxsuGIFwnulGFB4GJY2l+ZQPcTHzWG0
oCHlimlixH7RKVEfjIDhidC1C0ReZI8/1XuHPGE17Ysxt6L7dH/a7J2y7SlIdYTZ
ZVvVT0Kib74I7ZUJH1ymc4MdAVnBIkiTGOrQk/FU+cHOyzMUpNcwXiCqkvF77QUa
hDW4hkzE
=MW5I
-----END PGP SIGNATURE-----
Merge tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Two cifs/smb3 fixes, one fscache related, and one mount parsing
related for stable"
* tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: sanitize multiple delimiters in prepath
cifs: ignore resource_id while getting fscache super cookie
If a client sends a READDIR count argument that is too small (say,
zero), then the buffer size calculation in the new init_dirlist
helper functions results in an underflow, allowing the XDR stream
functions to write beyond the actual buffer.
This calculation has always been suspect. NFSD has never sanity-
checked the READDIR count argument, but the old entry encoders
managed the problem correctly.
With the commits below, entry encoding changed, exposing the
underflow to the pointer arithmetic in xdr_reserve_space().
Modern NFS clients attempt to retrieve as much data as possible
for each READDIR request. Also, we have no unit tests that
exercise the behavior of READDIR at the lower bound of @count
values. Thus this case was missed during testing.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Fixes: f5dcccd647 ("NFSD: Update the NFSv2 READDIR entry encoder to use struct xdr_stream")
Fixes: 7f87fc2d34 ("NFSD: Update NFSv3 READDIR entry encoders to use struct xdr_stream")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
These methods indirect the actual DAX read/write path. In the end pmem
uses magic flush and mc safe variants and fuse and dcssblk use plain ones
while device mapper picks redirects to the underlying device.
Add set_dax_nocache() and set_dax_nomc() APIs to control which copy
routines are used to remove indirect call from the read/write fast path
as well as a lot of boilerplate code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs]
Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and
just let the drivers call set_dax_synchronous directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that iomap has been converted, XFS is large folio safe.
Indicate to the VFS that it can now create large folios for XFS.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
If we're punching a hole in a large folio, we need to remove the
per-folio iomap data as the folio is about to be split and each page will
need its own. If a dirty folio is only partially-uptodate, the iomap
data contains the information about which blocks cannot be written back,
so assert that a dirty folio is fully uptodate.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
The arguments are still pages for now, but we can use folios internally
and cut out a lot of calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
We still iterate one block at a time, but now we call compound_head()
less often.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Rename end_offset to end_pos and offset_into_page to poff to match the
rest of the file. Simplify the handling of the last page straddling
i_size by doing the EOF check based on the byte granularity i_size
instead of converting to a pgoff prematurely.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Rename end_offset to end_pos and file_offset to pos to match the rest
of the file. Simplify the loop by calculating nblocks up front instead
of each time around the loop.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
XFS has the only implementation of ->discard_page today, so convert it
to use folios in the same patch as converting the API.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
This conversion is only safe because iomap only supports writes to inline
data which starts at the beginning of the file.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
These functions still only work in PAGE_SIZE chunks, but there are
fewer conversions from tail to head pages as a result of this patch.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
The zero iterator can work in folio-sized chunks instead of page-sized
chunks. This will save a lot of page cache lookups if the file is cached
in large folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
In the future, we want write_begin to know the entire length of the
write so that it can choose to allocate large folios. Pass the full
length in from __iomap_zero_iter() and limit it where necessary.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
One fix and one trivial update for rc6:
* Add MODULE_ALIAS_FS to get automatic module loading on mount (from
Naohiro)
* Update Damien's email address in the MAINTAINERS file (from me).
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYb0hkgAKCRDdoc3SxdoY
dlciAP4lGpsiFcO7TdLY2W64EHgIkcstGx1UitsqBTR5iZnp6QD/bAfaHOaTNQDG
Nr8GznsB3di4WRbFoV7BhBsKFcC4cgA=
=GWY0
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs fixes from Damien Le Moal:
"One fix and one trivial update for rc6:
- Add MODULE_ALIAS_FS to get automatic module loading on mount
(Naohiro)
- Update Damien's email address in the MAINTAINERS file (me)"
* tag 'zonefs-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
MAITAINERS: Change zonefs maintainer email address
zonefs: add MODULE_ALIAS_FS
According to the official Microsoft MS-SMB2 document section 3.3.5.4, this
flag should be used only for 3.0 and 3.0.2 dialects. Setting it for 3.1.1
is a violation of the specification.
This causes my Windows 10 client to detect an anomaly in the negotiation,
and disable encryption entirely despite being explicitly enabled in ksmbd,
causing all data transfers to go in plain text.
Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Marcos Del Sol Vives <marcos@orca.pet>
Signed-off-by: Steve French <stfrench@microsoft.com>
mount.cifs can pass a device with multiple delimiters in it. This will
cause rename(2) to fail with ENOENT.
V2:
- Make sanitize_path more readable.
- Fix multiple delimiters between UNC and prepath.
- Avoid a memory leak if a bad user starts putting a lot of delimiters
in the path on purpose.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2031200
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Cc: stable@vger.kernel.org # 5.11+
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We have a cyclic dependency between fscache super cookie
and root inode cookie. The super cookie relies on
tcon->resource_id, which gets populated from the root inode
number. However, fetching the root inode initializes inode
cookie as a child of super cookie, which is yet to be populated.
resource_id is only used as auxdata to check the validity of
super cookie. We can completely avoid setting resource_id to
remove the circular dependency. Since vol creation time and
vol serial numbers are used for auxdata, we should be fine.
Additionally, there will be auxiliary data check for each
inode cookie as well.
Fixes: 5bf91ef03d ("cifs: wait for tcon resource_id before getting fscache super")
CC: David Howells <dhowells@redhat.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmG8+tEACgkQxWXV+ddt
WDuuGA/9E75ZMqsMLW5az7z8Rt5voBjPeweyRHmGCLZKpgfaj0QjrJRvu0CTKU/W
zCSQf+ShTTY2D3cmh1eEwKyX/waKQ71qBrMX/SgIeA0OjmlhK1UGB18MF5sAVGCB
mymVYJh7IntYJE7S7OiMUL/yILmIWZYrYT+iaPZlIc9M6h0b1gjMIsE0VEmxJMCN
X8RAQ4CfL9bpTTKItNehSyXx+J7TB5yamh5AspaiB/ivyN1DcUcsFf3AoaWXeh2D
YIBzq4WbGnDMfUdWXKE2rqDfQgaTXtN9ffGUvphJnegg8Tqfp29LyLZ1GU0qGSXc
/K8g5QNmM3nhubXq2MG5zfbHPJ1H2CgnvkDqiCcyeop+09yj/ugxTt+ULaIbJL76
pKSpcuIFXTmoW2Z7ZwijIEX4H5Dgk2l2DbE8SkJT4LjJybgpHfBT1KDQrj5iQdx+
XgmG/CbRELuGGltJNuldp0SqIyMNRgDuv6Rheg9N9H73m9epwfH5oiM0Fj/FYyQ6
lfgle6DQCP4xaDmk1zA9zrJHTUqi8Caeyg+tQYT6AbkoeCoXnvEAPgv9OOGe1M+C
Ks7zeAseWs3A/j/+wCdiCKombOfR+AY3RGkPzlodUJj4YYOTyXrigtb5yhTz6Zdv
ozVBZ71LUMMOf0NV45mGtqsiLqyfe3cnlqj1XtHQKaajyjgHvW8=
=G7CE
-----END PGP SIGNATURE-----
Merge tag 'for-5.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes, almost all error handling one-liners and for stable.
- regression fix in directory logging items
- regression fix of extent buffer status bits handling after an error
- fix memory leak in error handling path in tree-log
- fix freeing invalid anon device number when handling errors during
subvolume creation
- fix warning when freeing leaf after subvolume creation failure
- fix missing blkdev put in device scan error handling
- fix invalid delayed ref after subvolume creation failure"
* tag 'for-5.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix missing blkdev_put() call in btrfs_scan_one_device()
btrfs: fix warning when freeing leaf after subvolume creation failure
btrfs: fix invalid delayed ref after subvolume creation failure
btrfs: check WRITE_ERR when trying to read an extent buffer
btrfs: fix missing last dir item offset update when logging directory
btrfs: fix double free of anon_dev after failure to create subvolume
btrfs: fix memory leak in __add_inode_ref()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG8wLIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpncuD/wI9UnDapmyWodpXCw+f/GN0ZfuGzKAblkR
yYJcA3spTIYHHZiR4KCKke1vM+j8PoPBTGwCuDX9EhHFq9EIEAAAFqQwbta85Rcu
c3iKhb7qWGN2VQwqdmQDZtegdblM5usNsY3fNUA3iaGeQy+7qvYEyk0u5CDSIKCQ
oR0NQpE/cHYlyoqUbITlCvOwD32QvcuHK2oCBnGW02FYiKq3gddJ3EPxnBkLcTUI
zeHRCtslZTFrZexapQ/w4rAJ0O6bMR/08TYjtRDxLB4bTpYR34Hx23tvEsZk4jqE
olpUlJDwV2B+vZ1aAKqCrabbkNQpiws7Y1EMSwM1pqqZYaPhpVeEmcqkpDNsSgY2
s764iXPKjJnKB6XWUuuShobOjiwwk0rAyNdWKVdvzNVkEnhpNnFVmfllChOiphqM
jHua88LHpRJpmHSDdfJMO9qT1uffixv1AIfhEoc0hBCddwp9RwqhijelbJDYIjou
KKE7aip0rzwK90tOTFGP9lD7cUl3aFig3yQNLaJpXybm2Qa+xE//hdUk4lm1i57c
ciEYtz8S827lCx/CzhpWoTszG11cpT249IMXyxtspoZZh44VTXYC83EeYI2cyG24
q8zfmgMLNMYn6dc4dP5oAA/d0r7s9AY9GvyblhwmLGpjO5mANqJPZAtl2PpSYWg6
jVh1NDEyKw==
=IY1e
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.16-2021-12-17' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a single fix, fixing an issue with the worker creation change
that was merged last week"
* tag 'io_uring-5.16-2021-12-17' of git://git.kernel.dk/linux-block:
io-wq: drop wqe lock before creating new worker
Add MODULE_ALIAS_FS() to load the module automatically when you do "mount
-t zonefs".
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: stable <stable@vger.kernel.org> # 5.6+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
If we write to any page in a folio, we have to mark the entire
folio as dirty, and potentially COW the entire folio, because it'll
all get written back as one unit.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Handle folios of arbitrary size instead of working in PAGE_SIZE units.
readahead_folio() decreases the page refcount for you, so this is not
quite a mechanical change.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
We still only support up to a single page of inline data (at least,
per call to iomap_read_inline_data()), but it can now be written into
the middle of a folio in case we decide to allocate a 16KiB page for
a file that's 8.1KiB in size.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pass a folio around instead of the page, and make sure the offset
is relative to the start of the folio instead of the start of a page.
Also use size_t for offset & length to make it clear that these are byte
counts, and to support >2GB folios in the future.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Use bio_for_each_folio() to iterate over each folio in the bio
instead of iterating over each page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
All but one caller already has the iomap_page, so we can avoid getting
it again.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Keep iomap_invalidatepage around as a wrapper for use in address_space
operations.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
This is an address_space operation, so its argument must remain as a
struct page, but we can use a folio internally.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
iomap_page_release() was also assuming that it was being passed a
head page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This function already assumed it was being passed a head page, so
just formalise that.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The big comment about only using a head page can go away now that
it takes a folio argument.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There are no plans to convert buffer_head infrastructure to use large
folios, but __block_write_begin_int() is called from iomap, and it's
more convenient and less error-prone if we pass in a folio from iomap.
It also has a nice saving of almost 200 bytes of code from removing
repeated calls to compound_head().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
No check for if "rc" is an error code for build_sec_desc().
This can cause problems with using uninitialized pntsd_size.
Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org # v5.15
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This is a failure path and it should return -EINVAL instead of success.
Otherwise it could result in the caller using uninitialized memory.
Fixes: 303fff2b8c ("ksmbd: add validation for ndr read/write functions")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
accounting fix and two fixups to appease static checkers.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmG54aMTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzixPBB/4rf0d+oJbBNQQIjTVSO09G9A5pxnU1
CY++W5xppv2hQQX8Czh8iLLYSP3tRh+1xpAlKT+DFsRI3GTh5tXiPf7Mpu76noo3
UAKxiz5NmtqiwrHKZtpsQEJlmwtAFa+qFVJIOiqq1l+gFeiYMueoFEGDgdSCLJrE
P6TL9CZx06Yf+fef7hJnXKR3YM0qras1jV6Adpvu+2g/ciaThzJF3YdyAFYFp7nl
sGnPTXlJ1DDIsiOAgj0oGeQICpjQJOSGNZtxVjhJx5dJjqgKjim0V4MIShqsW/tg
n8LzjrdGh/lR2QfCgOfYmvFrVTeb9QQyF57sT034rGrP6VXlzCkALPi3
=FQag
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.16-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"An SGID directory handling fix (marked for stable), a metrics
accounting fix and two fixups to appease static checkers"
* tag 'ceph-for-5.16-rc6' of git://github.com/ceph/ceph-client:
ceph: fix up non-directory creation in SGID directories
ceph: initialize pathlen variable in reconnect_caps_cb
ceph: initialize i_size variable in ceph_sync_read
ceph: fix duplicate increment of opened_inodes metric
The function btrfs_scan_one_device() calls blkdev_get_by_path() and
blkdev_put() to get and release its target block device. However, when
btrfs_sb_log_location_bdev() fails, blkdev_put() is not called and the
block device is left without clean up. This triggered failure of fstests
generic/085. Fix the failure path of btrfs_sb_log_location_bdev() to
call blkdev_put().
Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
insert the new root's root item into the root tree, we are freeing the
metadata extent we reserved for the new root to prevent a metadata
extent leak, as we don't abort the transaction at that point (since
there is nothing at that point that is irreversible).
However we allocated the metadata extent for the new root which we are
creating for the new subvolume, so its delayed reference refers to the
ID of this new root. But when we free the metadata extent we pass the
root of the subvolume where the new subvolume is located to
btrfs_free_tree_block() - this is incorrect because this will generate
a delayed reference that refers to the ID of the parent subvolume's root,
and not to ID of the new root.
This results in a failure when running delayed references that leads to
a transaction abort and a trace like the following:
[3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
[3868.739857] Code: 68 0f 85 e6 fb ff (...)
[3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
[3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
[3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
[3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
[3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
[3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
[3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
[3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
[3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3868.757803] Call Trace:
[3868.758281] <TASK>
[3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
[3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
[3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs]
[3868.762069] ? lock_acquired+0x19f/0x420
[3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs]
[3868.763860] ? _raw_spin_unlock+0x29/0x40
[3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
[3868.765870] create_subvol+0x1d8/0x9a0 [btrfs]
[3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs]
[3868.767669] ? preempt_count_add+0x49/0xa0
[3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
[3868.769639] ? _copy_from_user+0x66/0xa0
[3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
[3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs]
[3868.772364] ? __slab_free+0x10a/0x360
[3868.773198] ? rcu_read_lock_sched_held+0x12/0x60
[3868.774121] ? lock_release+0x223/0x4a0
[3868.774863] ? lock_acquired+0x19f/0x420
[3868.775634] ? rcu_read_lock_sched_held+0x12/0x60
[3868.776530] ? trace_hardirqs_on+0x1b/0xe0
[3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60
[3868.778280] ? kmem_cache_free+0x321/0x3c0
[3868.779011] ? __x64_sys_ioctl+0x83/0xb0
[3868.779718] __x64_sys_ioctl+0x83/0xb0
[3868.780387] do_syscall_64+0x3b/0xc0
[3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae
[3868.781953] RIP: 0033:0x7f281c59e957
[3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
[3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
[3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
[3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
[3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
[3868.793765] </TASK>
[3868.794037] irq event stamp: 0
[3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
[3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
[3868.799284] ---[ end trace be24c7002fe27747 ]---
[3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
[3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
[3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33
[3868.802863] extent refs 1 gen 1265 flags 2
[3868.803447] ref#0: tree block backref root 1610
(...)
[3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33
[3869.065421] extent refs 1 gen 1268 flags 2
[3869.066115] ref#0: tree block backref root 1689
(...)
[3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0
[3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
[3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry
Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
This requires changing the root argument of btrfs_free_tree_block() from
struct btrfs_root * to a u64, since at this point during the subvolume
creation we have not yet created the struct btrfs_root for the new
subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
else from a struct btrfs_root.
This was triggered by test case generic/475 from fstests.
Fixes: 67addf2900 ("btrfs: fix metadata extent leak after failure to create subvolume")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe reported a hang when we have errors on btrfs. This turned out to
be a side-effect of my fix c2e3930529 ("btrfs: clear extent buffer
uptodate when we fail to write it") which made it so we clear
EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out.
Below is a paste of Filipe's analysis he got from using drgn to debug
the hang
"""
btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to
a value while writeback of all pages has not yet completed:
--> writeback for the first 3 pages finishes, we clear
EXTENT_BUFFER_UPTODATE from eb on the first page when we get an
error.
--> at this point eb->io_pages is 1 and we cleared Uptodate bit from the
first 3 pages
--> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so
it continues, it's able to lock the pages since we obviously don't
hold the pages locked during writeback
--> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets
eb->io_pages to 3, since only the first page does not have Uptodate
bit set at this point
--> writeback for the remaining page completes, we ended decrementing
eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore
never calling end_extent_buffer_writeback(), so
EXTENT_BUFFER_WRITEBACK remains in the eb's flags
--> of course, when the read bio completes, it doesn't and shouldn't
call end_extent_buffer_writeback()
--> we should clear EXTENT_BUFFER_UPTODATE only after all pages of
the eb finished writeback? or maybe make the read pages code
wait for writeback of all pages of the eb to complete before
checking which pages need to be read, touch ->io_pages, submit
read bio, etc
writeback bit never cleared means we can hang when aborting a
transaction, at:
btrfs_cleanup_one_transaction()
btrfs_destroy_marked_extents()
wait_on_extent_buffer_writeback()
"""
This is a problem because our writes are not synchronized with reads in
any way. We clear the UPTODATE flag and then we can easily come in and
try to read the EB while we're still waiting on other bio's to
complete.
We have two options here, we could lock all the pages, and then check to
see if eb->io_pages != 0 to know if we've already got an outstanding
write on the eb.
Or we can simply check to see if we have WRITE_ERR set on this extent
buffer. We set this bit _before_ we clear UPTODATE, so if the read gets
triggered because we aren't UPTODATE because of a write error we're
guaranteed to have WRITE_ERR set, and in this case we can simply return
-EIO. This will fix the reported hang.
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: c2e3930529 ("btrfs: clear extent buffer uptodate when we fail to write it")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
FAN_RENAME is the successor of FAN_MOVED_FROM and FAN_MOVED_TO
and can be used to get the old and new parent+name information in
a single event.
FAN_MOVED_FROM and FAN_MOVED_TO are still supported for backward
compatibility, but it makes little sense to use them together with
FAN_RENAME in the same group.
FAN_RENAME uses special info type records to report the old and
new parent+name, so reporting only old and new parent id is less
useful and was not implemented.
Therefore, FAN_REANAME requires a group with flag FAN_REPORT_NAME.
Link: https://lore.kernel.org/r/20211129201537.1932819-12-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In the special case of FAN_RENAME event, we report old or new or both
old and new parent+name.
A single info record will be reported if either the old or new dir
is watched and two records will be reported if both old and new dir
(or their filesystem) are watched.
The old and new parent+name are reported using new info record types
FAN_EVENT_INFO_TYPE_{OLD,NEW}_DFID_NAME, so if a single info record
is reported, it is clear to the application, to which dir entry the
fid+name info is referring to.
Link: https://lore.kernel.org/r/20211129201537.1932819-11-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
We do not want to report the dirfid+name of a directory whose
inode/sb are not watched, because watcher may not have permissions
to see the directory content.
Use an internal iter_info to indicate to fanotify_alloc_event()
which marks of this group are watching FAN_RENAME, so it can decide
if we need to record only the old parent+name, new parent+name or both.
Link: https://lore.kernel.org/r/20211129201537.1932819-10-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
[JK: Modified code to pass around only mask of mark types matching
generated event]
Signed-off-by: Jan Kara <jack@suse.cz>
Allow storing a secondary dir fh and name tupple in fanotify_info.
This will be used to store the new parent and name information in
FAN_RENAME event.
Link: https://lore.kernel.org/r/20211129201537.1932819-8-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
fanotify_info buffer is parceled into variable sized records, so the
records must be written in order: dir_fh, file_fh, name.
Use helpers to assert that order and make fanotify_alloc_name_event()
a bit more generic to allow empty dir_fh record and to allow expanding
to more records (i.e. name2) soon.
Link: https://lore.kernel.org/r/20211129201537.1932819-7-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The fanotify_info buffer contains up to two file handles and a name.
Use macros to simplify the code that access the different items within
the buffer.
Add assertions to verify that stored fh len and name len do not overflow
the u8 stored value in fanotify_info header.
Remove the unused fanotify_info_len() helper.
Link: https://lore.kernel.org/r/20211129201537.1932819-6-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The dnotify FS_DN_RENAME event is used to request notification about
a move within the same parent directory and was always coupled with
the FS_MOVED_FROM event.
Rename the FS_DN_RENAME event flag to FS_RENAME, decouple it from
FS_MOVED_FROM and report it with the moved dentry instead of the moved
inode, so it has the information about both old and new parent and name.
Generate the FS_RENAME event regardless of same parent dir and apply
the "same parent" rule in the generic fsnotify_handle_event() helper
that is used to call backends with ->handle_inode_event() method
(i.e. dnotify). The ->handle_inode_event() method is not rich enough to
report both old and new parent and name anyway.
The enriched event is reported to fanotify over the ->handle_event()
method with the old and new dir inode marks in marks array slots for
ITER_TYPE_INODE and a new iter type slot ITER_TYPE_INODE2.
The enriched event will be used for reporting old and new parent+name to
fanotify groups with FAN_RENAME events.
Link: https://lore.kernel.org/r/20211129201537.1932819-5-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
FAN_REPORT_FID is ambiguous in that it reports the fid of the child for
some events and the fid of the parent for create/delete/move events.
The new FAN_REPORT_TARGET_FID flag is an implicit request to report
the fid of the target object of the operation (a.k.a the child inode)
also in create/delete/move events in addition to the fid of the parent
and the name of the child.
To reduce the test matrix for uninteresting use cases, the new
FAN_REPORT_TARGET_FID flag requires both FAN_REPORT_NAME and
FAN_REPORT_FID. The convenience macro FAN_REPORT_DFID_NAME_TARGET
combines FAN_REPORT_TARGET_FID with all the required flags.
Link: https://lore.kernel.org/r/20211129201537.1932819-4-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
They are two different types that use the same enum, so this confusing.
Use the object type to indicate the type of object mark is attached to
and the iter type to indicate the type of watch.
A group can have two different watches of the same object type (parent
and child watches) that match the same event.
Link: https://lore.kernel.org/r/20211129201537.1932819-3-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
In preparation for separating object type from iterator type, rename
some 'type' arguments in functions to 'obj_type' and remove the unused
interface to clear marks by object type mask.
Link: https://lore.kernel.org/r/20211129201537.1932819-2-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
When memory allocation of iinfo or block allocation fails, already
allocated struct udf_inode_info gets freed with iput() and
udf_evict_inode() may look at inode fields which are not properly
initialized. Fix it by marking inode bad before dropping reference to it
in udf_new_inode().
Reported-by: syzbot+9ca499bb57a2b9e4c652@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
There is a potential deadlock between writeback process and a process
performing write_begin() or write_cache_pages() while trying to write
same compress file, but not compressable, as below:
[Process A] - doing checkpoint
[Process B] [Process C]
f2fs_write_cache_pages()
- lock_page() [all pages in cluster, 0-31]
- f2fs_write_multi_pages()
- f2fs_write_raw_pages()
- f2fs_write_single_data_page()
- f2fs_do_write_data_page()
- return -EAGAIN [f2fs_trylock_op() failed]
- unlock_page(page) [e.g., page 0]
- generic_perform_write()
- f2fs_write_begin()
- f2fs_prepare_compress_overwrite()
- prepare_compress_overwrite()
- lock_page() [e.g., page 0]
- lock_page() [e.g., page 1]
- lock_page(page) [e.g., page 0]
Since there is no compress process, it is no longer necessary to hold
locks on every pages in cluster within f2fs_write_raw_pages().
This patch changes f2fs_write_raw_pages() to release all locks first
and then perform write same as the non-compress file in
f2fs_write_cache_pages().
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Hyeong-Jun Kim <hj514.kim@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Android OTA failed due to SBI_NEED_FSCK flag when pinning the file. Let's avoid
it since we can do in-place-updates.
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When logging a directory, once we finish processing a leaf that is full
of dir items, if we find the next leaf was not modified in the current
transaction, we grab the first key of that next leaf and log it as to
mark the end of a key range boundary.
However we did not update the value of ctx->last_dir_item_offset, which
tracks the offset of the last logged key. This can result in subsequent
logging of the same directory in the current transaction to not realize
that key was already logged, and then add it to the middle of a batch
that starts with a lower key, resulting later in a leaf with one key
that is duplicated and at non-consecutive slots. When that happens we get
an error later when writing out the leaf, reporting that there is a pair
of keys in wrong order. The report is something like the following:
Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
root=18446744073709551610 block=118444032 slot=21, bad key order, prev
(704687 84 4146773349) current (704687 84 1063561078)
Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
91449 total ptrs 39 free space 546 owner 18446744073709551610
Dec 13 21:44:50 kernel: item 0 key (704687 1 0) itemoff 3835
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35532 size
1026 mode 40755
Dec 13 21:44:50 kernel: item 1 key (704687 12 704685) itemoff
3822 itemsize 13
Dec 13 21:44:50 kernel: item 2 key (704687 24 3817753667)
itemoff 3736 itemsize 86
Dec 13 21:44:50 kernel: item 3 key (704687 60 0) itemoff 3728 itemsize 8
Dec 13 21:44:50 kernel: item 4 key (704687 72 0) itemoff 3720 itemsize 8
Dec 13 21:44:50 kernel: item 5 key (704687 84 140445108)
itemoff 3666 itemsize 54
Dec 13 21:44:50 kernel: dir oid 704793 type 1
Dec 13 21:44:50 kernel: item 6 key (704687 84 298800632)
itemoff 3599 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707849 type 2
Dec 13 21:44:50 kernel: item 7 key (704687 84 476147658)
itemoff 3532 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707901 type 2
Dec 13 21:44:50 kernel: item 8 key (704687 84 633818382)
itemoff 3471 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704694 type 2
Dec 13 21:44:50 kernel: item 9 key (704687 84 654256665)
itemoff 3403 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707841 type 1
Dec 13 21:44:50 kernel: item 10 key (704687 84 995843418)
itemoff 3331 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167736 type 1
Dec 13 21:44:50 kernel: item 11 key (704687 84 1063561078)
itemoff 3278 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 12 key (704687 84 1101156010)
itemoff 3225 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704696 type 1
Dec 13 21:44:50 kernel: item 13 key (704687 84 2521936574)
itemoff 3173 itemsize 52
Dec 13 21:44:50 kernel: dir oid 704704 type 2
Dec 13 21:44:50 kernel: item 14 key (704687 84 2618368432)
itemoff 3112 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704738 type 1
Dec 13 21:44:50 kernel: item 15 key (704687 84 2676316190)
itemoff 3046 itemsize 66
Dec 13 21:44:50 kernel: dir oid 2167729 type 1
Dec 13 21:44:50 kernel: item 16 key (704687 84 3319104192)
itemoff 2986 itemsize 60
Dec 13 21:44:50 kernel: dir oid 704745 type 2
Dec 13 21:44:50 kernel: item 17 key (704687 84 3908046265)
itemoff 2929 itemsize 57
Dec 13 21:44:50 kernel: dir oid 2167734 type 1
Dec 13 21:44:50 kernel: item 18 key (704687 84 3945713089)
itemoff 2857 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167730 type 1
Dec 13 21:44:50 kernel: item 19 key (704687 84 4077169308)
itemoff 2795 itemsize 62
Dec 13 21:44:50 kernel: dir oid 704688 type 1
Dec 13 21:44:50 kernel: item 20 key (704687 84 4146773349)
itemoff 2727 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707892 type 1
Dec 13 21:44:50 kernel: item 21 key (704687 84 1063561078)
itemoff 2674 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 22 key (704687 96 2) itemoff 2612
itemsize 62
Dec 13 21:44:50 kernel: item 23 key (704687 96 6) itemoff 2551
itemsize 61
Dec 13 21:44:50 kernel: item 24 key (704687 96 7) itemoff 2498
itemsize 53
Dec 13 21:44:50 kernel: item 25 key (704687 96 12) itemoff
2446 itemsize 52
Dec 13 21:44:50 kernel: item 26 key (704687 96 14) itemoff
2385 itemsize 61
Dec 13 21:44:50 kernel: item 27 key (704687 96 18) itemoff
2325 itemsize 60
Dec 13 21:44:50 kernel: item 28 key (704687 96 24) itemoff
2271 itemsize 54
Dec 13 21:44:50 kernel: item 29 key (704687 96 28) itemoff
2218 itemsize 53
Dec 13 21:44:50 kernel: item 30 key (704687 96 62) itemoff
2150 itemsize 68
Dec 13 21:44:50 kernel: item 31 key (704687 96 66) itemoff
2083 itemsize 67
Dec 13 21:44:50 kernel: item 32 key (704687 96 75) itemoff
2015 itemsize 68
Dec 13 21:44:50 kernel: item 33 key (704687 96 79) itemoff
1948 itemsize 67
Dec 13 21:44:50 kernel: item 34 key (704687 96 82) itemoff
1882 itemsize 66
Dec 13 21:44:50 kernel: item 35 key (704687 96 83) itemoff
1810 itemsize 72
Dec 13 21:44:50 kernel: item 36 key (704687 96 85) itemoff
1753 itemsize 57
Dec 13 21:44:50 kernel: item 37 key (704687 96 87) itemoff
1681 itemsize 72
Dec 13 21:44:50 kernel: item 38 key (704694 1 0) itemoff 1521
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35534 size 30
mode 40755
Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
write time tree block corruption detected
So fix that by adding the missing update of ctx->last_dir_item_offset with
the offset of the boundary key.
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
Fixes: dc2872247e ("btrfs: keep track of the last logged keys when logging a directory")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a subvolume, at create_subvol(), we allocate an anonymous
device and later call btrfs_get_new_fs_root(), which in turn just calls
btrfs_get_root_ref(). There we call btrfs_init_fs_root() which assigns
the anonymous device to the root, but if after that call there's an error,
when we jump to 'fail' label, we call btrfs_put_root(), which frees the
anonymous device and then returns an error that is propagated back to
create_subvol(). Than create_subvol() frees the anonymous device again.
When this happens, if the anonymous device was not reallocated after
the first time it was freed with btrfs_put_root(), we get a kernel
message like the following:
(...)
[13950.282466] BTRFS: error (device dm-0) in create_subvol:663: errno=-5 IO failure
[13950.283027] ida_free called for id=65 which is not allocated.
[13950.285974] BTRFS info (device dm-0): forced readonly
(...)
If the anonymous device gets reallocated by another btrfs filesystem
or any other kernel subsystem, then bad things can happen.
So fix this by setting the root's anonymous device to 0 at
btrfs_get_root_ref(), before we call btrfs_put_root(), if an error
happened.
Fixes: 2dfb1e43f5 ("btrfs: preallocate anon block device at first phase of snapshot creation")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
but when the function returns in line 1184 (#4) victim_name allocated
by line 1169 (#3) is not freed, which will lead to a memory leak.
There is a similar snippet of code in this function as allocating a memory
chunk for victim_name in line 1104 (#1) as well as releasing the memory
in line 1116 (#2).
We should kfree() victim_name when the return value of backref_in_log()
is less than zero and before the function returns in line 1184 (#4).
1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
1058 struct btrfs_root *root,
1059 struct btrfs_path *path,
1060 struct btrfs_root *log_root,
1061 struct btrfs_inode *dir,
1062 struct btrfs_inode *inode,
1063 u64 inode_objectid, u64 parent_objectid,
1064 u64 ref_index, char *name, int namelen,
1065 int *search_done)
1066 {
1104 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #1: kmalloc (victim_name-1)
1105 if (!victim_name)
1106 return -ENOMEM;
1112 ret = backref_in_log(log_root, &search_key,
1113 parent_objectid, victim_name,
1114 victim_name_len);
1115 if (ret < 0) {
1116 kfree(victim_name); // #2: kfree (victim_name-1)
1117 return ret;
1118 } else if (!ret) {
1169 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #3: kmalloc (victim_name-2)
1170 if (!victim_name)
1171 return -ENOMEM;
1180 ret = backref_in_log(log_root, &search_key,
1181 parent_objectid, victim_name,
1182 victim_name_len);
1183 if (ret < 0) {
1184 return ret; // #4: missing kfree (victim_name-2)
1185 } else if (!ret) {
1241 return 0;
1242 }
Fixes: d3316c8233 ("btrfs: Properly handle backref_in_log retval")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the per inode DAX hint changes while the file is still *opened*, it
is quite complicated and maybe fragile to dynamically change the DAX
state.
Hence mark the inode and corresponding dentries as DONE_CACHE once the
per inode DAX hint changes, so that the inode instance will be evicted
and freed as soon as possible once the file is closed and the last
reference to the inode is put. And then when the file gets reopened next
time, the new instantiated inode will reflect the new DAX state.
In summary, when the per inode DAX hint changes for an *opened* file, the
DAX state of the file won't be updated until this file is closed and
reopened later.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Among the FUSE_INIT phase, client shall advertise per inode DAX if it's
mounted with "dax=inode". Then server is aware that client is in per
inode DAX mode, and will construct per-inode DAX attribute accordingly.
Server shall also advertise support for per inode DAX. If server doesn't
support it while client is mounted with "dax=inode", client will
silently fallback to "dax=never" since "dax=inode" is advisory only.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
DAX may be limited in some specific situation. When the number of usable
DAX windows is under watermark, the recalim routine will be triggered to
reclaim some DAX windows. It may have a negative impact on the
performance, since some processes may need to wait for DAX windows to be
recalimed and reused then. To mitigate the performance degradation, the
overall DAX window need to be expanded larger.
However, simply expanding the DAX window may not be a good deal in some
scenario. To maintain one DAX window chunk (i.e., 2MB in size), 32KB
(512 * 64 bytes) memory footprint will be consumed for page descriptors
inside guest, which is greater than the memory footprint if it uses
guest page cache when DAX disabled. Thus it'd better disable DAX for
those files smaller than 32KB, to reduce the demand for DAX window and
thus avoid the unworthy memory overhead.
Per inode DAX feature is introduced to address this issue, by offering a
finer grained control for dax to users, trying to achieve a balance
between performance and memory overhead.
The FUSE_ATTR_DAX flag in FUSE_LOOKUP reply is used to indicate whether
DAX should be enabled or not for corresponding file. Currently the state
whether DAX is enabled or not for the file is initialized only when
inode is instantiated.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
We add 'always', 'never', and 'inode' (default). '-o dax' continues to
operate the same which is equivalent to 'always'.
The following behavior is consistent with that on ext4/xfs:
- The default behavior (when neither '-o dax' nor
'-o dax=always|never|inode' option is specified) is equal to 'inode'
mode, while 'dax=inode' won't be printed among the mount option list.
- The 'inode' mode is only advisory. It will silently fallback to 'never'
mode if fuse server doesn't support that.
Also noted that by the time of this commit, 'inode' mode is actually equal
to 'always' mode, before the per inode DAX flag is introduced in the
following patch.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is in prep for following per inode DAX checking.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Commit 054aa8d439 ("fget: check that the fd still exists after getting
a ref to it") fixed a race with getting a reference to a file just as it
was being closed. It was a fairly minimal patch, and I didn't think
re-checking the file pointer lookup would be a measurable overhead,
since it was all right there and cached.
But I was wrong, as pointed out by the kernel test robot.
The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed
quite noticeably. Admittedly it seems to be a very artificial test:
doing "poll()" system calls on regular files in a very tight loop in
multiple threads.
That means that basically all the time is spent just looking up file
descriptors without ever doing anything useful with them (not that doing
'poll()' on a regular file is useful to begin with). And as a result it
shows the extra "re-check fd" cost as a sore thumb.
Happily, the regression is fixable by just writing the code to loook up
the fd to be better and clearer. There's still a cost to verify the
file pointer, but now it's basically in the noise even for that
benchmark that does nothing else - and the code is more understandable
and has better comments too.
[ Side note: this patch is also a classic case of one that looks very
messy with the default greedy Myers diff - it's much more legible with
either the patience of histogram diff algorithm ]
Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20211213083154.GA20853@linux.intel.com/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Carel Si <beibei.si@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently have a 'laundrette' for closing cached files - a different
work-item for each network-namespace.
These 'laundrettes' (aka struct nfsd_fcache_disposal) are currently on a
list, and are freed using rcu.
The list is not necessary as we have a per-namespace structure (struct
nfsd_net) which can hold a link to the nfsd_fcache_disposal.
The use of kfree_rcu is also unnecessary as the cache is cleaned of all
files associated with a given namespace, and no new files can be added,
before the nfsd_fcache_disposal is freed.
So add a '->fcache_disposal' link to nfsd_net, and discard the list
management and rcu usage.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Commit 7142b98d9f ("nfsd: Clean up drc cache in preparation for
global spinlock elimination"), billed as a clean-up, added
be32_to_cpu() to the DRC hash function without explanation. That
commit removed two comments that state that byte-swapping in the
hash function is unnecessary without explaining whether there was
a need for that change.
On some Intel CPUs, the swab32 instruction is known to cause a CPU
pipeline stall. be32_to_cpu() does not add extra randomness, since
the hash multiplication is done /before/ shifting to the high-order
bits of the result.
As a micro-optimization, remove the unnecessary transform from the
DRC hash function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that thread management is consistent there is no need for
nfs-callback to use svc_create_pooled() as introduced in Commit
df807fffaa ("NFSv4.x/callback: Create the callback service through
svc_create_pooled"). So switch back to svc_create().
If service pools were configured, but the number of threads were left at
'1', nfs callback may not work reliably when svc_create_pooled() is used.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_set_num_threads() does everything that lockd_start_svc() does, except
set sv_maxconn. It also (when passed 0) finds the threads and
stops them with kthread_stop().
So move the setting for sv_maxconn, and use svc_set_num_thread()
We now don't need nlmsvc_task.
Now that we use svc_set_num_threads() it makes sense to set svo_module.
This request that the thread exists with module_put_and_exit().
Also fix the documentation for svo_module to make this explicit.
svc_prepare_thread is now only used where it is defined, so it can be
made static.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
lockd_create_svc() already does an svc_get() if the service already
exists, so it is more like a "get" than a "create".
So:
- Move the increment of nlmsvc_users into the function as well
- rename to lockd_get().
It is now the inverse of lockd_put().
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is some cleanup that is duplicated in lockd_down() and the failure
path of lockd_up().
Factor these out into a new lockd_put() and call it from both places.
lockd_put() does *not* take the mutex - that must be held by the caller.
It decrements nlmsvc_users and if that reaches zero, it cleans up.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The normal place to call svc_exit_thread() is from the thread itself
just before it exists.
Do this for lockd.
This means that nlmsvc_rqst is not used out side of lockd_start_svc(),
so it can be made local to that function, and renamed to 'rqst'.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
lockd_start_svc() only needs to be called once, just after the svc is
created. If the start fails, the svc is discarded too.
It thus makes sense to call lockd_start_svc() from lockd_create_svc().
This allows us to remove the test against nlmsvc_rqst at the start of
lockd_start_svc() - it must always be NULL.
lockd_up() only held an extra reference on the svc until a thread was
created - then it dropped it. The thread - and thus the extra reference
- will remain until kthread_stop() is called.
Now that the thread is created in lockd_create_svc(), the extra
reference can be dropped there. So the 'serv' variable is no longer
needed in lockd_up().
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that the network status notifiers use nlmsvc_serv rather then
nlmsvc_rqst the management can be simplified.
Notifier unregistration synchronises with any pending notifications so
providing we unregister before nlm_serv is freed no further interlock
is required.
So we move the unregister call to just before the thread is killed
(which destroys the service) and just before the service is destroyed in
the failure-path of lockd_up().
Then nlm_ntf_refcnt and nlm_ntf_wq can be removed.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
lockd has two globals - nlmsvc_task and nlmsvc_rqst - but mostly it
wants the 'struct svc_serv', and when it doesn't want it exactly it can
get to what it wants from the serv.
This patch is a first step to removing nlmsvc_task and nlmsvc_rqst. It
introduces nlmsvc_serv to store the 'struct svc_serv*'. This is set as
soon as the serv is created, and cleared only when it is destroyed.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd currently maintains an open-coded read/write semaphore (refcount
and wait queue) for each network namespace to ensure the nfs service
isn't shut down while the notifier is running.
This is excessive. As there is unlikely to be contention between
notifiers and they run without sleeping, a single spinlock is sufficient
to avoid problems.
Signed-off-by: NeilBrown <neilb@suse.de>
[ cel: ensure nfsd_notifier_lock is static ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The ->svo_setup callback serves no purpose. It is always called from
within the same module that chooses which callback is needed. So
discard it and call the relevant function directly.
Now that svc_set_num_threads() is no longer used remove it and rename
svc_set_num_threads_sync() to remove the "_sync" suffix.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd cannot currently use svc_set_num_threads_sync. It instead
uses svc_set_num_threads which does *not* wait for threads to all
exit, and has a separate mechanism (nfsd_shutdown_complete) to wait
for completion.
The reason that nfsd is unlike other services is that nfsd threads can
exit separately from svc_set_num_threads being called - they die on
receipt of SIGKILL. Also, when the last thread exits, the service must
be shut down (sockets closed).
For this, the nfsd_mutex needs to be taken, and as that mutex needs to
be held while svc_set_num_threads is called, the one cannot wait for
the other.
This patch changes the nfsd thread so that it can drop the ref on the
service without blocking on nfsd_mutex, so that svc_set_num_threads_sync
can be used:
- if it can drop a non-last reference, it does that. This does not
trigger shutdown and does not require a mutex. This will likely
happen for all but the last thread signalled, and for all threads
being shut down by nfsd_shutdown_threads()
- if it can get the mutex without blocking (trylock), it does that
and then drops the reference. This will likely happen for the
last thread killed by SIGKILL
- Otherwise there might be an unrelated task holding the mutex,
possibly in another network namespace, or nfsd_shutdown_threads()
might be just about to get a reference on the service, after which
we can drop ours safely.
We cannot conveniently get wakeup notifications on these events,
and we are unlikely to need to, so we sleep briefly and check again.
With this we can discard nfsd_shutdown_complete and
nfsd_complete_shutdown(), and switch to svc_set_num_threads_sync.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
There is nothing happening in the start of nfsd() that requires
protection by the mutex, so don't take it until shutting down the thread
- which does still require protection - but only for nfsd_put().
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Using sv_lock means we don't need to hold the service mutex over these
updates.
In particular, svc_exit_thread() no longer requires synchronisation, so
threads can exit asynchronously.
Note that we could use an atomic_t, but as there are many more read
sites than writes, that would add unnecessary noise to the code.
Some reads are already racy, and there is no need for them to not be.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This allows us to move the updates for th_cnt out of the mutex.
This is a step towards reducing mutex coverage in nfsd().
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The use of sv_nrthreads as a general refcount results in clumsy code, as
is seen by various comments needed to explain the situation.
This patch introduces a 'struct kref' and uses that for reference
counting, leaving sv_nrthreads to be a pure count of threads. The kref
is managed particularly in svc_get() and svc_put(), and also nfsd_put();
svc_destroy() now takes a pointer to the embedded kref, rather than to
the serv.
nfsd allows the svc_serv to exist with ->sv_nrhtreads being zero. This
happens when a transport is created before the first thread is started.
To support this, a 'keep_active' flag is introduced which holds a ref on
the svc_serv. This is set when any listening socket is successfully
added (unless there are running threads), and cleared when the number of
threads is set. So when the last thread exits, the nfs_serv will be
destroyed.
The use of 'keep_active' replaces previous code which checked if there
were any permanent sockets.
We no longer clear ->rq_server when nfsd() exits. This was done
to prevent svc_exit_thread() from calling svc_destroy().
Instead we take an extra reference to the svc_serv to prevent
svc_destroy() from being called.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_destroy() is poorly named - it doesn't necessarily destroy the svc,
it might just reduce the ref count.
nfsd_destroy() is poorly named for the same reason.
This patch:
- removes the refcount functionality from svc_destroy(), moving it to
a new svc_put(). Almost all previous callers of svc_destroy() now
call svc_put().
- renames nfsd_destroy() to nfsd_put() and improves the code, using
the new svc_destroy() rather than svc_put()
- removes a few comments that explain the important for balanced
get/put calls. This should be obvious.
The only non-trivial part of this is that svc_destroy() would call
svc_sock_update() on a non-final decrement. It can no longer do that,
and svc_put() isn't really a good place of it. This call is now made
from svc_exit_thread() which seems like a good place. This makes the
call *before* sv_nrthreads is decremented rather than after. This
is not particularly important as the call just sets a flag which
causes sv_nrthreads set be checked later. A subsequent patch will
improve the ordering.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It is common for 'get' functions to return the object that was 'got',
and there are a couple of places where users of svc_get() would be a
little simpler if svc_get() did that.
Make it so.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If write_ports_add() fails, we shouldn't destroy the serv, unless we had
only just created it. So if there are any permanent sockets already
attached, leave the serv in place.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
/home/cel/src/linux/linux/fs/nfsd/nfs4proc.c:1539:24: warning: incorrect type in assignment (different base types)
/home/cel/src/linux/linux/fs/nfsd/nfs4proc.c:1539:24: expected restricted __be32 [usertype] status
/home/cel/src/linux/linux/fs/nfsd/nfs4proc.c:1539:24: got int
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Today the rules are a bit iffy and arbitrary about which kernel
threads have struct kthread present. Both idle threads and thread
started with create_kthread want struct kthread present so that is
effectively all kernel threads. Make the rule that if PF_KTHREAD
and the task is running then struct kthread is present.
This will allow the kernel thread code to using tsk->exit_code
with different semantics from ordinary processes.
To make ensure that struct kthread is present for all
kernel threads move it's allocation into copy_process.
Add a deallocation of struct kthread in exec for processes
that were kernel threads.
Move the allocation of struct kthread for the initial thread
earlier so that it is not repeated for each additional idle
thread.
Move the initialization of struct kthread into set_kthread_struct
so that the structure is always and reliably initailized.
Clear set_child_tid in free_kthread_struct to ensure the kthread
struct is reliably freed during exec. The function
free_kthread_struct does not need to clear vfork_done during exec as
exec_mm_release called from exec_mmap has already cleared vfork_done.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update complete_and_exit to call kthread_exit instead of do_exit.
Change the name to reflect this change in functionality. All of the
users of complete_and_exit are causing the current kthread to exit so
this change makes it clear what is happening.
Move the implementation of kthread_complete_and_exit from
kernel/exit.c to to kernel/kthread.c. As this function is kthread
specific it makes most sense to live with the kthread functions.
There are no functional change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update module_put_and_exit to call kthread_exit instead of do_exit.
Change the name to reflect this change in functionality. All of the
users of module_put_and_exit are causing the current kthread to exit
so this change makes it clear what is happening. There is no
functional change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We have two io-wq creation paths:
- On queue enqueue
- When a worker goes to sleep
The latter invokes worker creation with the wqe->lock held, but that can
run into problems if we end up exiting and need to cancel the queued work.
syzbot caught this:
============================================
WARNING: possible recursive locking detected
5.16.0-rc4-syzkaller #0 Not tainted
--------------------------------------------
iou-wrk-6468/6471 is trying to acquire lock:
ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187
but task is already holding lock:
ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&wqe->lock);
lock(&wqe->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by iou-wrk-6468/6471:
#0: ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700
stack backtrace:
CPU: 1 PID: 6471 Comm: iou-wrk-6468 Not tainted 5.16.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1dc/0x2d8 lib/dump_stack.c:106
print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
check_deadlock kernel/locking/lockdep.c:2999 [inline]
validate_chain+0x5984/0x8240 kernel/locking/lockdep.c:3788
__lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5027
lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5637
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:154
io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187
io_wq_cancel_tw_create fs/io-wq.c:1220 [inline]
io_queue_worker_create+0x3cf/0x4c0 fs/io-wq.c:372
io_wq_worker_sleeping+0xbe/0x140 fs/io-wq.c:701
sched_submit_work kernel/sched/core.c:6295 [inline]
schedule+0x67/0x1f0 kernel/sched/core.c:6323
schedule_timeout+0xac/0x300 kernel/time/timer.c:1857
wait_woken+0xca/0x1b0 kernel/sched/wait.c:460
unix_msg_wait_data net/unix/unix_bpf.c:32 [inline]
unix_bpf_recvmsg+0x7f9/0xe20 net/unix/unix_bpf.c:77
unix_stream_recvmsg+0x214/0x2c0 net/unix/af_unix.c:2832
sock_recvmsg_nosec net/socket.c:944 [inline]
sock_recvmsg net/socket.c:962 [inline]
sock_read_iter+0x3a7/0x4d0 net/socket.c:1035
call_read_iter include/linux/fs.h:2156 [inline]
io_iter_do_read fs/io_uring.c:3501 [inline]
io_read fs/io_uring.c:3558 [inline]
io_issue_sqe+0x144c/0x9590 fs/io_uring.c:6671
io_wq_submit_work+0x2d8/0x790 fs/io_uring.c:6836
io_worker_handle_work+0x808/0xdd0 fs/io-wq.c:574
io_wqe_worker+0x395/0x870 fs/io-wq.c:630
ret_from_fork+0x1f/0x30
We can safely drop the lock before doing work creation, making the two
contexts the same in that regard.
Reported-by: syzbot+b18b8be69df33a3918e9@syzkaller.appspotmail.com
Fixes: 71a8538754 ("io-wq: check for wq exit after adding new worker task_work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix a data corruption vector that can result from the ro remount
process failing to clear all speculative preallocations from files
and the rw remount process not noticing the incomplete cleanup.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmGyQ6UACgkQ+H93GTRK
tOskJA/+Jg+fn+Vs7dHwBN+SaHCBH81RT+YD9gbI7Lq5xcK16K7BWz3Hf360J9uJ
BpVn0fv6/+3bNg0/QMWUE67CotB2S43CzWnltUN0ymiGgAEO1eMKL1JHRtZ/89d4
pG2Lr9W1QsxMqP9L/TgxqHb5WC2xtromDLMWa7qYL8aH4qp4Dx6hv++u3qhbdTis
sfvTOMsDKTwK5RC+7yB2qxr/Txn7viIlBLjTWR3UvvO0K1LxjuBTXpBsiVeQE92T
ZqLgAhWDBu/wZFBibqtMsnLF0Wmn5oIZJgBoxp/+Ze8saUsuYdAI1X65ESdVmw7E
ZJtSyZObDHOlRhvDn7Vlb8BNF7QY4l4n2eeGFB4zoPdy0jjz/7QKCqMSm7YCMXqc
WycjeEIo1H+ZEk0AWu7KrTUAaO3zzqHB/DJt4zQA/w4QdSD14bAsyp8ZWyUEMnWy
7M0Pc0q2ptzOOp1rjd2J+0s9YSc24/dceW1lCDpIDSkOuxO82HdCxlWQ/YPEt2Lx
gGv2hll8Cg0j18VThB++XZ3ozImFu+sZhkQPPCM/kKY0tWZvd4Yby1bWLNzS+ppE
RIKh+hyhFTXmr0C2iJ8fGNFu/KA3V0efztvCIATHDmGcC6f4WePUQyLzZGFgS0Yw
uBPmvqWeUWySgfauIvqW7Oosf0pbKvYVEd+1fcLVdh0sgr2MUPE=
=N9Wf
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"This fixes a race between a readonly remount process and other
processes that hold a file IOLOCK on files that previously experienced
copy on write, that could result in severe filesystem corruption if
the filesystem is then remounted rw.
I think this is fairly rare (since the only reliable reproducer I have
that fits the second criteria is the experimental xfs_scrub program),
but the race is clear, so we still need to fix this.
Summary:
- Fix a data corruption vector that can result from the ro remount
process failing to clear all speculative preallocations from files
and the rw remount process not noticing the incomplete cleanup"
* tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove all COW fork extents when remounting readonly
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG0PwYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphbiEACv95PrGVuOzgXwC1Ydk4Km5Yyt8vANKcxA
LFzeScrRhGG28vHCLZuPSxtFn9nGjBJPDKscDWQE0tpq8NwB6BAjr3Ps2AR4tI3M
2bn1ssHIHADdkaV1ixwTtQynHgCvOtKSTmLJ+TTDIYdgwAyRVlRBtpke/nXn5VCR
kUdQ/3hHSPbEY+WvrDz7yAvDIMxTrBHmt8TbDiI1FXzNf8AOd32ekpBNzwe2hTQi
OLhpaQ60S74UG78c4OGg6SV+twW0nkp1DyTf8MhC3WoBBnTwQdc/BpLNJbnUwH57
lJVJTz0sknbnQCtifJ8/Wjzn6r95cOwbYvaGgXyi7KtTGn0QXLcK+FohNVizFR0x
qTMWlFJ5aPmzqed6oMZ2m8iaGxR9813I0GP54Hsqw4PmoPb1GDN5FvZIN+MKdaJd
PSp9YH7RDHsusYZjjaUtHIcxhxztblf5wNVAkmkfrFZzNArnwsw0k9gX7DI0f+GX
+TazZzX8gLRonAskBHgoSDT+8WOzDOHYYIVVe0g/8TR+jW8oGDd9D0sJbH74dV1Y
3C4WLbJCquBzatotmcmgkt9CJzZz9y373cVVNu4HZgH49qjEZCJuFgBaTpanvwHk
Qk5KTl9yIfjs+H2FDWa2Pd6LnsVvuIpwOd4R37RDLGM2ZiX100eX1VbD6RxtIehQ
PPKP13kYAQ==
=BUVd
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few fixes that are all bound for stable:
- Two syzbot reports for io-wq that turned out to be separate fixes,
but ultimately very closely related
- io_uring task_work running on cancelations"
* tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block:
io-wq: check for wq exit after adding new worker task_work
io_uring: ensure task_work gets run as part of cancelations
io-wq: remove spurious bit clear on task_work addition
Instead of referencing the inode from a dentry via dentry->d_inode, use
the helper function d_inode(dentry) instead. This is the considered the
correct way to access it.
Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Reported: https://lore.kernel.org/all/20211208104454.nhxyvmmn6d2qhpwl@wittgenstein/
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmGwxl8ACgkQxWXV+ddt
WDvSeQ//T+mv4o7ucldQZCVdN0TTVCkQUhia+ZdMwBcPty2/ZEdap+KEIVmfCV/v
OLRmSNkIPDhHcIc/O3zJ1/AY0DFbb9brYGkMD/qidgPbRArhDZSrDIr+xnrKJ0iq
HFxM01B54l8hJe6GWIGFuuOz+nXUP1o9SfiDOwMDTkqgzz1JSvPec70RKMxG8pTC
4plVrGaUXkKTC8WyBXnSvkP2gvjfJxqnEKv2Ru1eP1t7Bq65aed0+Z32Mogzl2ip
ZSHVlRergeB03xF9YErWSfgofVWLEIXgLCh/Wfq73sMnHUHUGM/dpEKxP911MI00
A5TuTl3I25cVnDv3qMayBPECMCCGZc2vUyXTLCWUNYLb/vEUTI36QBu/KglRURmO
Zx20B+3/7Yu9pFeZ23S+nlNdGDADmjkOfvZIZDoYDzqBAKWM6kZ/9oWiib21Uwi6
ql5oZNl7G6UXLPvgJoq8dqZIj/HYeLEHeqwf/tepSVQLXYzQyWYDMp686z958XI1
K/A/TKaKk19nn1Dhsz4KJeb3xhMFlFN30K7BNjdH3XRH1RrwJP6pLF/WUduJq2qn
rAvJoODLHkby1D9+HqSKW/RToeR5NbBYYdfB0Q8zpy9zF/gZX7wgFc21/4B7qtib
hkKfNSfAVjedOJJWTFCUDnZwwrqtcrKxYzUpdcls/O1AjifK7+U=
=Mh19
-----END PGP SIGNATURE-----
Merge tag 'for-5.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more regression fixes and stable patches, mostly one-liners.
Regression fixes:
- fix pointer/ERR_PTR mismatch returned from memdup_user
- reset dedicated zoned mode relocation block group to avoid using it
and filling it without any recourse
Fixes:
- handle a case to FITRIM range (also to make fstests/generic/260
work)
- fix warning when extent buffer state and pages get out of sync
after an IO error
- fix transaction abort when syncing due to missing mapping error set
on metadata inode after inlining a compressed file
- fix transaction abort due to tree-log and zoned mode interacting in
an unexpected way
- fix memory leak of additional extent data when qgroup reservation
fails
- do proper handling of slot search call when deleting root refs"
* tag 'for-5.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling
btrfs: zoned: clear data relocation bg on zone finish
btrfs: free exchange changeset on failures
btrfs: fix re-dirty process of tree-log nodes
btrfs: call mapping_set_error() on btree inode with a write error
btrfs: clear extent buffer uptodate when we fail to write it
btrfs: fail if fstrim_range->start == U64_MAX
btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmGzuJoACgkQiiy9cAdy
T1H9MAv/cOAk4iUhgDUsa+HYsBNiAQ+IOOu/WOZI56jEV/qVP0I3cpkiIJenG5gu
ZurivI4smsgNokHOFoT3vjtLXTfTl0OdHHY/mftd5IGIPG+KnXcg3+ZaE3T+fUV3
uHX0cH3a3Azo3RGf2rRiTfW6u9FXJnb9aAUTif7UDVwsU37wUAQrKEmcazaUXdaT
+j3KpqwGhCgbkneKuAd/FDTAg4wciJgRg3aE/4W2s2ovFiF6vsUcrmhQD1zP1EZh
sPWdx4/U+WpeV02RfKBLQlXi6ofqRF5qRT4HkD07G5Zhz8OOZcm06wYFNl/+aYhF
lTOTa9KAoTPfV2tFlTGZVEy07ggEFd3wDxeAPulDrqY8etwWSwRAHHRi5HstMlIX
iYz06DLS+LSPyriy6GV6CIL01OPg8vkeLPbrLFbs8oLSVQgmx74UDS339oxJMdFe
kkhLgtfRr5DfxS3uD0u17aBDL4ullH84dWdlsUKiUfGvVsBcoVSCPIQeWqkB8xjA
OTKih3J1
=+1L9
-----END PGP SIGNATURE-----
Merge tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Two cifs/smb3 fixes - one for stable, the other fixes a recently
reported NTLMSSP auth problem"
* tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix ntlmssp auth when there is no key exchange
cifs: Fix crash on unload of cifs_arc4.ko
has been around for years, but I suspect recent changes may have
widened the race window a little, so I'd like to go ahead and get it in.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGzuB8VHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+rYQP/1N61E3rf2MOfF4sluubaGGi0fVR
cx3w8pTDSALfxAiz0ZTA6hJbZN0SR7MZuXMhWDHT/RiG7jFfuMJACU81BWn0cCPJ
zuJziE3Y6fElGi5Ifp71yMLABqTlFPOUHBZkpzaADqXGwmZk4fecXmqKbtqa9kCU
1mCQJlZDGU1ljGVYuXU4SaBOzsrCpPfbxby2ZWFDs8bu5cGKDTq9OSVF1LaGzjMN
2+v0y2Qni+dRfTBCGAmxF8AukRa2r5i2rfHRatgKQO8PdROv/pguy1hYjjeeofhX
/Jlq2RIVl1/9/Fwar4Cew2+fpQhpaY62wUuo4KG4PK5WZ92LGDjSpYqvu/I4lI6z
BXdge0HzDPRnpZtD+tu4IJ7nQwG+JnO2Ws91rPxwT6DDX4zX9gNZc+7EGevVeXJI
XTO+61Tyz+7/6YLR8UvnvyER9KvKjQiUWZkO81LZpuAAwn1PE3rahD48jxkmwwYE
m5EvfuA6YR8oyk3Ml+nZNuKwgrNqPcvohyJz4QHpNpqmvOlVrb9I1k90JzIm8D9Q
09M8An1Z3AGw76+RMnvBnWewn1Nx1bKIS32SJk9x3zBoeyXLS17G6FmpEhW81xvN
LoToRqD/GNx+4AtfvkHqwv2/Qd1rvONCsjbZv+2P0DAj8ZEOWn0YHk18WlvAIZUg
4ipyZbwMUbXeytKP
=GgUu
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Fix a race on startup and another in the delegation code.
The latter has been around for years, but I suspect recent changes may
have widened the race window a little, so I'd like to go ahead and get
it in"
* tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linux:
nfsd: fix use-after-free due to delegation race
nfsd: Fix nsfd startup race (again)
Added a new sysfs node called gc_urgent_high_remaining. The user can
set the trial count limit for GC urgent high mode with this value. If
GC thread gets to the limit, the mode will turn back to GC normal mode.
By default, the value is zero, which means there is no limit like before.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In fuzzed image, SSA table may indicate that a data block belongs to
invalid node, which node ID is out-of-range (0, 1, 2 or max_nid), in
order to avoid migrating inconsistent data in such corrupted image,
let's do sanity check anyway before data block migration.
Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As report by Wenqing Liu in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215231
If we enable CONFIG_F2FS_CHECK_FS config, and with fuzzed image attached
in above link, we will encounter panic when executing below script:
1. mkdir mnt
2. mount -t f2fs tmp1.img mnt
3. touch tmp
F2FS-fs (loop11): mismatched blkaddr 5765 (source_blkaddr 1) in seg 3
kernel BUG at fs/f2fs/gc.c:1042!
do_garbage_collect+0x90f/0xa80 [f2fs]
f2fs_gc+0x294/0x12a0 [f2fs]
f2fs_balance_fs+0x2c5/0x7d0 [f2fs]
f2fs_create+0x239/0xd90 [f2fs]
lookup_open+0x45e/0xa90
open_last_lookups+0x203/0x670
path_openat+0xae/0x490
do_filp_open+0xbc/0x160
do_sys_openat2+0x2f1/0x500
do_sys_open+0x5e/0xa0
__x64_sys_openat+0x28/0x40
Previously, f2fs tries to catch data inconcistency exception in between
SSA and SIT table during GC, however once the exception is caught, it will
call f2fs_bug_on to hang kernel, it's not needed, instead, let's set
SBI_NEED_FSCK flag and skip migrating current block.
Fixes: bbf9f7d90f ("f2fs: Fix indefinite loop in f2fs_gc()")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As report by Wenqing Liu in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=215231
- Overview
kernel NULL pointer dereference triggered in folio_mark_dirty() when mount and operate on a crafted f2fs image
- Reproduce
tested on kernel 5.16-rc3, 5.15.X under root
1. mkdir mnt
2. mount -t f2fs tmp1.img mnt
3. touch tmp
4. cp tmp mnt
F2FS-fs (loop0): sanity_check_inode: inode (ino=49) extent info [5942, 4294180864, 4] is incorrect, run fsck to fix
F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=31340049, run fsck to fix.
BUG: kernel NULL pointer dereference, address: 0000000000000000
folio_mark_dirty+0x33/0x50
move_data_page+0x2dd/0x460 [f2fs]
do_garbage_collect+0xc18/0x16a0 [f2fs]
f2fs_gc+0x1d3/0xd90 [f2fs]
f2fs_balance_fs+0x13a/0x570 [f2fs]
f2fs_create+0x285/0x840 [f2fs]
path_openat+0xe6d/0x1040
do_filp_open+0xc5/0x140
do_sys_openat2+0x23a/0x310
do_sys_open+0x57/0x80
The root cause is for special file: e.g. character, block, fifo or socket file,
f2fs doesn't assign address space operations pointer array for mapping->a_ops field,
so, in a fuzzed image, SSA table indicates a data block belong to special file, when
f2fs tries to migrate that block, it causes NULL pointer access once move_data_page()
calls a_ops->set_dirty_page().
Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This information can be used to check how much time we need to give to issue
all the discard commands.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, compressed page cache drop when clean page cache, but
POSIX_FADV_DONTNEED can't clean compressed page cache because raw page
don't have private data, and won't call f2fs_invalidate_compress_pages.
This commit call f2fs_invalidate_compress_pages() directly in
f2fs_file_fadvise() for POSIX_FADV_DONTNEED case.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since compress inode not a regular file, generic_error_remove_page in
f2fs_invalidate_compress_pages will always be failed, set compress
inode as a regular file to fix it.
Fixes: 6ce19aff0b ("f2fs: compress: add compress_inode to cache compressed blocks")
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make f2fs_file_read_iter() and f2fs_file_write_iter() use the iomap
direct I/O implementation instead of the fs/direct-io.c one.
The iomap implementation is more efficient, and it also avoids the need
to add new features and optimizations to the old implementation.
This new implementation also eliminates the need for f2fs to hook bio
submission and completion and to allocate memory per-bio. This is
because it's possible to correctly update f2fs's in-flight DIO counters
using __iomap_dio_rw() in combination with an implementation of
iomap_dio_ops::end_io() (as suggested by Christoph Hellwig).
When possible, this new implementation preserves existing f2fs behavior
such as the conditions for falling back to buffered I/O.
This patch has been tested with xfstests by running 'gce-xfstests -c
f2fs -g auto -X generic/017' with and without this patch; no regressions
were seen. (Some tests fail both before and after. generic/017 hangs
both before and after, so it had to be excluded.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
[Jaegeuk Kim: use spin_lock_bh for f2fs_update_iostat in softirq]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Have tracefs honor the gid mount option
- Have new files in tracefs inherit the parent ownership
- Have direct_ops unregister when it has no more functions
- Properly clean up the ops when unregistering multi direct ops
- Add a sample module to test the multiple direct ops
- Fix memory leak in error path of __create_synth_event()
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYbOgPBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qgOtAP0YD+cRLxnRKA376oQVB8zmuZ3mZ/4x
6M1hqruSDlno3AEA19PyHpxl7flFwnBb6Gnfo9VeefcMS5ENDH9p1aHX4wU=
=Tr6t
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Tracing, ftrace and tracefs fixes:
- Have tracefs honor the gid mount option
- Have new files in tracefs inherit the parent ownership
- Have direct_ops unregister when it has no more functions
- Properly clean up the ops when unregistering multi direct ops
- Add a sample module to test the multiple direct ops
- Fix memory leak in error path of __create_synth_event()"
* tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix possible memory leak in __create_synth_event() error path
ftrace/samples: Add module to test multi direct modify interface
ftrace: Add cleanup to unregister_ftrace_direct_multi
ftrace: Use direct_ops hash in unregister_ftrace_direct
tracefs: Set all files to the same group ownership as the mount option
tracefs: Have new files inherit the ownership of their parent
Fix three bugs in aio poll, and one issue with POLLFREE more broadly:
- aio poll didn't handle POLLFREE, causing a use-after-free.
- aio poll could block while the file is ready.
- aio poll called eventfd_signal() when it isn't allowed.
- POLLFREE didn't handle multiple exclusive waiters correctly.
This has been tested with the libaio test suite, as well as with test
programs I wrote that reproduce the first two bugs. I am sending this
pull request myself as no one seems to be maintaining this code.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYbObthQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+3mAQC9W8ApzBleEPI6FXzIIo5AiQT/2jGl
7FbO1MtkdUBU4QEAzf+VWl4Z4BJTgxl44avRdVDpXGAMnbWkd7heY+e3HwA=
=mp+r
-----END PGP SIGNATURE-----
Merge tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull aio poll fixes from Eric Biggers:
"Fix three bugs in aio poll, and one issue with POLLFREE more broadly:
- aio poll didn't handle POLLFREE, causing a use-after-free.
- aio poll could block while the file is ready.
- aio poll called eventfd_signal() when it isn't allowed.
- POLLFREE didn't handle multiple exclusive waiters correctly.
This has been tested with the libaio test suite, as well as with test
programs I wrote that reproduce the first two bugs. I am sending this
pull request myself as no one seems to be maintaining this code"
* tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
aio: Fix incorrect usage of eventfd_signal_allowed()
aio: fix use-after-free due to missing POLLFREE handling
aio: keep poll requests on waitqueue until completed
signalfd: use wake_up_pollfree()
binder: use wake_up_pollfree()
wait: add wake_up_pollfree()
We check IO_WQ_BIT_EXIT before attempting to create a new worker, and
wq exit cancels pending work if we have any. But it's possible to have
a race between the two, where creation checks exit finding it not set,
but we're in the process of exiting. The exit side will cancel pending
creation task_work, but there's a gap where we add task_work after we've
canceled existing creations at exit time.
Fix this by checking the EXIT bit post adding the creation task_work.
If it's set, run the same cancelation that exit does.
Reported-and-tested-by: syzbot+b60c982cb0efc5e05a47@syzkaller.appspotmail.com
Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we successfully cancel a work item but that work item needs to be
processed through task_work, then we can be sleeping uninterruptibly
in io_uring_cancel_generic() and never process it. Hence we don't
make forward progress and we end up with an uninterruptible sleep
warning.
While in there, correct a comment that should be IFF, not IIF.
Reported-and-tested-by: syzbot+21e6887c0be14181206d@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A delegation break could arrive as soon as we've called vfs_setlease. A
delegation break runs a callback which immediately (in
nfsd4_cb_recall_prepare) adds the delegation to del_recall_lru. If we
then exit nfs4_set_delegation without hashing the delegation, it will be
freed as soon as the callback is done with it, without ever being
removed from del_recall_lru.
Symptoms show up later as use-after-free or list corruption warnings,
usually in the laundromat thread.
I suspect aba2072f45 "nfsd: grant read delegations to clients holding
writes" made this bug easier to hit, but I looked as far back as v3.0
and it looks to me it already had the same problem. So I'm not sure
where the bug was introduced; it may have been there from the beginning.
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit bd5ae9288d ("nfsd: register pernet ops last, unregister first")
has re-opened rpc_pipefs_event() race against nfsd_net_id registration
(register_pernet_subsys()) which has been fixed by commit bb7ffbf29e
("nfsd: fix nsfd startup race triggering BUG_ON").
Restore the order of register_pernet_subsys() vs register_cld_notifier().
Add WARN_ON() to prevent a future regression.
Crash info:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000012
CPU: 8 PID: 345 Comm: mount Not tainted 5.4.144-... #1
pc : rpc_pipefs_event+0x54/0x120 [nfsd]
lr : rpc_pipefs_event+0x48/0x120 [nfsd]
Call trace:
rpc_pipefs_event+0x54/0x120 [nfsd]
blocking_notifier_call_chain
rpc_fill_super
get_tree_keyed
rpc_fs_get_tree
vfs_get_tree
do_mount
ksys_mount
__arm64_sys_mount
el0_svc_handler
el0_svc
Fixes: bd5ae9288d ("nfsd: register pernet ops last, unregister first")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Remove unused match_table_t, slim down mount_opts structure by removing
unnecessary definitions, remove redundant MOPT_ flags and clean up
ext4_parse_param() by converting the most of the if/else branching to
switch except for the MOPT_SET/MOPT_CEAR handling.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-14-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add the necessary functions for the fs_context_operations. Convert and
rename ext4_remount() and ext4_fill_super() to ext4_get_tree() and
ext4_reconfigure() respectively and switch the ext4 to use the new api.
One user facing change is the fact that we no longer have access to the
entire string of mount options provided by mount(2) since the mount api
does not store it anywhere. As a result we can't print the options to
the log as we did in the past after the successful mount.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-13-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Change token2str() to use ext4_param_specs instead of tokens so that we
can get rid of tokens entirely.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-12-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Clean up return values in handle_mount_opt() and rename the function to
ext4_parse_param()
Now we can use it in fs_context_operations as .parse_param.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-11-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The new mount api separates option parsing and super block setup into
two distinct steps and so we need to separate the options parsing out of
the ext4_fill_super() and ext4_remount().
In order to achieve this we have to create new ext4_fill_super() and
ext4_remount() functions which will serve its purpose only until we
actually do convert to the new api (as such they are only temporary for
this patch series) and move the option parsing out of the old function
which will now be renamed to __ext4_fill_super() and __ext4_remount().
There is a small complication in the fact that while the mount option
parsing is going to happen before we get to __ext4_fill_super(), the
mount options stored in the super block itself needs to be applied
first, before the user specified mount options.
So with this patch we're going through the following sequence:
- parse user provided options (including sb block)
- initialize sbi and store s_sb_block if provided
- in __ext4_fill_super()
- read the super block
- parse and apply options specified in s_mount_opts
- check and apply user provided options stored in ctx
- continue with the regular ext4_fill_super operation
It's not exactly the most elegant solution, but if we still want to
support s_mount_opts we have to do it in this order.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-10-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At the parsing phase of mount in the new mount api sb will not be
available. We've already removed some uses of sb and sbi, but now we
need to get rid of the rest of it.
Use ext4_fs_context to store all of the configuration specification so
that it can be later applied to the super block and sbi.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-9-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At the parsing phase of mount in the new mount api sb will not be
available so move ext2/3 compatibility check outside handle_mount_opt().
Unfortunately we will lose the ability to show exactly which option is
not compatible.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-8-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At the parsing phase of mount in the new mount api sb will not be
available so move quota confiquration out of handle_mount_opt() by
noting the quota file names in the ext4_fs_context structure to be
able to apply it later.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-7-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At the parsing phase of mount in the new mount api sb will not be
available so allow sb to be NULL in ext4_msg and use that in
handle_mount_opt().
Also change return value to appropriate -EINVAL where needed.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-6-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use the new mount option specifications to parse the options in
handle_mount_opt(). However we're still using the old API to get the
options string.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-5-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Move option validation out of parse_options() into a separate function
ext4_validate_options().
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-4-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Create an array of fs_parameter_spec called ext4_param_specs to
hold the mount option specifications we're going to be using with the
new mount api.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-3-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Allow parameter value to be empty by specifying fs_param_can_be_empty
flag.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Link: https://lore.kernel.org/r/20211027141857.33657-2-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We should defer eventfd_signal() to the workqueue when
eventfd_signal_allowed() return false rather than return
true.
Fixes: b542e383d8 ("eventfd: Make signal recursion protection a task bit")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20210913111928.98-1-xieyongji@bytedance.com
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
signalfd_poll() and binder_poll() are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case. This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution. This solution is for the queue to be cleared
before it is freed, by sending a POLLFREE notification to all waiters.
Unfortunately, only eventpoll handles POLLFREE. A second type of
non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't
handle POLLFREE. This allows a use-after-free to occur if a signalfd or
binder fd is polled with aio poll, and the waitqueue gets freed.
Fix this by making aio poll handle POLLFREE.
A patch by Ramji Jiyani <ramjiyani@google.com>
(https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com)
tried to do this by making aio_poll_wake() always complete the request
inline if POLLFREE is seen. However, that solution had two bugs.
First, it introduced a deadlock, as it unconditionally locked the aio
context while holding the waitqueue lock, which inverts the normal
locking order. Second, it didn't consider that POLLFREE notifications
are missed while the request has been temporarily de-queued.
The second problem was solved by my previous patch. This patch then
properly fixes the use-after-free by handling POLLFREE in a
deadlock-free way. It does this by taking advantage of the fact that
freeing of the waitqueue is RCU-delayed, similar to what eventpoll does.
Fixes: 2c14fa838c ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Currently, aio_poll_wake() will always remove the poll request from the
waitqueue. Then, if aio_poll_complete_work() sees that none of the
polled events are ready and the request isn't cancelled, it re-adds the
request to the waitqueue. (This can easily happen when polling a file
that doesn't pass an event mask when waking up its waitqueue.)
This is fundamentally broken for two reasons:
1. If a wakeup occurs between vfs_poll() and the request being
re-added to the waitqueue, it will be missed because the request
wasn't on the waitqueue at the time. Therefore, IOCB_CMD_POLL
might never complete even if the polled file is ready.
2. When the request isn't on the waitqueue, there is no way to be
notified that the waitqueue is being freed (which happens when its
lifetime is shorter than the struct file's). This is supposed to
happen via the waitqueue entries being woken up with POLLFREE.
Therefore, leave the requests on the waitqueue until they are actually
completed (or cancelled). To keep track of when aio_poll_complete_work
needs to be scheduled, use new fields in struct poll_iocb. Remove the
'done' field which is now redundant.
Note that this is consistent with how sys_poll() and eventpoll work;
their wakeup functions do *not* remove the waitqueue entries.
Fixes: 2c14fa838c ("aio: implement IOCB_CMD_POLL")
Cc: <stable@vger.kernel.org> # v4.18+
Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
wake_up_poll() uses nr_exclusive=1, so it's not guaranteed to wake up
all exclusive waiters. Yet, POLLFREE *must* wake up all waiters. epoll
and aio poll are fortunately not affected by this, but it's very
fragile. Thus, the new function wake_up_pollfree() has been introduced.
Convert signalfd to use wake_up_pollfree().
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: d80e731eca ("epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Since the new type of chunk-based files is introduced, there is no
need to leave flatmode tracepoints.
Rename to erofs_map_blocks instead.
Link: https://lore.kernel.org/r/20211209012918.30337-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Warn on the lack of key exchange during NTLMSSP authentication rather
than aborting it as there are some servers that do not set it in
CHALLENGE message.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In previous patches, we have already gathered some tw with
io_req_task_complete() as callback in prior_task_list, let's complete
them in batch while we cannot grab uring lock. In this way, we batch
the req_complete_post path.
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211208052125.351587-1-haoxu@linux.alibaba.com
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@bytes also holds the return value from iomap_write_end, which can
contain a negative error value. As @bytes is always less than the page
size even the signed type can hold the entire possible range.
Fixes: c6f4046865 ("fsdax: decouple zeroing from the iomap buffered I/O code")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211208091203.2927754-1-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
I hit the BUG_ON() with generic/475 test case, and to my surprise, all
callers of btrfs_del_root_ref() are already aborting transaction, thus
there is not need for such BUG_ON(), just go to @out label and caller
will properly handle the error.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When finishing a zone that is used by a dedicated data relocation
block group, also remove its reference from fs_info, so we're not trying
to use a full block group for allocations during data relocation, which
will always fail.
The result is we're not making any forward progress and end up in a
deadlock situation.
Fixes: c2707a2556 ("btrfs: zoned: add a dedicated data relocation block group")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fstests runs on my VMs have show several kmemleak reports like the following.
unreferenced object 0xffff88811ae59080 (size 64):
comm "xfs_io", pid 12124, jiffies 4294987392 (age 6.368s)
hex dump (first 32 bytes):
00 c0 1c 00 00 00 00 00 ff cf 1c 00 00 00 00 00 ................
90 97 e5 1a 81 88 ff ff 90 97 e5 1a 81 88 ff ff ................
backtrace:
[<00000000ac0176d2>] ulist_add_merge+0x60/0x150 [btrfs]
[<0000000076e9f312>] set_state_bits+0x86/0xc0 [btrfs]
[<0000000014fe73d6>] set_extent_bit+0x270/0x690 [btrfs]
[<000000004f675208>] set_record_extent_bits+0x19/0x20 [btrfs]
[<00000000b96137b1>] qgroup_reserve_data+0x274/0x310 [btrfs]
[<0000000057e9dcbb>] btrfs_check_data_free_space+0x5c/0xa0 [btrfs]
[<0000000019c4511d>] btrfs_delalloc_reserve_space+0x1b/0xa0 [btrfs]
[<000000006d37e007>] btrfs_dio_iomap_begin+0x415/0x970 [btrfs]
[<00000000fb8a74b8>] iomap_iter+0x161/0x1e0
[<0000000071dff6ff>] __iomap_dio_rw+0x1df/0x700
[<000000002567ba53>] iomap_dio_rw+0x5/0x20
[<0000000072e555f8>] btrfs_file_write_iter+0x290/0x530 [btrfs]
[<000000005eb3d845>] new_sync_write+0x106/0x180
[<000000003fb505bf>] vfs_write+0x24d/0x2f0
[<000000009bb57d37>] __x64_sys_pwrite64+0x69/0xa0
[<000000003eba3fdf>] do_syscall_64+0x43/0x90
In case brtfs_qgroup_reserve_data() or btrfs_delalloc_reserve_metadata()
fail the allocated extent_changeset will not be freed.
So in btrfs_check_data_free_space() and btrfs_delalloc_reserve_space()
free the allocated extent_changeset to get rid of the allocated memory.
The issue currently only happens in the direct IO write path, but only
after 65b3c08606e5 ("btrfs: fix ENOSPC failure when attempting direct IO
write into NOCOW range"), and also at defrag_one_locked_target(). Every
other place is always calling extent_changeset_free() even if its call
to btrfs_delalloc_reserve_space() or btrfs_check_data_free_space() has
failed.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a report of a transaction abort of -EAGAIN with the following
script.
#!/bin/sh
for d in sda sdb; do
mkfs.btrfs -d single -m single -f /dev/\${d}
done
mount /dev/sda /mnt/test
mount /dev/sdb /mnt/scratch
for dir in test scratch; do
echo 3 >/proc/sys/vm/drop_caches
fio --directory=/mnt/\${dir} --name=fio.\${dir} --rw=read --size=50G --bs=64m \
--numjobs=$(nproc) --time_based --ramp_time=5 --runtime=480 \
--group_reporting |& tee /dev/shm/fio.\${dir}
echo 3 >/proc/sys/vm/drop_caches
done
for d in sda sdb; do
umount /dev/\${d}
done
The stack trace is shown in below.
[3310.967991] BTRFS: error (device sda) in btrfs_commit_transaction:2341: errno=-11 unknown (Error while writing out transaction)
[3310.968060] BTRFS info (device sda): forced readonly
[3310.968064] BTRFS warning (device sda): Skipping commit of aborted transaction.
[3310.968065] ------------[ cut here ]------------
[3310.968066] BTRFS: Transaction aborted (error -11)
[3310.968074] WARNING: CPU: 14 PID: 1684 at fs/btrfs/transaction.c:1946 btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968131] CPU: 14 PID: 1684 Comm: fio Not tainted 5.14.10-300.fc35.x86_64 #1
[3310.968135] Hardware name: DIAWAY Tartu/Tartu, BIOS V2.01.B10 04/08/2021
[3310.968137] RIP: 0010:btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968144] RSP: 0018:ffffb284ce393e10 EFLAGS: 00010282
[3310.968147] RAX: 0000000000000026 RBX: ffff973f147b0f60 RCX: 0000000000000027
[3310.968149] RDX: ffff974ecf098a08 RSI: 0000000000000001 RDI: ffff974ecf098a00
[3310.968150] RBP: ffff973f147b0f08 R08: 0000000000000000 R09: ffffb284ce393c48
[3310.968151] R10: ffffb284ce393c40 R11: ffffffff84f47468 R12: ffff973f101bfc00
[3310.968153] R13: ffff971f20cf2000 R14: 00000000fffffff5 R15: ffff973f147b0e58
[3310.968154] FS: 00007efe65468740(0000) GS:ffff974ecf080000(0000) knlGS:0000000000000000
[3310.968157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3310.968158] CR2: 000055691bcbe260 CR3: 000000105cfa4001 CR4: 0000000000770ee0
[3310.968160] PKRU: 55555554
[3310.968161] Call Trace:
[3310.968167] ? dput+0xd4/0x300
[3310.968174] btrfs_sync_file+0x3f1/0x490
[3310.968180] __x64_sys_fsync+0x33/0x60
[3310.968185] do_syscall_64+0x3b/0x90
[3310.968190] entry_SYSCALL_64_after_hwframe+0x44/0xae
[3310.968194] RIP: 0033:0x7efe6557329b
[3310.968200] RSP: 002b:00007ffe0236ebc0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[3310.968203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efe6557329b
[3310.968204] RDX: 0000000000000000 RSI: 00007efe58d77010 RDI: 0000000000000006
[3310.968205] RBP: 0000000004000000 R08: 0000000000000000 R09: 00007efe58d77010
[3310.968207] R10: 0000000016cacc0c R11: 0000000000000293 R12: 00007efe5ce95980
[3310.968208] R13: 0000000000000000 R14: 00007efe6447c790 R15: 0000000c80000000
[3310.968212] ---[ end trace 1a346f4d3c0d96ba ]---
[3310.968214] BTRFS: error (device sda) in cleanup_transaction:1946: errno=-11 unknown
The abort occurs because of a write hole while writing out freeing tree
nodes of a tree-log tree. For zoned btrfs, we re-dirty a freed tree
node to ensure btrfs can write the region and does not leave a hole on
write on a zoned device. The current code fails to re-dirty a node
when the tree-log tree's depth is greater or equal to 2. That leads to
a transaction abort with -EAGAIN.
Fix the issue by properly re-dirtying a node on walking up the tree.
Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.12+
Link: https://github.com/kdave/btrfs-progs/issues/415
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
generic/484 fails sometimes with compression on because the write ends
up small enough that it goes into the btree. This means that we never
call mapping_set_error() on the inode itself, because the page gets
marked as fine when we inline it into the metadata. When the metadata
writeback happens we see it and abort the transaction properly and mark
the fs as readonly, however we don't do the mapping_set_error() on
anything. In syncfs() we will simply return 0 if the sb is marked
read-only, so we can't check for this in our syncfs callback. The only
way the error gets returned if we called mapping_set_error() on
something. Fix this by calling mapping_set_error() on the btree inode
mapping. This allows us to properly return an error on syncfs and pass
generic/484 with compression on.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I got dmesg errors on generic/281 on our overnight fstests. Looking at
the history this happens occasionally, with errors like this
WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50
CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G W 5.16.0-rc2+ #469
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffae598230bc60 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000
RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340
RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000
R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0
R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000
FS: 0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0
Call Trace:
extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1d6/0x430
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x270/0x5a0
worker_thread+0x55/0x3c0
? process_one_work+0x5a0/0x5a0
kthread+0x174/0x1a0
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
This happens because we're trying to read from a extent buffer page that
is !PageUptodate. This happens because we will clear the page uptodate
when we have an IO error, but we don't clear the extent buffer uptodate.
If we do a read later and find this extent buffer we'll think its valid
and not return an error, and then trip over this warning.
Fix this by also clearing uptodate on the extent buffer when this
happens, so that we get an error when we do a btrfs_search_slot() and
find this block later.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've always been failing generic/260 because it's testing things we
actually don't care about and thus won't fail for. However we probably
should fail for fstrim_range->start == U64_MAX since we clearly can't
trim anything past that. This in combination with an update to
generic/260 will allow us to pass this test properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If memdup_user() fails the error handing will crash when it tries
to kfree() an error pointer. Just return directly because there is
no cleanup required.
Fixes: 1a15eb724a ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As people have been asking to allow non-root processes to have access to
the tracefs directory, it was considered best to only allow groups to have
access to the directory, where it is easier to just set the tracefs file
system to a specific group (as other would be too dangerous), and that way
the admins could pick which processes would have access to tracefs.
Unfortunately, this broke tooling on Android that expected the other bit
to be set. For some special cases, for non-root tools to trace the system,
tracefs would be mounted and change the permissions of the top level
directory which gave access to all running tasks permission to the
tracing directory. Even though this would be dangerous to do in a
production environment, for testing environments this can be useful.
Now with the new changes to not allow other (which is still the proper
thing to do), it breaks the testing tooling. Now more code needs to be
loaded on the system to change ownership of the tracing directory.
The real solution is to have tracefs honor the gid=xxx option when
mounting. That is,
(tracing group tracing has value 1003)
mount -t tracefs -o gid=1003 tracefs /sys/kernel/tracing
should have it that all files in the tracing directory should be of the
given group.
Copy the logic from d_walk() from dcache.c and simplify it for the mount
case of tracefs if gid is set. All the files in tracefs will be walked and
their group will be set to the value passed in.
Link: https://lkml.kernel.org/r/20211207171729.2a54e1b3@gandalf.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Yabin Cui <yabinc@google.com>
Fixes: 49d67e4457 ("tracefs: Have tracefs directories not set OTH permission bits by default")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If directories in tracefs have their ownership changed, then any new files
and directories that are created under those directories should inherit
the ownership of the director they are created in.
Link: https://lkml.kernel.org/r/20211208075720.4855d180@gandalf.local.home
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Yabin Cui <yabinc@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: stable@vger.kernel.org
Fixes: 4282d60689 ("tracefs: Add new tracefs file system")
Reported-by: Kalesh Singh <kaleshsingh@google.com>
Reported: https://lore.kernel.org/all/CAC_TJve8MMAv+H_NdLSJXZUSoxOEq2zB_pVaJ9p=7H6Bu3X76g@mail.gmail.com/
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The exit function is wrongly placed in the __init section and this leads
to a crash when the module is unloaded. Just remove both the init and
exit functions since this module does not need them.
Fixes: 71c0286324 ("cifs: fork arc4 and create a separate module...")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Steve French <stfrench@microsoft.com>
Although readpage is a synchronous path, there will be no additional
kworker scheduling overhead in non-atomic contexts together with
dm-verity.
Let's add a sysfs node to disable sync decompression as an option.
Link: https://lore.kernel.org/r/20211206143552.8384-1-huangjianan@oppo.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Now we have a lot of task_work users, some are just to complete a req
and generate a cqe. Let's put the work to a new tw list which has a
higher priority, so that it can be handled quickly and thus to reduce
avg req latency and users can issue next round of sqes earlier.
An explanatory case:
origin timeline:
submit_sqe-->irq-->add completion task_work
-->run heavy work0~n-->run completion task_work
now timeline:
submit_sqe-->irq-->add completion task_work
-->run completion task_work-->run heavy work0~n
Limitation: this optimization is only for those that submission and
reaping process are in different threads. Otherwise anyhow we have to
submit new sqes after returning to userspace, then the order of TWs
doesn't matter.
Tested this patch(and the following ones) by manually replace
__io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct
'heavy' task works. Then test with fio:
ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1
Tried various iodepth.
The peak IOPS for this patch is 710K, while the old one is 665K.
For avg latency, difference shows when iodepth grow:
depth and avg latency(usec):
depth new old
1 7.05 7.10
2 8.47 8.60
4 10.42 10.42
8 13.78 13.22
16 27.41 24.33
32 49.40 53.08
64 102.53 103.36
128 196.98 205.61
256 372.99 414.88
512 747.23 791.30
1024 1472.59 1538.72
2048 3153.49 3329.01
4096 6387.86 6682.54
8192 12150.25 12774.14
16384 23085.58 26044.71
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20211207093951.247840-3-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
add a helper to merge two wq_lists, it will be useful in the next
patches.
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211207093951.247840-2-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch introduces a kmem cache for dlm_msg handles which are used
always if dlm sends a message out. Even if their are covered by midcomms
layer or not.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch introduces a kmem cache for writequeue entry. A writequeue
entry get quite a lot allocated if dlm transmit messages.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch will introduce a kmem cache for allocating message handles
which are needed for midcomms layer to take track of lowcomms messages.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch cleanups the code for allocating a new buffer in the dlm
writequeue mechanism. There was a possible tuneup to allow scheduling
while a new writequeue entry needs to be allocated because either no
sending page is available or are full. To avoid multiple concurrent
users checking at the same time if an entry is available or full
alloc_wq was introduce that those are waiting if there is currently a
new writequeue entry in process to be queued so possible further users
will check on the new allocated writequeue entry if it's full.
To simplify the code we just remove this mutex and switch that the
already introduced spin lock will be held during writequeue check,
allocation and queueing. So other users can never check on available
writequeues while there is a new one in process but not queued yet.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch will use an event based waitqueue to wait for a possible clash
with the ls_remove_name field of dlm_ls instead of doing busy waiting.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Currently we don't care if the DLM application stack is filling buffers
(not committed yet) while we transmit some already committed buffers.
By checking on active writequeue users before dequeue a writequeue entry
we know there is coming more data and do nothing. We wait until the send
worker will be triggered again if the writequeue entry users hit zero.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch will use list_empty(&ls->ls_cb_delay) to check for last list
iteration. In case of a multiply count of MAX_CB_QUEUE and the list is
empty we do a extra goto more which we can avoid by checking on
list_empty().
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose. Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:
mount <filesystem> # (0)
fsstress <run only readonly ops> & # (1)
while true; do
fsstress <run all ops>
mount -o remount,ro # (2)
fsstress <run only readonly ops>
mount -o remount,rw # (3)
done
When (2) happens, notice that (1) is still running. xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).
When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata. If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.
The results are catastrophic -- file (B) and the refcount btree are now
corrupt. Solve this race by forcing the xfs_blockgc_free_space to run
synchronously, which causes xfs_icwalk to return to inodes that were
skipped because the blockgc code couldn't take the IOLOCK. This is safe
to do here because the VFS has already prohibited new writer threads.
Fixes: 10ddf64e42 ("xfs: remove leftover CoW reservations when remounting ro")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
The order of these two parameters is just reversed. gcc didn't warn on
that, probably because 'void *' can be converted from or to other
pointer types without warning.
Cc: stable@vger.kernel.org
Fixes: 3d3c950467 ("netfs: Provide readahead and readpage netfs helpers")
Fixes: e1b1240c1f ("netfs: Add write_begin helper")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Link: https://lore.kernel.org/r/20211207031449.100510-1-jefflexu@linux.alibaba.com/ # v1
The acceptable maximum value of lend parameter in
filemap_write_and_wait_range() is LLONG_MAX rather than -1. And there is
also some logic depending on LLONG_MAX check in write_cache_pages(). So
let's pass LLONG_MAX to filemap_write_and_wait_range() in
fuse_writeback_range() instead.
Fixes: 59bda8ecee ("fuse: flush extending writes")
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Taking sb_writers whilst holding mmap_lock isn't allowed and will result in
a lockdep warning like that below. The problem comes from cachefiles
needing to take the sb_writers lock in order to do a write to the cache,
but being asked to do this by netfslib called from readpage, readahead or
write_begin[1].
Fix this by always offloading the write to the cache off to a worker
thread. The main thread doesn't need to wait for it, so deadlock can be
avoided.
This can be tested by running the quick xfstests on something like afs or
ceph with lockdep enabled.
WARNING: possible circular locking dependency detected
5.15.0-rc1-build2+ #292 Not tainted
------------------------------------------------------
holetest/65517 is trying to acquire lock:
ffff88810c81d730 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_fault+0x276/0x7a5
but task is already holding lock:
ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&mm->mmap_lock#2){++++}-{3:3}:
validate_chain+0x3c4/0x4a8
__lock_acquire+0x89d/0x949
lock_acquire+0x2dc/0x34b
__might_fault+0x87/0xb1
strncpy_from_user+0x25/0x18c
removexattr+0x7c/0xe5
__do_sys_fremovexattr+0x73/0x96
do_syscall_64+0x67/0x7a
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #1 (sb_writers#10){.+.+}-{0:0}:
validate_chain+0x3c4/0x4a8
__lock_acquire+0x89d/0x949
lock_acquire+0x2dc/0x34b
cachefiles_write+0x2b3/0x4bb
netfs_rreq_do_write_to_cache+0x3b5/0x432
netfs_readpage+0x2de/0x39d
filemap_read_page+0x51/0x94
filemap_get_pages+0x26f/0x413
filemap_read+0x182/0x427
new_sync_read+0xf0/0x161
vfs_read+0x118/0x16e
ksys_read+0xb8/0x12e
do_syscall_64+0x67/0x7a
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #0 (mapping.invalidate_lock#3){.+.+}-{3:3}:
check_noncircular+0xe4/0x129
check_prev_add+0x16b/0x3a4
validate_chain+0x3c4/0x4a8
__lock_acquire+0x89d/0x949
lock_acquire+0x2dc/0x34b
down_read+0x40/0x4a
filemap_fault+0x276/0x7a5
__do_fault+0x96/0xbf
do_fault+0x262/0x35a
__handle_mm_fault+0x171/0x1b5
handle_mm_fault+0x12a/0x233
do_user_addr_fault+0x3d2/0x59c
exc_page_fault+0x85/0xa5
asm_exc_page_fault+0x1e/0x30
other info that might help us debug this:
Chain exists of:
mapping.invalidate_lock#3 --> sb_writers#10 --> &mm->mmap_lock#2
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&mm->mmap_lock#2);
lock(sb_writers#10);
lock(&mm->mmap_lock#2);
lock(mapping.invalidate_lock#3);
*** DEADLOCK ***
1 lock held by holetest/65517:
#0: ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c
stack backtrace:
CPU: 0 PID: 65517 Comm: holetest Not tainted 5.15.0-rc1-build2+ #292
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
dump_stack_lvl+0x45/0x59
check_noncircular+0xe4/0x129
? print_circular_bug+0x207/0x207
? validate_chain+0x461/0x4a8
? add_chain_block+0x88/0xd9
? hlist_add_head_rcu+0x49/0x53
check_prev_add+0x16b/0x3a4
validate_chain+0x3c4/0x4a8
? check_prev_add+0x3a4/0x3a4
? mark_lock+0xa5/0x1c6
__lock_acquire+0x89d/0x949
lock_acquire+0x2dc/0x34b
? filemap_fault+0x276/0x7a5
? rcu_read_unlock+0x59/0x59
? add_to_page_cache_lru+0x13c/0x13c
? lock_is_held_type+0x7b/0xd3
down_read+0x40/0x4a
? filemap_fault+0x276/0x7a5
filemap_fault+0x276/0x7a5
? pagecache_get_page+0x2dd/0x2dd
? __lock_acquire+0x8bc/0x949
? pte_offset_kernel.isra.0+0x6d/0xc3
__do_fault+0x96/0xbf
? do_fault+0x124/0x35a
do_fault+0x262/0x35a
? handle_pte_fault+0x1c1/0x20d
__handle_mm_fault+0x171/0x1b5
? handle_pte_fault+0x20d/0x20d
? __lock_release+0x151/0x254
? mark_held_locks+0x1f/0x78
? rcu_read_unlock+0x3a/0x59
handle_mm_fault+0x12a/0x233
do_user_addr_fault+0x3d2/0x59c
? pgtable_bad+0x70/0x70
? rcu_read_lock_bh_held+0xab/0xab
exc_page_fault+0x85/0xa5
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x40192f
Code: ff 48 89 c3 48 8b 05 50 28 00 00 48 85 ed 7e 23 31 d2 4b 8d 0c 2f eb 0a 0f 1f 00 48 8b 05 39 28 00 00 48 0f af c2 48 83 c2 01 <48> 89 1c 01 48 39 d5 7f e8 8b 0d f2 27 00 00 31 c0 85 c9 74 0e 8b
RSP: 002b:00007f9931867eb0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 00007f9931868700 RCX: 00007f993206ac00
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00007ffc13e06ee0
RBP: 0000000000000100 R08: 0000000000000000 R09: 00007f9931868700
R10: 00007f99318689d0 R11: 0000000000000202 R12: 00007ffc13e06ee0
R13: 0000000000000c00 R14: 00007ffc13e06e00 R15: 00007f993206a000
Fixes: 726218fdc2 ("netfs: Define an interface to talk to a cache")
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
cc: Jan Kara <jack@suse.cz>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20210922110420.GA21576@quack2.suse.cz/ [1]
Link: https://lore.kernel.org/r/163887597541.1596626.2668163316598972956.stgit@warthog.procyon.org.uk/ # v1
There's a small race here where the task_work could finish and drop
the worker itself, so that by the time that task_work_add() returns
with a successful addition we've already put the worker.
The worker callbacks clear this bit themselves, so we don't actually
need to manually clear it in the caller. Get rid of it.
Reported-by: syzbot+b60c982cb0efc5e05a47@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When iopolling the userspace specifies the minimum number of "events" it
expects. Previously, we had one CQE per request, so the definition of
an "event" was unequivocal, but that's not more the case anymore with
REQ_F_CQE_SKIP.
Currently it counts the number of completed requests, replace it with
the number of posted CQEs. This allows users of the "one CQE per link"
scheme to wait for all N links in a single syscall, which is not
possible without the patch and requires extra context switches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d5a965c4d2249827392037bbd0186f87fea49c55.1638714983.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In previous patches we added new and modified existing helpers to handle
idmapped mounts of filesystems mounted with an idmapping. In this final
patch we convert all relevant places in the vfs to actually pass the
filesystem's idmapping into these helpers.
With this the vfs is in shape to handle idmapped mounts of filesystems
mounted with an idmapping. Note that this is just the generic
infrastructure. Actually adding support for idmapped mounts to a
filesystem mountable with an idmapping is follow-up work.
In this patch we extend the definition of an idmapped mount from a mount
that that has the initial idmapping attached to it to a mount that has
an idmapping attached to it which is not the same as the idmapping the
filesystem was mounted with.
As before we do not allow the initial idmapping to be attached to a
mount. In addition this patch prevents that the idmapping the filesystem
was mounted with can be attached to a mount created based on this
filesystem.
This has multiple reasons and advantages. First, attaching the initial
idmapping or the filesystem's idmapping doesn't make much sense as in
both cases the values of the i_{g,u}id and other places where k{g,u}ids
are used do not change. Second, a user that really wants to do this for
whatever reason can just create a separate dedicated identical idmapping
to attach to the mount. Third, we can continue to use the initial
idmapping as an indicator that a mount is not idmapped allowing us to
continue to keep passing the initial idmapping into the mapping helpers
to tell them that something isn't an idmapped mount even if the
filesystem is mounted with an idmapping.
Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Enable the mapped_fs{g,u}id() helpers to support filesystems mounted
with an idmapping. Apart from core mapping helpers that use
mapped_fs{g,u}id() to initialize struct inode's i_{g,u}id fields xfs is
the only place that uses these low-level helpers directly.
The patch only extends the helpers to be able to take the filesystem
idmapping into account. Since we don't actually yet pass the
filesystem's idmapping in no functional changes happen. This will happen
in a final patch.
Link: https://lore.kernel.org/r/20211123114227.3124056-9-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-9-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmGr0HsACgkQiiy9cAdy
T1FXmgv+LxMKh3FtGdLyzeJ9OJeUP87WtHCn8Tek5b0BUYxBrC12lzdLuYl8+zQr
O09RcJ8THZwWYKqTDlkaQg4NnNeMEVeouottGIEQPuBztLVWNmvXD6g5c1YPhYMI
0AfeUbkCQ9Gb+zoAJtXTLMCkqYj/x+jVrgO56V12gjIfZCB1XRPhnhJOao2BRo88
go35pSn8l3lv8dLeQxUCBZ73+z/R8Kwk/8NgkFPT/N7vQwuSK7Ax27h8pyV0m8Zx
lziNBiVM0HRTB5WREJsMWVD8NRcm8e+Xf7bFjhR0FbT7U3cSy4kLOPTd9RDKxxkj
sGI56r1OoRCQFrv456kz6SPLLYx2ecP9WzDp6ghOvGENJ3UYFoUcv0L9+aQL1gyy
Gxn+bTng6TMHxjpzhRgS4pJg3TME2US4HQepsRrrNZ0FXM67Gt4pFwoxtA/ZLy4X
WyuROJ1Xy46G7Yn0glTPWGJ3yDDud88etLdF3RbH+wWtB88WvBtIjyPQ0QCEOFpU
0UMD4ow9
=hChf
-----END PGP SIGNATURE-----
Merge tag '5.16-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three SMB3 multichannel/fscache fixes and a DFS fix.
In testing multichannel reconnect scenarios recently various problems
with the cifs.ko implementation of fscache were found (e.g. incorrect
initialization of fscache cookies in some cases)"
* tag '5.16-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: avoid use of dstaddr as key for fscache client cookie
cifs: add server conn_id to fscache client cookie
cifs: wait for tcon resource_id before getting fscache super
cifs: fix missed refcounting of ipc tcon
The description of gfs2_instantiate accidentally lists a glock argument,
but the function takes a glock holder.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function rgrp_go_inval calls gfs2_rgrp_brelse to invalidate the
in-core rgrp structures. After the call it set GLF_INSTANTIATE_NEEDED,
which is redundant, since gfs2_rgrp_brelse also sets it.
This patch simply removes the redundant set_bit.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The function name in the kernel-doc comment wasn't updated when the
function was renamed.
Fixes: b016d9a84a ("gfs2: Save ip from gfs2_glock_nq_init")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Implement 'struct iomap_ops' for f2fs, in preparation for making f2fs
use iomap for direct I/O.
Note that this may be used for other things besides direct I/O in the
future; however, for now I've only tested it for direct I/O.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pass in the original position and count rather than the position and
count that were updated by the write. Also use the correct types for
all arguments, in particular the file offset which was being truncated
to 32 bits on 32-bit platforms.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
DIO preallocates physical blocks before writing data, but if an error occurrs
or power-cut happens, we can see block contents from the disk. This patch tries
to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
unwritten blocks from error or power-cut.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The file system DAX code now does not require the block code. So allow
building a kernel with fuse DAX but not block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-30-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Only build the block based iomap code if CONFIG_BLOCK is set. Currently
that is always the case, but it will change soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-29-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Remove the last user of ->bdev in dax.c by requiring the file system to
pass in an address that already includes the DAX offset. As part of the
only set ->bdev or ->daxdev when actually required in the ->iomap_begin
methods.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-27-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Prepare for the removal of the block_device from the DAX I/O path by
returning the partition offset from fs_dax_get_by_bdev so that the file
systems have it at hand for use during I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-26-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add a flag so that the file system can easily detect DAX operations
based just on the iomap operation requested instead of looking at
inode state using IS_DAX. This will be needed to apply the to be
added partition offset only for operations that actually use DAX,
but not things like fiemap that are based on the block device.
In the long run it should also allow turning the bdev, dax_dev
and inline_data into a union.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-25-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
To prepare for looking at the IOMAP_DAX flag in xfs_bmbt_to_iomap pass in
the input mapping flags to xfs_bmbt_to_iomap.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-24-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
While the buffered write iomap ops do work due to the fact that zeroing
never allocates blocks, the DAX zeroing should use the direct ops just
like actual DAX I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-23-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Hide the DAX device lookup from the xfs_super.c code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211129102203.2243509-22-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Only call fs_dax_get_by_bdev once the sbi has been allocated and remove
the need for the dax_dev local variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-21-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Only call fs_dax_get_by_bdev once the sbi has been allocated and remove
the need for the dax_dev local variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-20-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Unshare the DAX and iomap buffered I/O page zeroing code. This code
previously did a IS_DAX check deep inside the iomap code, which in
fact was the only DAX check in the code. Instead move these checks
into the callers. Most callers already have DAX special casing anyway
and XFS will need it for reflink support as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-19-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Factor out a helper for the "manual" zeroing of a DAX range to clean
up dax_iomap_zero a lot.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-18-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The file relative offset must have the same alignment as the storage
offset, so use that and get rid of the call to iomap_sector.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-17-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add helpers to prepare for using different DAX operations.
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
[hch: split from a larger patch + slight cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-16-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Replace the two steps of dax_iomap_sector and bdev_dax_pgoff with a
single dax_iomap_pgoff helper that avoids lots of cumbersome sector
conversions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-15-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Just pass the vm_fault and iomap_iter structures, and figure out the rest
locally. Note that this requires moving dax_iomap_sector up in the file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-14-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Despite its name copy_user_page expected kernel addresses, which is what
we already have.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-13-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Just open code the block size and dax_dev == NULL checks in the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs]
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-9-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Factor out another DAX setup helper to simplify future changes. Also
move the experimental warning after the checks to not clutter the log
too much if the setup failed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-8-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Replace the dax_host_hash with an xarray indexed by the pointer value
of the gendisk, and require explicitly calls from the block drivers that
want to associate their gendisk with a dax_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-5-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
CONFIG_DAX_DRIVER only selects CONFIG_DAX now, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211129102203.2243509-4-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGqehoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo+jD/4kPAd6kkKBnYWxBhnjZUuL5pio7G9wVM5a
OFDwJ8gh4blp+eqSFf7RnsTiP2xSIjOykBtCSl35jiUql1KPc0SN/rE9I+mxuTfj
k1nW+hTaFeQXUL3Pz8oP+vwfWhfCmgPPH5hrNd9n6vmNas1JH29ZOK5Z2AQU7OLl
h1a6Kwm7Jqmi346ZSOfdr4xCMVzXrUpZeN2AFEID9/rvpznKiENoMdO8dsDypgjc
bRLQIP3X10qSu1KglRJo1+zGFDB/xKOPZ4EdoD0xNFHhUJQpe1YE0wVCliheCmcq
kLVCPBeqh5UTW+orBfqVEI2UQlhhaqLW1Yq1zRMm35aS8tMQIaTfWDdvCO+Iq2Le
0VLEWCLohP17nRxlJETxxzOPfCw/vE/ZxygLD5TRgZGdNB1+A1PV4s565fhvFE8t
yRMhj4hYAY2XTBs44erJHJqBabt2VXLNw4A69DpdlDex+65lyNqGDez6x3VEKfZl
/4ohF3XCAeklZxegK/06EQdgwqx7bovSEnAIFZs4jrdcTqZmubWoIyLLFOG8jC68
ey50VOYZJhET+vb0KUn+zK2shcHJu5qSGH73ARHJ6pJpRsdEAdLaigTCSOKpUayA
Ml/jg2JIGIeUAsfsA5mhM6kTJ+ymc1fM7CQUkH1SSXB+qOUbsML4FKhkY7WL7vH2
IJ3zGtTNLw==
=Xy6S
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.16-2021-12-03' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a single fix preventing repeated retries of task_work based io-wq
thread creation, fixing a regression from when io-wq was made more (a
bit too much) resilient against signals"
* tag 'io_uring-5.16-2021-12-03' of git://git.kernel.dk/linux-block:
io-wq: don't retry task_work creation failure on fatal conditions
* Since commit 486408d690 ("gfs2: Cancel remote delete work
asynchronously"), inode create and lookup-by-number can overlap more
easily and we can end up with temporary duplicate inodes. Fix the
code to prevent that.
* Fix a BUG demoting weak glock holders from a remote node.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGrPtYUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTrtgA/+MESXTK8h/HK+M+CAnsJ5jFSVXV8W
wmhhOtqikNGGruxnTJdkipe9PcPRsdwh4XLtmbfGairXnmFqJ+oEG8oRk/Tqoap8
9RxWN+B+Wzh79mNvI3Il3RtvUJGNQGrfDSuT5F4jZ6eZSfjsyY3cgO2KMt2TN5sL
7WXwYnj4VyPa8TdkSFIhfxwiXyal7aC8BVS6UX83CUIrhJvNG70A3hqVrIuKymUf
dNiLWURdQ2oiDnX7hMjq/RnX84zAcp2B2wfKZVsX0ZcIqVdmHS5Er4PKqffEVOzq
nf2SU4pAo4q7L7/66TdUtGoRgUSK74O83VTqsb1Lex5roON7unmVsjKyc7dmKUrR
xt2Is4/OsCSZEsErWdMgtqer3fOxG1oGdSHzmgdRwqA22+Im5COTKjOBJ+3MXHaW
E/36FGYYiugOLQn8hrqTqp3MIdPoZe8B/4ZF82qu6qYBQiA3PIHMlfElym5+45oR
fqWLcwids4LV5esHK3YzZdQ3f9+FVIvcBayUUx71ZCD0QkNDUzTnckb5vTxwF3r1
zU/jIsv84dvQiTsCMhTxQq+JDY98V544sapa+0k7chwtwpTdeajTHjocECGRKE1E
o5vCWX6gAiJQ/5RtgfBS2DWXrBQNueUsYpBQM9jp2fr6kxK+Hc5MT8JMdZWw7Rb7
1ZDw9PNs8TpByzc=
=7HvB
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.16-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
- Since commit 486408d690 ("gfs2: Cancel remote delete work
asynchronously"), inode create and lookup-by-number can overlap more
easily and we can end up with temporary duplicate inodes. Fix the
code to prevent that.
- Fix a BUG demoting weak glock holders from a remote node.
* tag 'gfs2-v5.16-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: gfs2_create_inode rework
gfs2: gfs2_inode_lookup rework
gfs2: gfs2_inode_lookup cleanup
gfs2: Fix remote demote of weak glock holders
No functional changes in this patch, just in preparation for efficiently
calling this light function from the block O_DIRECT handling.
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
server->dstaddr can change when the DNS mapping for the
server hostname changes. But conn_id is a u64 counter
that is incremented each time a new TCP connection
is setup. So use only that as a key.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
The fscache client cookie uses the server address
(and port) as the cookie key. This is a problem when
nosharesock is used. Two different connections will
use duplicate cookies. Avoid this by adding
server->conn_id to the key, so that it's guaranteed
that cookie will not be duplicated.
Also, for secondary channels of a session, copy the
fscache pointer from the primary channel. The primary
channel is guaranteed not to go away as long as secondary
channels are in use. Also addresses minor problem found
by kernel test robot.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
The logic for initializing tcon->resource_id is done inside
cifs_root_iget. fscache super cookie relies on this for aux
data. So we need to push the fscache initialization to this
later point during mount.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix missed refcounting of IPC tcon used for getting domain-based DFS
root referrals. We want to keep it alive as long as mount is active
and can be refreshed. For standalone DFS root referrals it wouldn't
be a problem as the client ends up having an IPC tcon for both mount
and cache.
Fixes: c88f7dcd6d ("cifs: support nested dfs links over reconnect")
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Jann Horn points out that there is another possible race wrt Unix domain
socket garbage collection, somewhat reminiscent of the one fixed in
commit cbcf01128d ("af_unix: fix garbage collect vs MSG_PEEK").
See the extended comment about the garbage collection requirements added
to unix_peek_fds() by that commit for details.
The race comes from how we can locklessly look up a file descriptor just
as it is in the process of being closed, and with the right artificial
timing (Jann added a few strategic 'mdelay(500)' calls to do that), the
Unix domain socket garbage collector could see the reference count
decrement of the close() happen before fget() took its reference to the
file and the file was attached onto a new file descriptor.
This is all (intentionally) correct on the 'struct file *' side, with
RCU lookups and lockless reference counting very much part of the
design. Getting that reference count out of order isn't a problem per
se.
But the garbage collector can get confused by seeing this situation of
having seen a file not having any remaining external references and then
seeing it being attached to an fd.
In commit cbcf01128d ("af_unix: fix garbage collect vs MSG_PEEK") the
fix was to serialize the file descriptor install with the garbage
collector by taking and releasing the unix_gc_lock.
That's not really an option here, but since this all happens when we are
in the process of looking up a file descriptor, we can instead simply
just re-check that the file hasn't been closed in the meantime, and just
re-do the lookup if we raced with a concurrent close() of the same file
descriptor.
Reported-and-tested-by: Jann Horn <jannh@google.com>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The low-level mapping helpers were so far crammed into fs.h. They are
out of place there. The fs.h header should just contain the higher-level
mapping helpers that interact directly with vfs objects such as struct
super_block or struct inode and not the bare mapping helpers. Similarly,
only vfs and specific fs code shall interact with low-level mapping
helpers. And so they won't be made accessible automatically through
regular {g,u}id helpers.
Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Multiple places open-code the same check to determine whether a given
mount is idmapped. Introduce a simple helper function that can be used
instead. This allows us to get rid of the fragile open-coding. We will
later change the check that is used to determine whether a given mount
is idmapped. Introducing a helper allows us to do this in a single
place instead of doing it for multiple places.
Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Marek reported the warning below.
=========================
WARNING: held lock freed!
5.16.0-rc2+ #10984 Not tainted
-------------------------
kworker/1:0/18 is freeing memory ffff00004034e200-ffff00004034e3ff,
with a lock still held there!
ffff00004034e348 (&root->kernfs_rwsem){++++}-{3:3}, at:
__kernfs_remove+0x310/0x37c
3 locks held by kworker/1:0/18:
#0: ffff000040107938 ((wq_completion)cgroup_destroy){+.+.}-{0:0}, at:
process_one_work+0x1f0/0x6f0
#1: ffff80000b55bdc0
((work_completion)(&(&css->destroy_rwork)->work)){+.+.}-{0:0}, at:
process_one_work+0x1f0/0x6f0
#2: ffff00004034e348 (&root->kernfs_rwsem){++++}-{3:3}, at:
__kernfs_remove+0x310/0x37c
stack backtrace:
CPU: 1 PID: 18 Comm: kworker/1:0 Not tainted 5.16.0-rc2+ #10984
Hardware name: Raspberry Pi 4 Model B (DT)
Workqueue: cgroup_destroy css_free_rwork_fn
Call trace:
dump_backtrace+0x0/0x1ac
show_stack+0x18/0x24
dump_stack_lvl+0x8c/0xb8
dump_stack+0x18/0x34
debug_check_no_locks_freed+0x124/0x140
kfree+0xf0/0x3a4
kernfs_put+0x1f8/0x224
__kernfs_remove+0x1b8/0x37c
kernfs_destroy_root+0x38/0x50
css_free_rwork_fn+0x288/0x3d4
process_one_work+0x288/0x6f0
worker_thread+0x74/0x470
kthread+0x188/0x194
ret_from_fork+0x10/0x20
Since kernfs moves the kernfs_rwsem lock into root, it couldn't hold
the lock when the root node is tearing down. Thus, get the refcount
of root node.
Fixes: 393c371408 ("kernfs: switch global kernfs_rwsem lock to per-fs lock")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20211201231648.1027165-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We don't want to be retrying task_work creation failure if there's
an actual signal pending for the parent task. If we do, then we can
enter an infinite loop of perpetually retrying and each retry failing
with -ERESTARTNOINTR because a signal is pending.
Fixes: 3146cba99a ("io-wq: make worker creation resilient against signals")
Reported-by: Florian Fischer <florian.fl.fischer@fau.de>
Link: https://lore.kernel.org/io-uring/20211202165606.mqryio4yzubl7ms5@pasture/
Tested-by: Florian Fischer <florian.fl.fischer@fau.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When gfs2_lookup_by_inum() calls gfs2_inode_lookup() for an uncached
inode, gfs2_inode_lookup() will place a new tentative inode into the
inode cache before verifying that there is a valid inode at the given
address. This can race with gfs2_create_inode() which doesn't check for
duplicates inodes. gfs2_create_inode() will try to assign the new inode
to the corresponding inode glock, and glock_set_object() will complain
that the glock is still in use by gfs2_inode_lookup's tentative inode.
We noticed this bug after adding commit 486408d690 ("gfs2: Cancel
remote delete work asynchronously") which allowed delete_work_func() to
race with gfs2_create_inode(), but the same race exists for
open-by-handle.
Fix that by switching from insert_inode_hash() to
insert_inode_locked4(), which does check for duplicate inodes. We know
we've just managed to to allocate the new inode, so an inode tentatively
created by gfs2_inode_lookup() will eventually go away and
insert_inode_locked4() will always succeed.
In addition, don't flush the inode glock work anymore (this can now only
make things worse) and clean up glock_{set,clear}_object for the inode
glock somewhat.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rework gfs2_inode_lookup() to only set up the new inode's glocks after
verifying that the new inode is valid.
There is no need for flushing the inode glock work queue anymore now,
so remove that as well.
While at it, get rid of the useless wrapper around iget5_locked() and
its unnecessary is_bad_inode() check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_inode_lookup, once the inode has been looked up, we check if the
inode generation (no_formal_ino) is the one we're looking for. If it
isn't and the inode wasn't in the inode cache, we discard the newly
looked up inode. This is unnecessary, complicates the code, and makes
future changes to gfs2_inode_lookup harder, so change the code to retain
newly looked up inodes instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When we mock up a temporary holder in gfs2_glock_cb to demote weak holders in
response to a remote locking conflict, we don't set the HIF_HOLDER flag. This
causes function may_grant to BUG. Fix by setting the missing HIF_HOLDER flag
in the mock glock holder.
In addition, define the mock glock holder where it is used.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This ASSERT in xfs_rename is a) incorrect, because
(RENAME_WHITEOUT|RENAME_NOREPLACE) is a valid combination, and
b) unnecessary, because actual invalid flag combinations are already
handled at the vfs level in do_renameat2() before we get called.
So, remove it.
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Ceph always inherits the SGID bit if it is set on the parent inode,
while the generic inode_init_owner does not do this in a few cases where
it can create a possible security problem (cf. [1]).
Update ceph to strip the SGID bit just as inode_init_owner would.
This bug was detected by the mapped mount testsuite in [3]. The
testsuite tests all core VFS functionality and semantics with and
without mapped mounts. That is to say it functions as a generic VFS
testsuite in addition to a mapped mount testsuite. While working on
mapped mount support for ceph, SIGD inheritance was the only failing
test for ceph after the port.
The same bug was detected by the mapped mount testsuite in XFS in
January 2021 (cf. [2]).
[1]: commit 0fa3ecd878 ("Fix up non-directory creation in SGID directories")
[2]: commit 01ea173e10 ("xfs: fix up non-directory creation in SGID directories")
[3]: https://git.kernel.org/fs/xfs/xfstests-dev.git
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The smatch static checker warned about an uninitialized symbol usage in
this function, in the case where ceph_mdsc_build_path returns an error.
It turns out that that case is harmless, but it just looks sketchy.
Initialize the variable at declaration time, and remove the unneeded
setting of it later.
Fixes: a33f6432b3 ("ceph: encode inodes' parent/d_name in cap reconnect message")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Newer compilers seem to determine that this variable being uninitialized
isn't a problem, but older compilers (from the RHEL8 era) seem to choke
on it and complain that it could be used uninitialized.
Go ahead and initialize the variable at declaration time to silence
potential compiler warnings.
Fixes: c3d8e0b5de ("ceph: return the real size read when it hits EOF")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
opened_inodes is incremented twice when the same inode is opened twice
with O_RDONLY and O_WRONLY respectively.
To reproduce, run this python script, then check the metrics:
import os
for _ in range(10000):
fd_r = os.open('a', os.O_RDONLY)
fd_w = os.open('a', os.O_WRONLY)
os.close(fd_r)
os.close(fd_w)
Fixes: 1dd8d47081 ("ceph: metrics for opened files, pinned caps and opened inodes")
Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Renaming lz4_0padding to zero_padding globally since LZMA and later
algorithms also need that.
Link: https://lore.kernel.org/r/20211112160935.19394-1-jnhuang95@gmail.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmGiaB4ACgkQiiy9cAdy
T1H4PAv+OG94BZe+MyMdXgoRbOLiEeye7dm/TZnlpdtV96WmyrMkA4/1nyOc8k7F
bNRn3ocuH3YCjjUB2kU6MZW5Kh9aSxULYgEC0zxuPcA9q4Ig4FehZD2U3r6fztFM
U5Do9n1xRXjIgS/gI7DfZHSQ+SYVwBMAzRB3opplcs1CW68N7+WmKa5CQ6wH5vZM
GqhU4+yFLhZGiXdxXTFF/bLKWer2PF9p5J4mLzRH0ugTG26tY7WnJcQIj+XrNTrN
ccBSXr+9fHFFa9iGcLY08pghk8s1F6dXl/BLv7DFswKOoje7HsDLcWpNSVtVDojO
0Fg5vVtuDKapCOrpmtPt8Khc4qgkxY6VpJyELMCwJuxrSzTE59C54gQcgJfZO97d
bpGeV9L6D6VTLTe1LhUFCRQbc0FFO3daPWRrxrZnvtJeef70xVSZPtrQqNtpnAtM
RwkN2Rf/Enl03w9U25nv3ymlKoHERiR2XZADLfWc5XdB40Lxc+fDccyoVP1wmxT8
uicPpySr
=bEt3
-----END PGP SIGNATURE-----
Merge tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd fixes from Steve French:
"Five ksmbd server fixes, four of them for stable:
- memleak fix
- fix for default data stream on filesystems that don't support xattr
- error logging fix
- session setup fix
- minor doc cleanup"
* tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix memleak in get_file_stream_info()
ksmbd: contain default data stream even if xattr is empty
ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec()
docs: filesystem: cifs: ksmbd: Fix small layout issues
ksmbd: Fix an error handling path in 'smb2_sess_setup()'
NTFS_RW code allocates page size dependent arrays on the stack. This
results in build failures if the page size is 64k or larger.
fs/ntfs/aops.c: In function 'ntfs_write_mst_block':
fs/ntfs/aops.c:1311:1: error:
the frame size of 2240 bytes is larger than 2048 bytes
Since commit f22969a660 ("powerpc/64s: Default to 64K pages for 64 bit
book3s") this affects ppc:allmodconfig builds, but other architectures
supporting page sizes of 64k or larger are also affected.
Increasing the maximum frame size for affected architectures just to
silence this error does not really help. The frame size would have to
be set to a really large value for 256k pages. Also, a large frame size
could potentially result in stack overruns in this code and elsewhere
and is therefore not desirable. Make NTFS_RW dependent on page sizes
smaller than 64k instead.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix buffer resource leak that could lead to livelock on corrupt fs.
- Remove unused function xfs_inew_wait to shut up the build robots.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmGef+UACgkQ+H93GTRK
tOtAQQ/9EAzGgADQR2dcCXoZwRP3LKz5WGops0qCqywvH/BfbLmKmqzgfUXaf026
dMmrnP9+d6BFytoLk8IXmpydML8qK2/k8lmwJRUG8arPRbHQwVSckDM4vXVrI2X2
K4f8nu1CBR8MDavVS7cR8CZWO3XJMLKTZtxCTdOQlRAw+m9P1+S00LWkiuDTPTTX
YRAGkYEVCtCQLuqJgClf267/5+MaZXRJfFVh1hBQkUtPFXhu1LXfJqZ/thgacmlD
D6/tfwt6Ad7iKg4LjtoJC6zkvoTFN7rB39PGPGILWgS3Nimp0xWgKoTnJ6VLRblY
FwItA6zERXQoHRse7eGMfQ4ZnGT40pvIiN+JVZ4jju4hElY7dkigBfJv8oLkVm3D
xV2dA7YO4DcYS4UAifZ+C00T3pYo/rQnsIwfAGsbh28Xlshyi4+cEqGRTkqrJHnx
gkE4uzp0acs6HVTMC3S0pL6oTirPbHAQtt5tjS8ZtxALlYqF3+Y8xzQhMyB7Fo0b
is213My5aP6VSr2UZajxpkIOl+5OvQ4v37fhmBXMGKmiz7XCzatyvtUjEBjahFfF
FJeug4hRDom/uCKPM3eHfuZxGorlIB1GIeSe7+0ZIXcn+Wuo5seWfJrNP+1H/Asb
Vp6TNueGiCKLBD/+MLWp65zyIZu+b2gNEyYCuhwnOf2sQflhZmI=
=Fa6X
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Fixes for a resource leak and a build robot complaint about totally
dead code:
- Fix buffer resource leak that could lead to livelock on corrupt fs.
- Remove unused function xfs_inew_wait to shut up the build robots"
* tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove xfs_inew_wait
xfs: Fix the free logic of state in xfs_attr_node_hasname
- Fix an accounting problem where unaligned inline data reads can run
off the end of the read iomap iterator. iomap has historically
required that inline data mappings only exist at the end of a file,
though this wasn't documented anywhere.
- Document iomap_read_inline_data and change its return type to be
appropriate for the information that it's actually returning.
-----BEGIN PGP SIGNATURE-----
iQIyBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmGehZoACgkQ+H93GTRK
tOu7ig/44n/PzNzFv0dLAadoanBH/d3uMDs/s9DBWw6s7RYU0wjUMHgGyla9vFgT
aw0xxLZfppvk59Gkme5WPiv/ksRwB9ZcWvEFUOMX/zt55uSNueCXCVpckduf9j29
gKUzFvRGVssBQ2ACHvuH/s6c6hF9EOhCHadREHinqemU3zg/+eH/+L+dgHIitzMg
WiGdWEaojQX8brxD4kH7xfsNUbuFwCNO5UbGtndzsK/5b/8QGXIXOUrJ0JbefoBf
Kscz4opZjfJIuGjczhIhollgV0jihMOH3OIJYfHQHVOUGVwQ/2epHo2cwq6Wujk0
3qXsjuYloR5xkyQwoLfr382BBO4teQW75nNUt+ez4tvwYs7Ck3U1oOFG+j3KiU/P
gsMcSPzgQIiFdU1DRR5r6li6daLJJWK34PeZ4DtE0zFUKwslSUKytv+pT99yNPTG
xkhvdU6R4jchUOPJCZCh8zdARhofTiaxrLPlZ0xKenqxlLxdMm+W0fCkuGFrLHSq
g39CGJhuPe7OexK8lBY9fQ08zwI7LJIx/vQGQ3hbqZaxq14uCZVr/nGwcXlzBhzl
BufZslO9Aj/eG1/MzwSN/xDTo3Jl+RuCs9lBgpg8pu6WNzueXbRMQnURy6Vf47Ua
U6Awq0OrGbippg4Y7P/IiyYDKjFWPjgftYRJhoosxxfbO00BSA==
=hIEG
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
"A single iomap bug fix and a cleanup for 5.16-rc2.
The bug fix changes how iomap deals with reading from an inline data
region -- whereas the current code (incorrectly) lets the iomap read
iter try for more bytes after reading the inline region (which zeroes
the rest of the page!) and hopes the next iteration terminates, we
surveyed the inlinedata implementations and realized that all
inlinedata implementations also require that the inlinedata region end
at EOF, so we can simply terminate the read.
The second patch documents these assumptions in the code so that
they're not subtle implications anymore, and cleans up some of the
grosser parts of that function.
Summary:
- Fix an accounting problem where unaligned inline data reads can run
off the end of the read iomap iterator. iomap has historically
required that inline data mappings only exist at the end of a file,
though this wasn't documented anywhere.
- Document iomap_read_inline_data and change its return type to be
appropriate for the information that it's actually returning"
* tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: iomap_read_inline_data cleanup
iomap: Fix inline extent handling in iomap_readpage
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGiPsAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpr6oD/9i1PrY0nhHbNdpEKC5ZB2TeX8inxvCQj5i
HmAm421s9umLjUXrRZQH6VMsSOWAS6viaXA9MdeWkxXSA9950Stx4Tr5LpujN/iJ
hZneHgX1V3kfGkIOWdstPE62QTKCoMDtBFx1Jk+XZfZ7N9ogWHlmvr3C7iD3QLId
+RODIIZlOZXYP5cYIUon9hK4ydMRjAh5jGik73ckN6gG+vsEWlsVZPlKLgqQ+AyT
CXkXHh5Ad5tQt1vqio+PH2Fk42Ce/+CeY0vhNS2ZUWDHwQEKSz8qIHz3kx+zaCCU
Y3e5CVe8MW64MOZIQPayEeNyposT0YNxTDgOMZxRi2J3IMeYFluJTm57zMNnnH/N
sgKZHZDItAfsgkptVkptoIEeg+2nus3GWmhAKVWu7zPaS8LfAVROt+OCx6+2CUvr
DHJwuGbkG3XR4vwO69ugTbjgRh97P3VJJfg4t6QZZNE9JDSAviHUrZDu9X53xHN5
hAsxg98QRz5BqvqPq99KIB6lq2zQ8LBHLZnnjEXhF3q031LpxCAhpVNBNt+LgQo2
MiGyx4lgznAl6P7/2uXaq5VxCLhWWHI5BtLdlGvJH3v8Uckci/JQKYoO65RNL4D8
gzdGnLQwJ80mDVlHcoN/9vUIYirru4+E+BiW6ua1/5MGmTAQvt6zw4Lo88MZMa5i
mKOxZxL30g==
=h6A3
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-block
Pull more io_uring fixes from Jens Axboe:
"The locking fixup that was applied earlier this rc has both a deadlock
and IRQ safety issue, let's get that ironed out before -rc3. This
contains:
- Link traversal locking fix (Pavel)
- Cancelation fix (Pavel)
- Relocate cond_resched() for huge buffer chain freeing, avoiding a
softlockup warning (Ye)
- Fix timespec validation (Ye)"
* tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-block:
io_uring: Fix undefined-behaviour in io_issue_sqe
io_uring: fix soft lockup when call __io_remove_buffers
io_uring: fix link traversal locking
io_uring: fail cancellation for EXITING tasks
Highlights include:
Stable fixes:
- NFSv42: Fix pagecache invalidation after COPY/CLONE
Bugfixes:
- NFSv42: Don't fail clone() just because the server failed to return
post-op attributes
- SUNRPC: use different lockdep keys for INET6 and LOCAL
- NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
- SUNRPC: fix header include guard in trace header
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGiTnMACgkQZwvnipYK
APJoJQ//VZYSCx/mGaTIj5oUwjBKE/n/9rz2EUGS1cfYjZjPpb5Xgm1tn+1A4c01
Ztu9/hKgwrDqknqkmtKvP1GsX5vYUqgfqAlc880Q2nXAqaLJBBZgB6BFMmTtcoQx
C24L0tgxlZVD9Vw0DJEVDVgDxXA/9VmdSQK6uptQRQhcYf4VtR1wAzELHWdkdkfq
1WrREeAwGWw1BaPTrlPn9XwW9qTaMlBH05XRHh6dM7gFmoIe3td7kq7BOnFxFsnA
AQZ/nCgeMTE04kQQMzYqUc4YqZvnzUHxueZ6q8s0K1RKJBpNIQNgdWUMa285Qo4a
JA9oBCPPo0JjmsEge2Km12zyBJoA7lLQDfc6UQJON50ADF0sTu3wszgGuC63KkhE
V+kUogK2WOlnGky2yYrHmv43mcCcyoJ/g+g+38GXNYGorsFi/XUhvctEpFFFPF71
0umQwWhA6Dhc52hMj5DN4nfspp//hEuV9o7/zJzlPi0elC+xtVaBWZCorNvznlXW
C/O5yobJVd89PuIE17Stg+c0Rq3k7RVPPoyS2IaMkZ5cs7DtT5Tz3nKClPAA8Bur
mPLAMkHSOSLO26cy30SVZCIx1JDMJU9dx/PkFemCexkzYXQxgp9px/8wmM2Xe6Oc
/hDqi8V7ayJUuYpYuJ6sA8oUqj3j+NkoP4w5HhlnmZDBhFMEBhM=
=jeHm
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- NFSv42: Fix pagecache invalidation after COPY/CLONE
Bugfixes:
- NFSv42: Don't fail clone() just because the server failed to return
post-op attributes
- SUNRPC: use different lockdep keys for INET6 and LOCAL
- NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
- SUNRPC: fix header include guard in trace header"
* tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: use different lock keys for INET6 and LOCAL
sunrpc: fix header include guard in trace header
NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION
NFSv42: Fix pagecache invalidation after COPY/CLONE
NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()
NFSv42: Don't fail clone() unless the OP_CLONE operation failed
- Fix an ABBA deadlock introduced by XArray convention.
-----BEGIN PGP SIGNATURE-----
iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYaG1ZxEceGlhbmdAa2Vy
bmVsLm9yZwAKCRA5NzHcH7XmBFQqAP0Zk3Nozb1BnXnQ/xGZR+liyTimNlujsWqQ
YtHIwB4EiwEAopRnuZRY00txMqzYZNz3sLEV6ylATbN0i/EJ2brOpA8=
=ZHsQ
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
"Fix an ABBA deadlock introduced by XArray conversion"
* tag 'erofs-for-5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix deadlock when shrink erofs slab
We got issue as follows:
================================================================================
UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14
signed integer overflow:
-4966321760114568020 * 1000000000 cannot be represented in type 'long long int'
CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x170/0x1dc lib/dump_stack.c:118
ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
handle_overflow+0x188/0x1dc lib/ubsan.c:192
__ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
ktime_set include/linux/ktime.h:42 [inline]
timespec64_to_ktime include/linux/ktime.h:78 [inline]
io_timeout fs/io_uring.c:5153 [inline]
io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599
__io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988
io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067
io_submit_sqe fs/io_uring.c:6137 [inline]
io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331
__do_sys_io_uring_enter fs/io_uring.c:8170 [inline]
__se_sys_io_uring_enter fs/io_uring.c:8129 [inline]
__arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129
invoke_syscall arch/arm64/kernel/syscall.c:53 [inline]
el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121
el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190
el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017
================================================================================
As ktime_set only judge 'secs' if big than KTIME_SEC_MAX, but if we pass
negative value maybe lead to overflow.
To address this issue, we must check if 'sec' is negative.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYZ+ZHwAKCRDh3BK/laaZ
PFcKAQC/1E+WRv8MT7BmtS3FBnnuH0hSBJe5KTVL+sQI72NC2AD/eKeX3FKXpimt
C3hrTAEegduHSRNMCZn+4juHZGFRtAw=
=n755
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
"Fix a regression caused by a bugfix in the previous release. The
symptom is a VM_BUG_ON triggered from splice to the fuse device.
Unfortunately the original bugfix was already backported to a number
of stable releases, so this fix-fix will need to be backported as
well"
* tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: release pipe buf after last use
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmGg+PIACgkQxWXV+ddt
WDsWDBAAk//y15bs3SGQLPshFuFidcRmudQkXGE8431ftjWZtuz1UY046rVvC8I5
FkotiiDVrnuyg1kBU1k0vYGS0Fo687JHw2E+8abD2KCicF4MSNIAlc5D9M5B7jzp
GpSIPoZjs05H85kxcoy3BmdQR1DyjR/xOqvTe2IVswPQSj2B1qZKMCbU4927U+e8
Tu18kr7FTxpnzLhtt9Ahr8xVol6bcV/3zB0nC+O5hRbfg5gH87tpb7giLICwfipq
eYY8I361iYDKtDQlR/qvpmkUAfPO4ahYz1yumTxm0twuIv34PugcCP0oLtGCbWwU
71YflcuZa6L2vmXNK8cjXf/9Frg/7k0FlyepEAhjhbooqT92m9Sv4iCX5z9mrsqf
40aoLBXrrPCewa9j31Aw+JuUEjWC4G7U2v+TqJ3waHgUyfeDQUGhOGLmvQJqGyMd
SL4QhGz9aQGlLmUVkDlUemkmMtGBwn87sD1HCJkkNrHS0OWOHb12tpijOu7UAIp3
ih/nXtAayVZV5LS1hfxfYuXzpP8E6dXUnR2vWpvvDqihvfq6ubkt497yVN+vUFBw
6GEhvePwa/w0U1Kn4cZUZxFnalBAmeYcahBBL9ngrHLBD57IJL/yLcOue5nCgqqN
DvLKl8tJUDQTmayx1MjdpdPNckAbY3cp0OP8uEowL0rJjxJPyAE=
=/pKc
-----END PGP SIGNATURE-----
Merge tag 'for-5.16-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix to the lzo code, a missing put_page causing memory leaks
when some error branches are taken"
* tag 'for-5.16-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix the memory leak caused in lzo_compress_pages()
WARNING: inconsistent lock state
5.16.0-rc2-syzkaller #0 Not tainted
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
ffff888078e11418 (&ctx->timeout_lock
){?.+.}-{2:2}
, at: io_timeout_fn+0x6f/0x360 fs/io_uring.c:5943
{HARDIRQ-ON-W} state was registered at:
[...]
spin_unlock_irq include/linux/spinlock.h:399 [inline]
__io_poll_remove_one fs/io_uring.c:5669 [inline]
__io_poll_remove_one fs/io_uring.c:5654 [inline]
io_poll_remove_one+0x236/0x870 fs/io_uring.c:5680
io_poll_remove_all+0x1af/0x235 fs/io_uring.c:5709
io_ring_ctx_wait_and_kill+0x1cc/0x322 fs/io_uring.c:9534
io_uring_release+0x42/0x46 fs/io_uring.c:9554
__fput+0x286/0x9f0 fs/file_table.c:280
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0xc14/0x2b40 kernel/exit.c:832
674ee8e1b4 ("io_uring: correct link-list traversal locking") fixed a
data race but introduced a possible deadlock and inconsistentcy in irq
states. E.g.
io_poll_remove_all()
spin_lock_irq(timeout_lock)
io_poll_remove_one()
spin_lock/unlock_irq(poll_lock);
spin_unlock_irq(timeout_lock)
Another type of problem is freeing a request while holding
->timeout_lock, which may leads to a deadlock in
io_commit_cqring() -> io_flush_timeouts() and other places.
Having 3 nested locks is also too ugly. Add io_match_task_safe(), which
would briefly take and release timeout_lock for race prevention inside,
so the actuall request cancellation / free / etc. code doesn't have it
taken.
Reported-by: syzbot+ff49a3059d49b0ca0eec@syzkaller.appspotmail.com
Reported-by: syzbot+847f02ec20a6609a328b@syzkaller.appspotmail.com
Reported-by: syzbot+3368aadcd30425ceb53b@syzkaller.appspotmail.com
Reported-by: syzbot+51ce8887cdef77c9ac83@syzkaller.appspotmail.com
Reported-by: syzbot+3cb756a49d2f394a9ee3@syzkaller.appspotmail.com
Fixes: 674ee8e1b4 ("io_uring: correct link-list traversal locking")
Cc: stable@kernel.org # 5.15+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/397f7ebf3f4171f1abe41f708ac1ecb5766f0b68.1637937097.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
WARNING: CPU: 1 PID: 20 at fs/io_uring.c:6269 io_try_cancel_userdata+0x3c5/0x640 fs/io_uring.c:6269
CPU: 1 PID: 20 Comm: kworker/1:0 Not tainted 5.16.0-rc1-syzkaller #0
Workqueue: events io_fallback_req_func
RIP: 0010:io_try_cancel_userdata+0x3c5/0x640 fs/io_uring.c:6269
Call Trace:
<TASK>
io_req_task_link_timeout+0x6b/0x1e0 fs/io_uring.c:6886
io_fallback_req_func+0xf9/0x1ae fs/io_uring.c:1334
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
kthread+0x405/0x4f0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>
We need original task's context to do cancellations, so if it's dying
and the callback is executed in a fallback mode, fail the cancellation
attempt.
Fixes: 89b263f6d5 ("io_uring: run linked timeouts from task_work")
Cc: stable@kernel.org # 5.15+
Reported-by: syzbot+ab0cfe96c2b3cd1c1153@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4c41c5f379c6941ad5a07cd48cb66ed62199cf7e.1637937097.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[BUG]
Fstests generic/027 is pretty easy to trigger a slow but steady memory
leak if run with "-o compress=lzo" mount option.
Normally one single run of generic/027 is enough to eat up at least 4G ram.
[CAUSE]
In commit d4088803f5 ("btrfs: subpage: make lzo_compress_pages()
compatible") we changed how @page_in is released.
But that refactoring makes @page_in only released after all pages being
compressed.
This leaves error path not releasing @page_in. And by "error path"
things like incompressible data will also be treated as an error
(-E2BIG).
Thus it can cause a memory leak if even nothing wrong happened.
[FIX]
Add check under @out label to release @page_in when needed, so when we
hit any error, the input page is properly released.
Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: d4088803f5 ("btrfs: subpage: make lzo_compress_pages() compatible")
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGfuAQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmBbD/9yy8RHZAFfPLhjuyVaUak6/KlLe39UEhqu
75tqBNyWkN/mBon34tIxoUJdDCw15GZXW3G+xDgHOP2+RgpsqZX97Oj7f1m/0L+Z
XpYHKc8j1GmNiXU22nWt7GGSAKFiGZtlW0ppe2BX75xVcqPCiPxge20Ev0wTZAcr
Vrm0r2XKZQ+WI1qLXoKIOxO+bNBzlzhZ/sV2W/6akz6lQIeztFqk6PhovERbBBwi
/UlyWJ4z+ZKQYjpomKA8f41IvVrh3MrQlILYFOHQFj83Wa1OrBM4mD8i1ZAJf907
wWEBgh0Em5GRLzQijXQo/vz3RNgklzX1sP4C5j6njlukgCklEOLhhunEFkk+qtps
Vbc+Hm+jAP2VJjHfMmsttMKCqlJYtR85j71dip+j3JXheaGirUxf97RDHWOFtCFO
+AFbVu7S7H2GI9LJb/WEr66EmbE5RlzCKSeyrBZb96Y7PYdfTwq8TgjzDPSZEm3B
tPvYQgXPXoiFjzlaCgN372YJJIfJhpVOhBTu0hJ5oyPW/Rsy/vGyrhOWEfoc+nGh
RY8ou7qOnnspVwa8PVknjk01BWTvSNgvqVp82WywqZHqQFxEj5YCeEwg2EVuHQt7
owXNO7dtJ+W1SNJDD3KfQx+cwNv8Alf1G02A4dl8xzTgETiHsng+R/t0IQGb7mrf
68hXzT+qSQ==
=y8h2
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.16-2021-11-25' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A locking fix for link traversal, and fixing up an outdated function
name in a comment"
* tag 'io_uring-5.16-2021-11-25' of git://git.kernel.dk/linux-block:
io_uring: correct link-list traversal locking
io_uring: fix missed comment from *task_file rename
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmGfEPEACgkQiiy9cAdy
T1F7Ogv+J9/ODordrHKl5AQWwtFfcWAVCv2K9hJHoaYbj4fpQVzreyBwuSFqryxe
udO8msORq/KySIOMATvPkse8+ZZ+0ZMFF+n4I5kKMaoJL3UZHqjJopUL+zv61817
Qwyd8KCwnrskpZ+q8OfbHDOvKxHpBt/YVP6RdE6R9DBtBIY0zJB8dcD3srYkfDi+
yWjab3cCFoe2p9dzs9PG6eWmmDMM/AcO3m5WEI12hBE30NS+n8CACVFbjImC/ZVV
8SokX33X4CvCZWn2BGdJBE1OFb4Nviz6QChbRd18Z0JOGiWAtlBWclxykedvMGcn
FR/pUFF+7RhXFFK4lTkzJwqj34FoovTiS9X81fLo7lCCXTT0Ki0yp+mKtoNQAaBj
YTDoISjG5+ZQfes12zy1NwARzpKWEwFAR6A8mHlj3QUVP7wkClNlKqZSw9bzRDE7
yOd+4PoH8Hy0ZiprmgTnV92s7+HiD5vucFkzurvBM7hqQAC9LcW1CEix925jMOw0
71Oow+aE
=sMmj
-----END PGP SIGNATURE-----
Merge tag '5.16-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four small cifs/smb3 fixes:
- two multichannel fixes
- fix problem noted by kernel test robot
- update internal version number"
* tag '5.16-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal version number
smb2: clarify rc initialization in smb2_reconnect
cifs: populate server_hostname for extra channels
cifs: nosharesock should be set on new server
- Fix compilation warnings on csky and sparc
- Rename multipage folios to large folios
- Rename AS_THP_SUPPORT and FS_THP_SUPPORT
- Add functions to zero portions of a folio
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmGem6wACgkQDpNsjXcp
gj7uvwgAjNqDWOVgwYU98daN6nKQQf5Vv35f0bzeKcKcHIOEWZ2+MUeXkI55h8TD
ss5L3O86sPtQmpKUQJJChZC4AhpIPRyjPA0JW6vYqXQd912M331WpGgFFyX5eI+3
OxfKLRULmopeWP1RjWmkWqlhYQHL5OLgAMC4VaBSfDHd1UMRf+F9JNm9qR7GCp9Q
Vb0qcmBMaQYt/K5sWRQyPUACVTF+27RLKAs+Om37NGekv1UqgOPMzi9nAyi9RjCi
rRY6oGupNgC+Y41jzlpaNoL71RPS92H769FBh/Fe4qu55VSPjfcN77qAnVhX5Ykn
4RhzZcEUoqlx9xG9xynk0mmbx2Bf4g==
=kvqM
-----END PGP SIGNATURE-----
Merge tag 'folio-5.16b' of git://git.infradead.org/users/willy/pagecache
Pull folio fixes from Matthew Wilcox:
"In the course of preparing the folio changes for iomap for next merge
window, we discovered some problems that would be nice to address now:
- Renaming multi-page folios to large folios.
mapping_multi_page_folio_support() is just a little too long, so we
settled on mapping_large_folio_support(). That meant renaming, eg
folio_test_multi() to folio_test_large().
Rename AS_THP_SUPPORT to match
- I hadn't included folio wrappers for zero_user_segments(), etc.
Also, multi-page^W^W large folio support is now independent of
CONFIG_TRANSPARENT_HUGEPAGE, so machines with HIGHMEM always need
to fall back to the out-of-line zero_user_segments().
Remove FS_THP_SUPPORT to match
- The build bots finally got round to telling me that I missed a
couple of architectures when adding flush_dcache_folio(). Christoph
suggested that we just add linux/cacheflush.h and not rely on
asm-generic/cacheflush.h"
* tag 'folio-5.16b' of git://git.infradead.org/users/willy/pagecache:
mm: Add functions to zero portions of a folio
fs: Rename AS_THP_SUPPORT and mapping_thp_support
fs: Remove FS_THP_SUPPORT
mm: Remove folio_test_single
mm: Rename folio_test_multi to folio_test_large
Add linux/cacheflush.h
When a new inode is created, send its security context to server along with
creation request (FUSE_CREAT, FUSE_MKNOD, FUSE_MKDIR and FUSE_SYMLINK).
This gives server an opportunity to create new file and set security
context (possibly atomically). In all the configurations it might not be
possible to set context atomically.
Like nfs and ceph, use security_dentry_init_security() to dermine security
context of inode and send it with create, mkdir, mknod, and symlink
requests.
Following is the information sent to server.
fuse_sectx_header, fuse_secctx, xattr_name, security_context
- struct fuse_secctx_header
This contains total number of security contexts being sent and total
size of all the security contexts (including size of
fuse_secctx_header).
- struct fuse_secctx
This contains size of security context which follows this structure.
There is one fuse_secctx instance per security context.
- xattr name string
This string represents name of xattr which should be used while setting
security context.
- security context
This is the actual security context whose size is specified in
fuse_secctx struct.
Also add the FUSE_SECURITY_CTX flag for the `flags` field of the
fuse_init_out struct. When this flag is set the kernel will append the
security context for a newly created inode to the request (create, mkdir,
mknod, and symlink). The server is responsible for ensuring that the inode
appears atomically (preferrably) with the requested security context.
For example, If the server is using SELinux and backed by a "real" linux
file system that supports extended attributes it can write the security
context value to /proc/thread-self/attr/fscreate before making the syscall
to create the inode.
This patch is based on patch from Chirantan Ekbote <chirantan@chromium.org>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Checking buf->flags should be done before the pipe_buf_release() is called
on the pipe buffer, since releasing the buffer might modify the flags.
This is exactly what page_cache_pipe_buf_release() does, and which results
in the same VM_BUG_ON_PAGE(PageLRU(page)) that the original patch was
trying to fix.
Reported-by: Justin Forbes <jmforbes@linuxtx.org>
Fixes: 712a951025 ("fuse: fix page stealing")
Cc: <stable@vger.kernel.org> # v2.6.35
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
FUSE_INIT flags are close to running out, so add another 32bits worth of
space.
Add FUSE_INIT_EXT flag to the old flags field in fuse_init_in. If this
flag is set, then fuse_init_in is extended by 48bytes, in which a flags_hi
field is allocated to contain the high 32bits of the flags.
A flags_hi field is also added to fuse_init_out, allocated out of the
remaining unused fields.
Known userspace implementations of the fuse protocol have been checked to
accept the extended FUSE_INIT request, but this might cause problems with
other implementations. If that happens to be the case, the protocol
negotiation will have to be extended with an extra initialization request
roundtrip.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If xattr is not supported like exfat or fat, ksmbd server doesn't
contain default data stream in FILE_STREAM_INFORMATION response. It will
cause ppt or doc file update issue if local filesystem is such as ones.
This patch move goto statement to contain it.
Fixes: 9f6323311c ("ksmbd: add default data stream name in FILE_STREAM_INFORMATION")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
While file transfer through windows client, This error flood message
happen. This flood message will cause performance degradation and
misunderstand server has problem.
Fixes: e294f78d34 ("ksmbd: allow PROTECTED_DACL_SECINFO and UNPROTECTED_DACL_SECINFO addition information in smb2 set info security")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
All the error handling paths of 'smb2_sess_setup()' end to 'out_err'.
All but the new error handling path added by the commit given in the Fixes
tag below.
Fix this error handling path and branch to 'out_err' as well.
Fixes: 0d994cd482 ("ksmbd: add buffer validation in session setup")
Cc: stable@vger.kernel.org # v5.15
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Steve French <stfrench@microsoft.com>
Current IOSQE_IO_DRAIN implementation doesn't work well with CQE
skipping and it's not allowed, otherwise some requests might be not
executed until the ring is destroyed and the userspace would hang.
Let's fail all drain requests after seeing IOSQE_CQE_SKIP_SUCCESS at
least once. All drained requests prior to that will get run normally,
so there should be no stalls. However, even though such mixing wouldn't
lead to issues at the moment, it's still not allowed as the behaviour
may change.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bcf7164f8bf3eb54b7bb7b4fd119907fa4d4d43b.1636559119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Emitting a CQE is expensive from the kernel perspective. Often, it's
also not convenient for the userspace, spends some cycles on processing
and just complicates the logic. A similar problems goes for linked
requests, where we post an CQE for each request in the link.
Introduce a new flags, IOSQE_CQE_SKIP_SUCCESS, trying to help with it.
When set and a request completed successfully, it won't generate a CQE.
When fails, it produces an CQE, but all following linked requests will
be CQE-less, regardless whether they have IOSQE_CQE_SKIP_SUCCESS or not.
The notion of "fail" is the same as for link failing-cancellation, where
it's opcode dependent, and _usually_ result >= 0 is a success, but not
always.
Linked timeouts are a bit special. When the requests it's linked to was
not attempted to be executed, e.g. failing linked requests, it follows
the description above. Otherwise, whether a linked timeout will post a
completion or not solely depends on IOSQE_CQE_SKIP_SUCCESS of that
linked timeout request. Linked timeout never "fail" during execution, so
for them it's unconditional. It's expected for users to not really care
about the result of it but rely solely on the result of the master
request. Another reason for such a treatment is that it's racy, and the
timeout callback may be running awhile the master request posts its
completion.
use case 1:
If one doesn't care about results of some requests, e.g. normal
timeouts, just set IOSQE_CQE_SKIP_SUCCESS. Error result will still be
posted and need to be handled.
use case 2:
Set IOSQE_CQE_SKIP_SUCCESS for all requests of a link but the last,
and it'll post a completion only for the last one if everything goes
right, otherwise there will be one only one CQE for the first failed
request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0220fbe06f7cf99e6fc71b4297bb1cb6c0e89c2c.1636559119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split io_cqring_fill_event() into a couple of more targeted functions.
The first on is io_fill_cqe_aux() for completions that are not
associated with request completions and doing the ->cq_extra accounting.
Examples are additional CQEs from multishot poll and rsrc notifications.
The second is io_fill_cqe_req(), should be called when it's a normal
request completion. Nothing more to it at the moment, will be used in
later patches.
The last one is inlined __io_fill_cqe() for a finer grained control,
should be used with caution and in hottest places.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/59a9117a4a44fc9efcf04b3afa51e0d080f5943c.1636559119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change iomap_read_inline_data to return 0 or an error code; this
simplifies the callers. Add a description.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: document the return value of iomap_read_inline_data explicitly]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
With the remove of xfs_dqrele_all_inodes, xfs_inew_wait and all the
infrastructure used to wake the XFS_INEW bit waitqueue is unused.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 777eb1fa85 ("xfs: remove xfs_dqrele_all_inodes")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
When testing xfstests xfs/126 on lastest upstream kernel, it will hang on some machine.
Adding a getxattr operation after xattr corrupted, I can reproduce it 100%.
The deadlock as below:
[983.923403] task:setfattr state:D stack: 0 pid:17639 ppid: 14687 flags:0x00000080
[ 983.923405] Call Trace:
[ 983.923410] __schedule+0x2c4/0x700
[ 983.923412] schedule+0x37/0xa0
[ 983.923414] schedule_timeout+0x274/0x300
[ 983.923416] __down+0x9b/0xf0
[ 983.923451] ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[ 983.923453] down+0x3b/0x50
[ 983.923471] xfs_buf_lock+0x33/0xf0 [xfs]
[ 983.923490] xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[ 983.923508] xfs_buf_get_map+0x4c/0x320 [xfs]
[ 983.923525] xfs_buf_read_map+0x53/0x310 [xfs]
[ 983.923541] ? xfs_da_read_buf+0xcf/0x120 [xfs]
[ 983.923560] xfs_trans_read_buf_map+0x1cf/0x360 [xfs]
[ 983.923575] ? xfs_da_read_buf+0xcf/0x120 [xfs]
[ 983.923590] xfs_da_read_buf+0xcf/0x120 [xfs]
[ 983.923606] xfs_da3_node_read+0x1f/0x40 [xfs]
[ 983.923621] xfs_da3_node_lookup_int+0x69/0x4a0 [xfs]
[ 983.923624] ? kmem_cache_alloc+0x12e/0x270
[ 983.923637] xfs_attr_node_hasname+0x6e/0xa0 [xfs]
[ 983.923651] xfs_has_attr+0x6e/0xd0 [xfs]
[ 983.923664] xfs_attr_set+0x273/0x320 [xfs]
[ 983.923683] xfs_xattr_set+0x87/0xd0 [xfs]
[ 983.923686] __vfs_removexattr+0x4d/0x60
[ 983.923688] __vfs_removexattr_locked+0xac/0x130
[ 983.923689] vfs_removexattr+0x4e/0xf0
[ 983.923690] removexattr+0x4d/0x80
[ 983.923693] ? __check_object_size+0xa8/0x16b
[ 983.923695] ? strncpy_from_user+0x47/0x1a0
[ 983.923696] ? getname_flags+0x6a/0x1e0
[ 983.923697] ? _cond_resched+0x15/0x30
[ 983.923699] ? __sb_start_write+0x1e/0x70
[ 983.923700] ? mnt_want_write+0x28/0x50
[ 983.923701] path_removexattr+0x9b/0xb0
[ 983.923702] __x64_sys_removexattr+0x17/0x20
[ 983.923704] do_syscall_64+0x5b/0x1a0
[ 983.923705] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 983.923707] RIP: 0033:0x7f080f10ee1b
When getxattr calls xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in
xfs_attr_node_hasname because we have use blocktrash to random it in xfs/126. So it
free state in internal and xfs_attr_node_get doesn't do xfs_buf_trans release job.
Then subsequent removexattr will hang because of it.
This bug was introduced by kernel commit 07120f1abd ("xfs: Add xfs_has_attr and subroutines").
It adds xfs_attr_node_hasname helper and said caller will be responsible for freeing the state
in this case. But xfs_attr_node_hasname will free state itself instead of caller if
xfs_da3_node_lookup_int fails.
Fix this bug by moving the step of free state into caller.
Also, use "goto error/out" instead of returning error directly in xfs_attr_node_addname_find_attr and
xfs_attr_node_removename_setup function because we should free state ourselves.
Fixes: 07120f1abd ("xfs: Add xfs_has_attr and subroutines")
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The kernfs implementation has big lock granularity(kernfs_rwsem) so
every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the
lock. It makes trouble for some cases to wait the global lock
for a long time even though they are totally independent contexts
each other.
A general example is process A goes under direct reclaim with holding
the lock when it accessed the file in sysfs and process B is waiting
the lock with exclusive mode and then process C is waiting the lock
until process B could finish the job after it gets the lock from
process A.
This patch switches the global kernfs_rwsem to per-fs lock, which
put the rwsem into kernfs_root.
Suggested-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20211118230008.2679780-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Simplify failed resubmission prep in kiocb_done(), it's a bit ugly with
conditional logic and hand handling cflags / select buffers. Instead,
punt to tw and use io_req_task_complete() already handling all the
cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/667c33484b05b612e9420e1b1d5f4dc46d0ee9ce.1637524285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is clearer to initialize rc at the beginning of the function.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Recently, a new field got added to the smb3_fs_context struct
named server_hostname. While creating extra channels, pick up
this field from primary channel.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Recent fix to maintain a nosharesock state on the
server struct caused a regression. It updated this
field in the old tcp session, and not the new one.
This caused the multichannel scenario to misbehave.
Fixes: c9f1c19cf7 (cifs: nosharesock should not share socket with future sessions)
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
We observed the following deadlock in the stress test under low
memory scenario:
Thread A Thread B
- erofs_shrink_scan
- erofs_try_to_release_workgroup
- erofs_workgroup_try_to_freeze -- A
- z_erofs_do_read_page
- z_erofs_collection_begin
- z_erofs_register_collection
- erofs_insert_workgroup
- xa_lock(&sbi->managed_pslots) -- B
- erofs_workgroup_get
- erofs_wait_on_workgroup_freezed -- A
- xa_erase
- xa_lock(&sbi->managed_pslots) -- B
To fix this, it needs to hold xa_lock before freezing the workgroup
since xarray will be touched then. So let's hold the lock before
accessing each workgroup, just like what we did with the radix tree
before.
[ Gao Xiang: Jianhua Hao also reports this issue at
https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ]
Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com
Fixes: 64094a0441 ("erofs: convert workstn to XArray")
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Reported-by: Jianhua Hao <haojianhua1@xiaomi.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
Before commit 740499c784 ("iomap: fix the iomap_readpage_actor return
value for inline data"), when hitting an IOMAP_INLINE extent,
iomap_readpage_actor would report having read the entire page. Since
then, it only reports having read the inline data (iomap->length).
This will force iomap_readpage into another iteration, and the
filesystem will report an unaligned hole after the IOMAP_INLINE extent.
But iomap_readpage_actor (now iomap_readpage_iter) isn't prepared to
deal with unaligned extents, it will get things wrong on filesystems
with a block size smaller than the page size, and we'll eventually run
into the following warning in iomap_iter_advance:
WARN_ON_ONCE(iter->processed > iomap_length(iter));
Fix that by changing iomap_readpage_iter to return 0 when hitting an
inline extent; this will cause iomap_iter to stop immediately.
To fix readahead as well, change iomap_readahead_iter to pass on
iomap_readpage_iter return values less than or equal to zero.
Fixes: 740499c784 ("iomap: fix the iomap_readpage_actor return value for inline data")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge misc fixes from Andrew Morton:
"15 patches.
Subsystems affected by this patch series: ipc, hexagon, mm (swap,
slab-generic, kmemleak, hugetlb, kasan, damon, and highmem), and proc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
proc/vmcore: fix clearing user buffer by properly using clear_user()
kmap_local: don't assume kmap PTEs are linear arrays in memory
mm/damon/dbgfs: fix missed use of damon_dbgfs_lock
mm/damon/dbgfs: use '__GFP_NOWARN' for user-specified size buffer allocation
kasan: test: silence intentional read overflow warnings
hugetlb, userfaultfd: fix reservation restore on userfaultfd error
hugetlb: fix hugetlb cgroup refcounting during mremap
mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
hexagon: ignore vmlinux.lds
hexagon: clean up timer-regs.h
hexagon: export raw I/O routines for modules
mm: emit the "free" trace report before freeing memory in kmem_cache_free()
shm: extend forced shm destroy to support objects from several IPC nses
ipc: WARN if trying to remove ipc object which is absent
mm/swap.c:put_pages_list(): reinitialise the page list
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGYctEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqXjD/9Ef3i/OXjvUOOG19gEk4B1rr0youVioJg9
0HYmXYGDLHZZd01mBdOCrbvadYaJRw3SG+whzNyayekPhs38kFHTqBNHcXOGphm1
4snuDVFz0kElDswRjnh/q9QoJNDGwDidy6L+KHXMbHhTPuXALXshc+6U3GGWJ0Gg
5EEKwfJGRkIzJJ9fL9d9GbyAFMLq8xXr1pf9LdNJEZL2RaBkh4gI7Uu0q6vYGh88
N1iz2LdF4D4uopB6GkT6Eup/5iKakGbv2M2edcHpICdG/3EKn3Q8pnUajxVvXkV/
kOR4zRy2on7xYKIVZGKvq6/8e7Wde9yfBGVQ1D/45uzKIeTiDrP0MWo8ldP5vkbU
yTc3/BiLW5Nkk67RXs3Llg5QDX+n69BHNtXepO8W6DBMdLRFqZSiM0Y1Xrba/+yU
4TJif/wlArErjUUsWuWlnSzKrn1CyRLxxmewXfhxgpy12d4pTQsDLMKs7CojCmoF
t265dmvzXeNtAymuS/WSk0GlnobEO+wvfVUzDUQir6PukDSvz1vw5OIdsMaf2dmX
QDrcgVnGIjNQfLQVoX8FF0u52HcElP1+iikcs/XCNj67f9qImiBYsUYqDtfW6yFl
56IKkq4akSWPgA/qnV9oJ/Cf/AV8WI7rXV32hivo41nXaclrJouRfrarffTRzV9n
gHP0k9W6Cg==
=IqIF
-----END PGP SIGNATURE-----
Merge tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Flip a cap check to avoid a selinux error (Alistair)
- Fix for a regression this merge window where we can miss a queue ref
put (me)
- Un-mark pstore-blk as broken, as the condition that triggered that
change has been rectified (Kees)
- Queue quiesce and sync fixes (Ming)
- FUA insertion fix (Ming)
- blk-cgroup error path put fix (Yu)
* tag 'block-5.16-2021-11-19' of git://git.kernel.dk/linux-block:
blk-mq: don't insert FUA request with data into scheduler queue
blk-cgroup: fix missing put device in error path from blkg_conf_pref()
block: avoid to quiesce queue in elevator_init_mq
Revert "mark pstore-blk as broken"
blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
block: fix missing queue put in error path
block: Check ADMIN before NICE for IOPRIO_CLASS_RT
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmGYKGIACgkQiiy9cAdy
T1ED5wwAtDe73G0fLjajrlVaOfBgHVqRWg1QkGxrwe0Pk8BXJkrrfze5kizD7B5D
xgVDbAWmF3wfkoGukggpjrLpwe/36F/xdHHIATRH2xG7zielDSab/RHPZQx4xPJQ
Qz/F7f5N8cSvDYX5HdTYncAtF3yV6MM48n9N6fBoKTL43mDWK7EI90KM5EkL2Mdc
DT1wYzTpuNoR1qY4oBIftV8mau6DAVtE/GIdpijzIbCf07xADvaM62QmA1qFLCFT
zlya3RmgyTS2UtCV9pKnbzZ2o1rm7J/C6YWvqrggH24Fu7V4nGTAx5yY/X2Zm3iu
uyoMTT1uvaiaihFCVUYN2e4jhYX/SuWA2Q0WczHAcx0LFXWWsIe5pOl78i9aQkU3
UQhTk4G1HNd35CEtR53HEil8wt674p2D/kJpZ5VL6OIg7H1jgwG6UoterNEbEUVT
qFvKCbePXyt8kDdgk9DqsAQQZTkJKMeiuk+hwVUngDRd/jsp7N7p4ISdBQNzNVG3
JVtinWzM
=IGYd
-----END PGP SIGNATURE-----
Merge tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three small cifs/smb3 fixes: two to address minor coverity issues and
one cleanup"
* tag '5.16-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: introduce cifs_ses_mark_for_reconnect() helper
cifs: protect srv_count with cifs_tcp_ses_lock
cifs: move debug print out of spinlock
To clear a user buffer we cannot simply use memset, we have to use
clear_user(). With a virtio-mem device that registers a vmcore_cb and
has some logically unplugged memory inside an added Linux memory block,
I can easily trigger a BUG by copying the vmcore via "cp":
systemd[1]: Starting Kdump Vmcore Save Service...
kdump[420]: Kdump is using the default log level(3).
kdump[453]: saving to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[458]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2021-11-11-14:59:22/
kdump[465]: saving vmcore-dmesg.txt complete
kdump[467]: saving vmcore
BUG: unable to handle page fault for address: 00007f2374e01000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 7a523067 P4D 7a523067 PUD 7a528067 PMD 7a525067 PTE 800000007048f867
Oops: 0003 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 468 Comm: cp Not tainted 5.15.0+ #6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-27-g64f37cc530f1-prebuilt.qemu.org 04/01/2014
RIP: 0010:read_from_oldmem.part.0.cold+0x1d/0x86
Code: ff ff ff e8 05 ff fe ff e9 b9 e9 7f ff 48 89 de 48 c7 c7 38 3b 60 82 e8 f1 fe fe ff 83 fd 08 72 3c 49 8d 7d 08 4c 89 e9 89 e8 <49> c7 45 00 00 00 00 00 49 c7 44 05 f8 00 00 00 00 48 83 e7 f81
RSP: 0018:ffffc9000073be08 EFLAGS: 00010212
RAX: 0000000000001000 RBX: 00000000002fd000 RCX: 00007f2374e01000
RDX: 0000000000000001 RSI: 00000000ffffdfff RDI: 00007f2374e01008
RBP: 0000000000001000 R08: 0000000000000000 R09: ffffc9000073bc50
R10: ffffc9000073bc48 R11: ffffffff829461a8 R12: 000000000000f000
R13: 00007f2374e01000 R14: 0000000000000000 R15: ffff88807bd421e8
FS: 00007f2374e12140(0000) GS:ffff88807f000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2374e01000 CR3: 000000007a4aa000 CR4: 0000000000350eb0
Call Trace:
read_vmcore+0x236/0x2c0
proc_reg_read+0x55/0xa0
vfs_read+0x95/0x190
ksys_read+0x4f/0xc0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Some x86-64 CPUs have a CPU feature called "Supervisor Mode Access
Prevention (SMAP)", which is used to detect wrong access from the kernel
to user buffers like this: SMAP triggers a permissions violation on
wrong access. In the x86-64 variant of clear_user(), SMAP is properly
handled via clac()+stac().
To fix, properly use clear_user() when we're dealing with a user buffer.
Link: https://lkml.kernel.org/r/20211112092750.6921-1-david@redhat.com
Fixes: 997c136f51 ("fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pages")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmGWiSAACgkQxWXV+ddt
WDtKiA//VFrxg5I1yrTyyVvc2RqcPg0aCopO6wIAgcHV1yzseJ7AyP7two1p5dg8
3DPDKaXFvONZYXl8j9ZuzFiryKPGJxp1KSagKyt6EKDRYm50HOreTC1Qt2ZvLJHn
wHohwHX96yv+4gyKvpCBZVpp3dSIDbsbCxlpz3mm7kZv//wHxA5l0chZpHbTqUF6
JloRSrOIGlSeQYPog1Lnu1c92qoGzLL5n47aXS3s5afpkqqkOlKZLsyb90N4uJx4
M1htsl4ga7b3OB8jbR95wlbd/qXsB+dvaBUQHgDm4hafW6ma5ft9NhuePQnQlaVH
ub/rlfNTsKl6jly9eNJ6wGpqi/OBlhA4qCmQVbVDE+HhWUJbdUiQ5UgxoOrQlkOP
Pd3NvW+95qg+Lj/egUA/Mrtz1v/6oSKcf3gQVKMNIrnk6lOUVZWtQhBe5YS3qHih
PzxrCp4ThlvmVeemHS7783akiwkI49wUn7a6dUD87x81ghemUHJzC83/mgs1rl/0
7Q1QLetgfrZpko3W4GzS2J3WwKTB0tvBXxsZ8gU5gI0FNkx90bR8+xI0fVF8IGJo
QglHn9gepb6si7BCxyKDTlQNMt23s7GFH5/4hHtkomtlR6vpRbPJAq5mpOrqsLgJ
VGc/SwCJPSmynqRAxuCn+DqlfaMZZaqtvgVVWnhJl9ylKyUAQKU=
=ze0L
-----END PGP SIGNATURE-----
Merge tag 'for-5.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Several xes and one old ioctl deprecation. Namely there's fix for
crashes/warnings with lzo compression that was suspected to be caused
by first pull merge resolution, but it was a different bug.
Summary:
- regression fix for a crash in lzo due to missing boundary checks of
the page array
- fix crashes on ARM64 due to missing barriers when synchronizing
status bits between work queues
- silence lockdep when reading chunk tree during mount
- fix false positive warning in integrity checker on devices with
disabled write caching
- fix signedness of bitfields in scrub
- start deprecation of balance v1 ioctl"
* tag 'for-5.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: deprecate BTRFS_IOC_BALANCE ioctl
btrfs: make 1-bit bit-fields of scrub_page unsigned int
btrfs: check-integrity: fix a warning on write caching disabled disk
btrfs: silence lockdep when reading chunk tree during mount
btrfs: fix memory ordering between normal and ordered work functions
btrfs: fix a out-of-bound access in copy_compressed_data_to_page()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmGWL+AACgkQnJ2qBz9k
QNnK3ggAtJLlaCwg4TxyIIGEWBP/KBKs2STLf80VDavhxZuplKKZ0o5/9IPZ2uu9
v9SgME4hnGpUNaECttVWqijT823SiIxA/4uFSBnN1hLlvPj2Wkwl0oRWb6vCFzZ5
OBUiZrycZKxEq78p2Ao8+FxeRsJ9RuXHnPkZULAqizkEue23d+gW33cO0x//UwB0
lB6i32yDqCmOAtLaT1Div5ttJ3oM/hGBbOURkh3YB7jV2MxojHX28ZjDv657S30k
lhpQwvmHF70jM7MPWbxtcbSTBdPKQdaF5+oDqpAFwDQn3nhEk1sjw6IrCUKr2NAK
Z+YFlRgv5eD+loRPEBVVU/6QrmfSEQ==
=ueAF
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fix from Jan Kara:
"A fix for a long-standing UDF bug where we were not properly
validating directory position inside readdir"
* tag 'fs_for_v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix crash after seekdir
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYZZM2AAKCRCRxhvAZXjc
onQYAP4oKQBguYvFThF4H4VnJY1xoS/33rNOqwPumubdj/8P2AD9EFOegFKYBFr7
tCpokujZl3jZTOqV6h32JDQkPZB0kAc=
=AqiS
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull setattr idmapping fix from Christian Brauner:
"This contains a simple fix for setattr. When determining the validity
of the attributes the ia_{g,u}id fields contain the value that will be
written to inode->i_{g,u}id. When the {g,u}id attribute of the file
isn't altered and the caller's fs{g,u}id matches the current {g,u}id
attribute the attribute change is allowed.
The value in ia_{g,u}id does already account for idmapped mounts and
will have taken the relevant idmapping into account. So in order to
verify that the {g,u}id attribute isn't changed we simple need to
compare the ia_{g,u}id value against the inode's i_{g,u}id value.
This only has any meaning for idmapped mounts as idmapping helpers are
idempotent without them. And for idmapped mounts this really only has
a meaning when circular idmappings are used, i.e. mappings where e.g.
id 1000 is mapped to id 1001 and id 1001 is mapped to id 1000. Such
ciruclar mappings can e.g. be useful when sharing the same home
directory between multiple users at the same time.
Before this patch we could end up denying legitimate attribute changes
and allowing invalid attribute changes when circular mappings are
used. To even get into this situation the caller must've been
privileged both to create that mapping and to create that idmapped
mount.
This hasn't been seen in the wild anywhere but came up when expanding
the fstest suite during work on a series of hardening patches. All
idmapped fstests pass without any regressions and we're adding new
tests to verify the behavior of circular mappings.
The new tests can be found at [1]"
Link: https://lore.kernel.org/linux-fsdevel/20211109145713.1868404-2-brauner@kernel.org [1]
* tag 'fs.idmapped.v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: handle circular mappings correctly
Without a module param knob there was no way to enable pstore ftrace
recording early enough to debug hangs happening during the boot process
before userspace is up enough to enable it via the regular debugfs
knobs.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210610082134.20636-1-u.kleine-koenig@pengutronix.de
Block devices do not, in general, report congestion any more, so this
congestion_wait() is effectively just a sleep.
It isn't entirely clear what is being waited for, but as we only wait
when j_async_throttle is elevated, it seems reasonable to stop waiting
when j_async_throttle becomes zero - or after the same timeout.
So change to use wait_event_event_timeout() for waiting, and
wake_up_var() to signal an end to waiting.
Link: https://lore.kernel.org/r/163712368225.13692.3419908086400748349@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
* The current iomap_file_buffered_write behavior of failing the entire
write when part of the user buffer cannot be faulted in leads to an
endless loop in gfs2. Work around that in gfs2 for now.
* Various other bugs all over the place.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGVjogUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTojHA/7BvjlQKdDG84pbgFwWa2JBHgkfB5u
ayd4gBR9mq3WwOzMO7Wp9dUmUgDYjjj2ZJZkExHYgla+52ScRJAbcTk0VkNpSgy6
pOAZmPpTjmj4KGUghAZuUUqm55S5lpB6A2VciscW0GFGWE7kEOOcs/Gw5heSTpHW
zSLM2cQZ303kqd34KE747DchG6ruwlzsRms8qxwaV6Qe5ZyYx6xSBoL7bJ+VVqdF
Ujn3KEpFd5+aRkffwyypltkIsMh1yQceOCoTOp5G3U8zwl+bT8hhpY6qMvtMtT90
Pl5A7abpM7kGzdbdi3z+RtJK3DwmDgtfFrP6z4LTa4QJtUE8Dn7akIPH5v2E1ajI
vpQVepIS+ryn614nqCyXgGed975gFqNNgZheqiZq2+mQMsAO/fxxI4jMYXpB2Hyi
k4PoFm7K+Gcn8ARkuhlbZiZoo8CldmorZz7lVthyb3hNNzg2EzMK/K14KAz7F7W9
ndcXLnMZdxS3R4wlGVKtfLi/E9+VsM0MPTsfvsJtubq/lfu5d2G47FoQvsYFRRDW
icCzMFC+MshFS+GurSYHGni3x11C6M6wsL7m+YvkPLkDC4Jn4Ok9mdf77DiaHy2L
znc0w32FHSQHQZuYN+Fakr8K+W5k0XGkHeJpVzRglkQf2qH0hClqXnzmgbt60ECu
2rxg2KuSsEiYtrU=
=/Vqx
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
- The current iomap_file_buffered_write behavior of failing the entire
write when part of the user buffer cannot be faulted in leads to an
endless loop in gfs2. Work around that in gfs2 for now.
- Various other bugs all over the place.
* tag 'gfs2-v5.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Prevent endless loops in gfs2_file_buffered_write
gfs2: Fix "Introduce flag for glock holder auto-demotion"
gfs2: Fix length of holes reported at end-of-file
gfs2: release iopen glock early in evict
gfs2: Fix atomic bug in gfs2_instantiate
gfs2: Only dereference i->iov when iter_is_iovec(i)
When the client receives ERR_NOSPC on reply to CREATE_SESSION
it leads to a client hanging in nfs_wait_client_init_complete().
Instead, complete and fail the client initiation with an EIO
error which allows for the mount command to fail instead of
hanging.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
f2fs_write_begin() assumes that all blocks were preallocated by
default unless FI_NO_PREALLOC is explicitly set. This invites data
corruption, as there are cases in which not all blocks are preallocated.
Commit 47501f87c6 ("f2fs: preallocate DIO blocks when forcing
buffered_io") fixed one case, but there are others remaining.
Fix up this logic by replacing this flag with FI_PREALLOCATED_ALL, which
only gets set if all blocks for the current write were preallocated.
Also clean up f2fs_preallocate_blocks(), move it to file.c, and make it
handle some of the logic that was previously in write_iter() directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't alloc new page pointers array to replace old, just use old, introduce
valid_nr_cpages to indicate valid number of page pointers in array, try to
reduce one page array alloc and free when write compress page.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The mechanism in use to allow the client to see the results of COPY/CLONE
is to drop those pages from the pagecache. This forces the client to read
those pages once more from the server. However, truncate_pagecache_range()
zeros out partial pages instead of dropping them. Let us instead use
invalidate_inode_pages2_range() with full-page offsets to ensure the client
properly sees the results of COPY/CLONE operations.
Cc: <stable@vger.kernel.org> # v4.7+
Fixes: 2e72448b07 ("NFS: Add COPY nfs operation")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This provides some insight into the client's invalidation behavior to show
both when the client uses the helper, and the results of calling the
helper which can vary depending on how the helper is called.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The failure to retrieve post-op attributes has no bearing on whether or
not the clone operation itself was successful. We must therefore ignore
the return value of decode_getfattr() when looking at the success or
failure of nfs4_xdr_dec_clone().
Fixes: 36022770de ("nfs42: add CLONE xdr functions")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
In the event that a tracer changes which signal needs to be delivered
and that signal is currently blocked then the signal needs to be
requeued for later delivery.
With the advent of CLONE_THREAD the kernel has 2 signal queues per
task. The per process queue and the per task queue. Update the code
so that if the signal is removed from the per process queue it is
requeued on the per process queue. This is necessary to make it
appear the signal was never dequeued.
The rr debugger reasonably believes that the state of the process from
the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated
by simply letting a process run. If a SIGKILL interrupts a ptrace_stop
this is not true today.
So return signals to their original queue in ptrace_signal so that
signals that are not delivered appear like they were never dequeued.
Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGS3msVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+wCMQAIm7hCuZ7bNtJwdgabv/z3u9Cgre
2lBuFN5edrymgahKERBnA5bZCQGEnG/yVG6w69nB8LnqWphN2caG5ln17kSjfFCO
W/J7FBH9u7662GQhmxqZyrNVm/Td0vgyKH0uh2RTiaitN0JrPg+4gjAWOPUPq53I
lYVgm20Aj3LkH83MEwwp6K2u3pqJ5y+pqfDv6ROX6/HkPV+7yczleWLafB/EYQrs
zX4vSyrR7aLjJ5ZEFz4rokcsemq1iI4eqBr6fiwSZwIDbRBwPFdIlQTwuww4PGSW
ingM4y/RU3okUXV5exchex7ffzmPi8IvkTBOdn0RicHRcbm9f0Rky6wXiASJLTqu
QURh+rsvupfrHnLQ/b1bJtOrSJCdJXdidw8bA7vrpsmpatImnS+u+iWO9RkesL+g
sVdQJV+0ZmOtyLTvw6xpRtXXcpMaJvUksmtiHvySBZot9waX03X/h7TFEmtx+P3E
k0znywn9Ebu5d7X8vBwwDqq9f7Xe7pzo7zALkMeC1qXULIzCzYmTuvSTm870Cz9S
JPh0ojuYJvlvoKNkjYfRKRyE28VLZe/hEwtVWL+kgQD8zR8gPlQNpAcBRJ++Ett+
fwjffdAQ/rfb43W2T5AhoSK1173pu0AWSVEH9h1h604HDvf3iCCw0Wew3JsnJYCS
+tcj/6NetJG64aHe
=beJ8
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.16-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
"This is just one bugfix for a buffer overflow in knfsd's xdr decoding"
* tag 'nfsd-5.16-1' of git://linux-nfs.org/~bfields/linux:
NFSD: Fix exposure in nfsd4_decode_bitmap()
Instead of setting a bit in the fs_flags to set a bit in the
address_space, set the bit in the address_space directly.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
This patch will surround the AF_INET6 case in sk_error_report() of dlm
with a #if IS_ENABLED(CONFIG_IPV6). The field sk->sk_v6_daddr is not
defined when CONFIG_IPV6 is disabled. If CONFIG_IPV6 is disabled, the
socket creation with AF_INET6 should already fail because a runtime
check if AF_INET6 is registered. However if there is the possibility
that AF_INET6 is set as sk_family the sk_error_report() callback will
print then an invalid family type error.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 4c3d90570b ("fs: dlm: don't call kernel_getpeername() in error_report()")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
When calling setattr_prepare() to determine the validity of the attributes the
ia_{g,u}id fields contain the value that will be written to inode->i_{g,u}id.
When the {g,u}id attribute of the file isn't altered and the caller's fs{g,u}id
matches the current {g,u}id attribute the attribute change is allowed.
The value in ia_{g,u}id does already account for idmapped mounts and will have
taken the relevant idmapping into account. So in order to verify that the
{g,u}id attribute isn't changed we simple need to compare the ia_{g,u}id value
against the inode's i_{g,u}id value.
This only has any meaning for idmapped mounts as idmapping helpers are
idempotent without them. And for idmapped mounts this really only has a meaning
when circular idmappings are used, i.e. mappings where e.g. id 1000 is mapped
to id 1001 and id 1001 is mapped to id 1000. Such ciruclar mappings can e.g. be
useful when sharing the same home directory between multiple users at the same
time.
As an example consider a directory with two files: /source/file1 owned by
{g,u}id 1000 and /source/file2 owned by {g,u}id 1001. Assume we create an
idmapped mount at /target with an idmapping that maps files owned by {g,u}id
1000 to being owned by {g,u}id 1001 and files owned by {g,u}id 1001 to being
owned by {g,u}id 1000. In effect, the idmapped mount at /target switches the
ownership of /source/file1 and source/file2, i.e. /target/file1 will be owned
by {g,u}id 1001 and /target/file2 will be owned by {g,u}id 1000.
This means that a user with fs{g,u}id 1000 must be allowed to setattr
/target/file2 from {g,u}id 1000 to {g,u}id 1000. Similar, a user with fs{g,u}id
1001 must be allowed to setattr /target/file1 from {g,u}id 1001 to {g,u}id
1001. Conversely, a user with fs{g,u}id 1000 must fail to setattr /target/file1
from {g,u}id 1001 to {g,u}id 1000. And a user with fs{g,u}id 1001 must fail to
setattr /target/file2 from {g,u}id 1000 to {g,u}id 1000. Both cases must fail
with EPERM for non-capable callers.
Before this patch we could end up denying legitimate attribute changes and
allowing invalid attribute changes when circular mappings are used. To even get
into this situation the caller must've been privileged both to create that
mapping and to create that idmapped mount.
This hasn't been seen in the wild anywhere but came up when expanding the
testsuite during work on a series of hardening patches. All idmapped fstests
pass without any regressions and we add new tests to verify the behavior of
circular mappings.
Link: https://lore.kernel.org/r/20211109145713.1868404-1-brauner@kernel.org
Fixes: 2f221d6f7b ("attr: handle idmapped mounts")
Cc: Seth Forshee <seth.forshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Fix comment referring to function "io_uring_del_task_file()", now called
"io_uring_del_tctx_node()".
Fixes: eef51daa72 ("io_uring: rename function *task_file")
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
Link: https://lore.kernel.org/r/20211116175530.31608-1-kamal@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit d07f3b081e.
pstore-blk was fixed to avoid the unwanted APIs in commit 7bb9557b48
("pstore/blk: Use the normal block device I/O path"), which landed in
the same release as the commit adding BROKEN.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20211116181559.3975566-1-keescook@chromium.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use new cifs_ses_mark_for_reconnect() helper to mark all session
channels for reconnect instead of duplicating it in different places.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
Updates to the srv_count field are protected elsewhere
with the cifs_tcp_ses_lock spinlock. Add one missing place
(cifs_get_tcp_sesion).
CC: Shyam Prasad N <sprasad@microsoft.com>
Addresses-Coverity: 1494149 ("Data Race Condition")
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
It is better to print debug messages outside of the chan_lock
spinlock where possible.
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Addresses-Coverity: 1493854 ("Thread deadlock")
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
The v2 balance ioctl has been introduced more than 9 years ago. Users of
the old v1 ioctl should have long been migrated to it. It's time we
deprecate it and eventually remove it.
The only known user is in btrfs-progs that tries v1 as a fallback in
case v2 is not supported. This is not necessary anymore.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bitfields have_csum and io_error are currently signed which is not
recommended as the representation is an implementation defined
behaviour. Fix this by making the bit-fields unsigned ints.
Fixes: 2c36395430 ("btrfs: scrub: remove the anonymous structure from scrub_page")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a disk has write caching disabled, we skip submission of a bio with
flush and sync requests before writing the superblock, since it's not
needed. However when the integrity checker is enabled, this results in
reports that there are metadata blocks referred by a superblock that
were not properly flushed. So don't skip the bio submission only when
the integrity checker is enabled for the sake of simplicity, since this
is a debug tool and not meant for use in non-debug builds.
fstests/btrfs/220 trigger a check-integrity warning like the following
when CONFIG_BTRFS_FS_CHECK_INTEGRITY=y and the disk with WCE=0.
btrfs: attempt to write superblock which references block M @5242880 (sdb2/5242880/0) which is not flushed out of disk's write cache (block flush_gen=1, dev->flush_gen=0)!
------------[ cut here ]------------
WARNING: CPU: 28 PID: 843680 at fs/btrfs/check-integrity.c:2196 btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
CPU: 28 PID: 843680 Comm: umount Not tainted 5.15.0-0.rc5.39.el8.x86_64 #1
Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019
RIP: 0010:btrfsic_process_written_superblock+0x22a/0x2a0 [btrfs]
RSP: 0018:ffffb642afb47940 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
RDX: 00000000ffffffff RSI: ffff8b722fc97d00 RDI: ffff8b722fc97d00
RBP: ffff8b5601c00000 R08: 0000000000000000 R09: c0000000ffff7fff
R10: 0000000000000001 R11: ffffb642afb476f8 R12: ffffffffffffffff
R13: ffffb642afb47974 R14: ffff8b5499254c00 R15: 0000000000000003
FS: 00007f00a06d4080(0000) GS:ffff8b722fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fff5cff5ff0 CR3: 00000001c0c2a006 CR4: 00000000001706e0
Call Trace:
btrfsic_process_written_block+0x2f7/0x850 [btrfs]
__btrfsic_submit_bio.part.19+0x310/0x330 [btrfs]
? bio_associate_blkg_from_css+0xa4/0x2c0
btrfsic_submit_bio+0x18/0x30 [btrfs]
write_dev_supers+0x81/0x2a0 [btrfs]
? find_get_pages_range_tag+0x219/0x280
? pagevec_lookup_range_tag+0x24/0x30
? __filemap_fdatawait_range+0x6d/0xf0
? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
? find_first_extent_bit+0x9b/0x160 [btrfs]
? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
write_all_supers+0x1b3/0xa70 [btrfs]
? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
btrfs_commit_transaction+0x59d/0xac0 [btrfs]
close_ctree+0x11d/0x339 [btrfs]
generic_shutdown_super+0x71/0x110
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0xb8/0x140
task_work_run+0x6d/0xb0
exit_to_user_mode_prepare+0x1f0/0x200
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x46/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f009f711dfb
RSP: 002b:00007fff5cff7928 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 000055b68c6c9970 RCX: 00007f009f711dfb
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055b68c6c9b50
RBP: 0000000000000000 R08: 000055b68c6ca900 R09: 00007f009f795580
R10: 0000000000000000 R11: 0000000000000246 R12: 000055b68c6c9b50
R13: 00007f00a04bf184 R14: 0000000000000000 R15: 00000000ffffffff
---[ end trace 2c4b82abcef9eec4 ]---
S-65536(sdb2/65536/1)
-->
M-1064960(sdb2/1064960/1)
Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Ordered work functions aren't guaranteed to be handled by the same thread
which executed the normal work functions. The only way execution between
normal/ordered functions is synchronized is via the WORK_DONE_BIT,
unfortunately the used bitops don't guarantee any ordering whatsoever.
This manifested as seemingly inexplicable crashes on ARM64, where
async_chunk::inode is seen as non-null in async_cow_submit which causes
submit_compressed_extents to be called and crash occurs because
async_chunk::inode suddenly became NULL. The call trace was similar to:
pc : submit_compressed_extents+0x38/0x3d0
lr : async_cow_submit+0x50/0xd0
sp : ffff800015d4bc20
<registers omitted for brevity>
Call trace:
submit_compressed_extents+0x38/0x3d0
async_cow_submit+0x50/0xd0
run_ordered_work+0xc8/0x280
btrfs_work_helper+0x98/0x250
process_one_work+0x1f0/0x4ac
worker_thread+0x188/0x504
kthread+0x110/0x114
ret_from_fork+0x10/0x18
Fix this by adding respective barrier calls which ensure that all
accesses preceding setting of WORK_DONE_BIT are strictly ordered before
setting the flag. At the same time add a read barrier after reading of
WORK_DONE_BIT in run_ordered_work which ensures all subsequent loads
would be strictly ordered after reading the bit. This in turn ensures
are all accesses before WORK_DONE_BIT are going to be strictly ordered
before any access that can occur in ordered_func.
Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 08a9ff3264 ("btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue")
CC: stable@vger.kernel.org # 4.4+
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2011928
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following script can cause btrfs to crash:
$ mount -o compress-force=lzo $DEV /mnt
$ dd if=/dev/urandom of=/mnt/foo bs=4k count=1
$ sync
The call trace looks like this:
general protection fault, probably for non-canonical address 0xe04b37fccce3b000: 0000 [#1] PREEMPT SMP NOPTI
CPU: 5 PID: 164 Comm: kworker/u20:3 Not tainted 5.15.0-rc7-custom+ #4
Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
RIP: 0010:__memcpy+0x12/0x20
Call Trace:
lzo_compress_pages+0x236/0x540 [btrfs]
btrfs_compress_pages+0xaa/0xf0 [btrfs]
compress_file_range+0x431/0x8e0 [btrfs]
async_cow_start+0x12/0x30 [btrfs]
btrfs_work_helper+0xf6/0x3e0 [btrfs]
process_one_work+0x294/0x5d0
worker_thread+0x55/0x3c0
kthread+0x140/0x170
ret_from_fork+0x22/0x30
---[ end trace 63c3c0f131e61982 ]---
[CAUSE]
In lzo_compress_pages(), parameter @out_pages is not only an output
parameter (for the number of compressed pages), but also an input
parameter, as the upper limit of compressed pages we can utilize.
In commit d4088803f5 ("btrfs: subpage: make lzo_compress_pages()
compatible"), the refactoring doesn't take @out_pages as an input, thus
completely ignoring the limit.
And for compress-force case, we could hit incompressible data that
compressed size would go beyond the page limit, and cause the above
crash.
[FIX]
Save @out_pages as @max_nr_page, and pass it to lzo_compress_pages(),
and check if we're beyond the limit before accessing the pages.
Note: this also fixes crash on 32bit architectures that was suspected to
be caused by merge of btrfs patches to 5.16-rc1. Reported in
https://lore.kernel.org/all/20211104115001.GU20319@twin.jikos.cz/ .
Reported-by: Omar Sandoval <osandov@fb.com>
Fixes: d4088803f5 ("btrfs: subpage: make lzo_compress_pages() compatible")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
rtm@csail.mit.edu reports:
> nfsd4_decode_bitmap4() will write beyond bmval[bmlen-1] if the RPC
> directs it to do so. This can cause nfsd4_decode_state_protect4_a()
> to write client-supplied data beyond the end of
> nfsd4_exchange_id.spo_must_allow[] when called by
> nfsd4_decode_exchange_id().
Rewrite the loops so nfsd4_decode_bitmap() cannot iterate beyond
@bmlen.
Reported by: rtm@csail.mit.edu
Fixes: d1c263a031 ("NFSD: Replace READ* macros in nfsd4_decode_fattr()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch will replace the use of socket sk_callback_lock lock and uses
socket lock instead. Some users like sunrpc, see commit ea9afca88b
("SUNRPC: Replace use of socket sk_callback_lock with sock_lock") moving
from sk_callback_lock to sock_lock which seems to be held when the socket
callbacks are called.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
In some cases kernel_getpeername() will held the socket lock which is
already held when the socket layer calls error_report() callback. Since
commit 9dfc685e02 ("inet: remove races in inet{6}_getname()") this
problem becomes more likely because the socket lock will be held always.
You will see something like:
bob9-u5 login: [ 562.316860] BUG: spinlock recursion on CPU#7, swapper/7/0
[ 562.318562] lock: 0xffff8f2284720088, .magic: dead4ead, .owner: swapper/7/0, .owner_cpu: 7
[ 562.319522] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.15.0+ #135
[ 562.320346] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[ 562.321277] Call Trace:
[ 562.321529] <IRQ>
[ 562.321734] dump_stack_lvl+0x33/0x42
[ 562.322282] do_raw_spin_lock+0x8b/0xc0
[ 562.322674] lock_sock_nested+0x1e/0x50
[ 562.323057] inet_getname+0x39/0x110
[ 562.323425] ? sock_def_readable+0x80/0x80
[ 562.323838] lowcomms_error_report+0x63/0x260 [dlm]
[ 562.324338] ? wait_for_completion_interruptible_timeout+0xd2/0x120
[ 562.324949] ? lock_timer_base+0x67/0x80
[ 562.325330] ? do_raw_spin_unlock+0x49/0xc0
[ 562.325735] ? _raw_spin_unlock_irqrestore+0x1e/0x40
[ 562.326218] ? del_timer+0x54/0x80
[ 562.326549] sk_error_report+0x12/0x70
[ 562.326919] tcp_validate_incoming+0x3c8/0x530
[ 562.327347] ? kvm_clock_read+0x14/0x30
[ 562.327718] ? ktime_get+0x3b/0xa0
[ 562.328055] tcp_rcv_established+0x121/0x660
[ 562.328466] tcp_v4_do_rcv+0x132/0x260
[ 562.328835] tcp_v4_rcv+0xcea/0xe20
[ 562.329173] ip_protocol_deliver_rcu+0x35/0x1f0
[ 562.329615] ip_local_deliver_finish+0x54/0x60
[ 562.330050] ip_local_deliver+0xf7/0x110
[ 562.330431] ? inet_rtm_getroute+0x211/0x840
[ 562.330848] ? ip_protocol_deliver_rcu+0x1f0/0x1f0
[ 562.331310] ip_rcv+0xe1/0xf0
[ 562.331603] ? ip_local_deliver+0x110/0x110
[ 562.332011] __netif_receive_skb_core+0x46a/0x1040
[ 562.332476] ? inet_gro_receive+0x263/0x2e0
[ 562.332885] __netif_receive_skb_list_core+0x13b/0x2c0
[ 562.333383] netif_receive_skb_list_internal+0x1c8/0x2f0
[ 562.333896] ? update_load_avg+0x7e/0x5e0
[ 562.334285] gro_normal_list.part.149+0x19/0x40
[ 562.334722] napi_complete_done+0x67/0x160
[ 562.335134] virtnet_poll+0x2ad/0x408 [virtio_net]
[ 562.335644] __napi_poll+0x28/0x140
[ 562.336012] net_rx_action+0x23d/0x300
[ 562.336414] __do_softirq+0xf2/0x2ea
[ 562.336803] irq_exit_rcu+0xc1/0xf0
[ 562.337173] common_interrupt+0xb9/0xd0
It is and was always forbidden to call kernel_getpeername() in context
of error_report(). To get rid of the problem we access the destination
address for the peer over the socket structure. While on it we fix to
print out the destination port of the inet socket.
Fixes: 1a31833d08 ("DLM: Replace nodeid_to_addr with kernel_getpeername")
Reported-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
* Clean up open-coded swap() calls.
* A little bit of #ifdef golf to complete the reunification of the
kernel and userspace libxfs source code.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmGNT1IACgkQ+H93GTRK
tOucKA//Qk2NX3QBm/8pCrFE5V+eqooPANhZmzeviJCN/6++jcNOy0f+YK6JXVRC
U2WdotHFH5fF6lsDkzNtMPHZ8JMZmOfEiPx5CGFiWT5iUW7FbLkROHm7GFtbwMoH
qm3Lt7PbdSzJqTuOTvaGCw1xWkjDXMLdsdFM7mx3JO5zT9a/fCqjjmyR2Kl0qcSP
RzfruVe20wUka2BeaXfZzSasgfLswratkU4xsiNiwA37yQaldzhrg8fg6uP3OSYi
dkWFXi6WdWwQzARnjWNPwigUwA3xVaYgV+I6+ME0DYsUBvywZzUg3pkowhRAHyA9
kv86L5Zt5K7kQcVqyd+lIvIuAcGrOZ9hA18PIXnwahLBqmjcqAJoF9XhTTZDMD4J
LfujGMrf7DSDcf0vH8G9wlQQthsPGUOoFia5rr8MhdVVNee/b1Qvwsh7kmyg0DOK
9WuNQxGPd7s+X+kwdmGrK7E6fqyPwEfC43l8wtCiBIyGz6QcorwD7kH9DGzv5xGF
NX7WQeKvcaoXn1XVfonb0YgdVOnbyqK4AiY3Po1Ood3IxGyiLGCgDnusvYu+C9/r
T0rRMbljkX1lUKqfzGkg2egOKPR+8RFgFKrKNSXUkDxl8TFLRd3ZObowPPlohq1I
9lIIirip5UFYRv+7srDU1oZPWkvwkpJmaMFgagD3w+OWdo6zwao=
=wsFu
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs cleanups from Darrick Wong:
"The most 'exciting' aspect of this branch is that the xfsprogs
maintainer and I have worked through the last of the code
discrepancies between kernel and userspace libxfs such that there are
no code differences between the two except for #includes.
IOWs, diff suffices to demonstrate that the userspace tools behave the
same as the kernel, and kernel-only bits are clearly marked in the
/kernel/ source code instead of just the userspace source.
Summary:
- Clean up open-coded swap() calls.
- A little bit of #ifdef golf to complete the reunification of the
kernel and userspace libxfs source code"
* tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: sync xfs_btree_split macros with userspace libxfs
xfs: #ifdef out perag code for userspace
xfs: use swap() to make dabtree code cleaner