In function sweep_bh_for_rgrps, which is a helper for punch_hole,
it uses variable buf_in_tr to keep track of when it needs to commit
pending block frees on a partial delete that overflows the
transaction created for the delete. The problem is that the
variable was initialized at the start of function sweep_bh_for_rgrps
but it was never cleared, even when starting a new transaction.
This patch reinitializes the variable when the transaction is
ended, so the next transaction starts out with it cleared.
Fixes: d552a2b9b3 ("GFS2: Non-recursive delete")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
On filesystems with a block size smaller than PAGE_SIZE, page_mkwrite is
called for each memory-mapped page before that page can be written to.
When such a memory-mapped file is truncated down to size x which is not
a multiple of the page size and then back to a larger size, the page
straddling size x can end up with a partial block mapping. In that
case, make sure to mark that page read-only so that page_mkwrite will be
called before the page can be written to the next time.
(There is no point in marking the page straddling size x read-only when
truncating down as writing to memory beyond the end of the file will
result in SIGBUS instead of growing the file.)
Fixes xfstests generic/029, generic/030 on filesystems with a block size
smaller than PAGE_SIZE.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
reverse the roles of which directories are "old" and which are "new" for
the purposes of rename. This can cause deadlocks where two nodes end up
waiting for each other.
There can be several layers of directory dependencies across many nodes.
This patch fixes the problem by acquiring all gfs2_rename's inode glocks
asychronously and waiting for all glocks to be acquired. That way all
inodes are locked regardless of the order.
The timeout value for multiple asynchronous glocks is calculated to be
the total of the individual wait times for each glock times two.
Since gfs2_exchange is very similar to gfs2_rename, both functions are
patched in the same way.
A new async glock wait queue, sd_async_glock_wait, keeps a list of
waiters for these events. If gfs2's holder_wake function detects an
async holder, it wakes up any waiters for the event. The waiter only
tests whether any of its requests are still pending.
Since the glocks are sent to dlm asychronously, the wait function needs
to check to see which glocks, if any, were granted.
If a glock is granted by dlm (and therefore held), its minimum hold time
is checked and adjusted as necessary, as other glock grants do.
If the event times out, all glocks held thus far must be dequeued to
resolve any existing deadlocks. Then, if there are any outstanding
locking requests, we need to loop around and wait for dlm to respond to
those requests too. After we release all requests, we return -ESTALE to
the caller (vfs rename) which loops around and retries the request.
Node1 Node2
--------- ---------
1. Enqueue A Enqueue B
2. Enqueue B Enqueue A
3. A granted
6. B granted
7. Wait for B
8. Wait for A
9. A times out (since Node 1 holds A)
10. Dequeue B (since it was granted)
11. Wait for all requests from DLM
12. B Granted (since Node2 released it in step 10)
13. Rename
14. Dequeue A
15. DLM Grants A
16. Dequeue A (due to the timeout and since we
no longer have B held for our task).
17. Dequeue B
18. Return -ESTALE to vfs
19. VFS retries the operation, goto step 1.
This release-all-locks / acquire-all-locks may slow rename / exchange
down as both nodes struggle in the same way and do the same thing.
However, this will only happen when there is contention for the same
inodes, which ought to be rare.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch moves the code that updates glock minimum hold
time to a separate function. This will be called by a future
patch.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, gfs2_rename added a holder for the rgrp glock to
its array of holders, ghs. There's nothing wrong with that, but this
patch separates it into a separate holder. This is done to ensure
it's always locked last as per the proper glock lock ordering,
and also to pave the way for a future patch in which we will
lock the non-rgrp glocks asynchronously.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The brelse() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
[The same applies to brelse() in gfs2_dir_no_add (which Coccinelle
apparently missed), so fix that as well.]
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Get rid of the assumption that the number of slots can at most increase by
RECOVER_SIZE_INC (16) in set_recover_size.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch fixes three places in which temporary character buffers
could overflow due to the addition of the file system id from patch
3792ce973f. Thanks to Dan Carpenter for pointing it out.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_drevalidate was a horrific tangle of
unreadable labels, cases and goto statements. This patch tries to
simplify the logic and make it more readable.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When allocating space with fallocate, always update the file timestamps
and mark the inode dirty, no matter if the FALLOC_FL_KEEP_SIZE flag is
set or not. The inode needs to be marked dirty so that a subsequent
fsync will pick it up and any new allocations will make it to disk.
Filesystems like xfs and ext4 always update the timestamps, so make
gfs2 behave the same way.
Fixes xfstest generic/483.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_alloc_inode, when kmem_cache_alloc cannot allocate a new object, return
NULL immediately. The code currently relies on the fact that i_inode is the
first member in struct gfs2_inode and so ip and &ip->i_inode evaluate to the
same address, but that isn't immediately obvious.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
iomap handles all the nitty-gritty details of zeroing a file
range for us, so use the proper helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Add support for the IOMAP_ZERO iomap operation so that iomap_zero_range will
work as expected. In the IOMAP_ZERO case, the caller of iomap_zero_range is
responsible for taking an exclusive glock on the inode, so we need no
additional locking in gfs2_iomap_begin.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Following commit d0a22a4b03 ("gfs2: Fix iomap write page reclaim deadlock"),
gfs2_iomap_begin and gfs2_iomap_begin_write can be further cleaned up and the
split between those two functions can be improved.
With suggestions from Christoph Hellwig.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
It turns out that the current version of gfs2_metadata_walker suffers
from multiple problems that can cause gfs2_hole_size to report an
incorrect size. This will confuse fiemap as well as lseek with the
SEEK_DATA flag.
Fix that by changing gfs2_hole_walker to compute the metapath to the
first data block after the hole (if any), and compute the hole size
based on that.
Fixes xfstest generic/490.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
Cc: stable@vger.kernel.org # v4.18+
Pull xfs fixes from Darrick Wong:
- Avoid leaking kernel stack contents to userspace
- Fix a potential null pointer dereference in the dabtree scrub code
* tag 'xfs-5.3-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix possible null-pointer dereferences in xchk_da_btree_block_check_sibling()
xfs: fix stack contents leakage in the v1 inumber ioctls
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
drivers/acpi/scan.c: document why we don't need the device_hotplug_lock
memremap: move from kernel/ to mm/
lib/test_meminit.c: use GFP_ATOMIC in RCU critical section
asm-generic: fix -Wtype-limits compiler warnings
cgroup: kselftest: relax fs_spec checks
mm/memory_hotplug.c: remove unneeded return for void function
mm/migrate.c: initialize pud_entry in migrate_vma()
coredump: split pipe command whitespace before expanding template
page flags: prioritize kasan bits over last-cpuid
ubsan: build ubsan.c more conservatively
kasan: remove clang version check for KASAN_STACK
mm: compaction: avoid 100% CPU usage during compaction when a task is killed
mm: migrate: fix reference check race between __find_get_block() and migration
mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinker
ocfs2: remove set but not used variable 'last_hash'
Revert "kmemleak: allow to coexist with fault injection"
kernel/signal.c: fix a kernel-doc markup
Save the offsets of the start of each argument to avoid having to update
pointers to each argument after every corename krealloc and to avoid
having to duplicate the memory for the dump command.
Executable names containing spaces were previously being expanded from
%e or %E and then split in the middle of the filename. This is
incorrect behaviour since an argument list can represent arguments with
spaces.
The splitting could lead to extra arguments being passed to the core
dump handler that it might have interpreted as options or ignored
completely.
Core dump handlers that are not aware of this Linux kernel issue will be
using %e or %E without considering that it may be split and so they will
be vulnerable to processes with spaces in their names breaking their
argument list. If their internals are otherwise well written, such as
if they are written in shell but quote arguments, they will work better
after this change than before. If they are not well written, then there
is a slight chance of breakage depending on the details of the code but
they will already be fairly broken by the split filenames.
Core dump handlers that are aware of this Linux kernel issue will be
placing %e or %E as the last item in their core_pattern and then
aggregating all of the remaining arguments into one, separated by
spaces. Alternatively they will be obtaining the filename via other
methods. Both of these will be compatible with the new arrangement.
A side effect from this change is that unknown template types (for
example %z) result in an empty argument to the dump handler instead of
the argument being dropped. This is a desired change as:
It is easier for dump handlers to process empty arguments than dropped
ones, especially if they are written in shell or don't pass each
template item with a preceding command-line option in order to
differentiate between individual template types. Most core_patterns in
the wild do not use options so they can confuse different template types
(especially numeric ones) if an earlier one gets dropped in old kernels.
If the kernel introduces a new template type and a core_pattern uses it,
the core dump handler might not expect that the argument can be dropped
in old kernels.
For example, this can result in security issues when %d is dropped in
old kernels. This happened with the corekeeper package in Debian and
resulted in the interface between corekeeper and Linux having to be
rewritten to use command-line options to differentiate between template
types.
The core_pattern for most core dump handlers is written by the handler
author who would generally not insert unknown template types so this
change should be compatible with all the core dump handlers that exist.
Link: http://lkml.kernel.org/r/20190528051142.24939-1-pabs3@bonedaddy.net
Fixes: 74aadce986 ("core_pattern: allow passing of arguments to user mode helper when core_pattern is a pipe")
Signed-off-by: Paul Wise <pabs3@bonedaddy.net>
Reported-by: Jakub Wilk <jwilk@jwilk.net> [https://bugs.debian.org/924398]
Reported-by: Paul Wise <pabs3@bonedaddy.net> [https://lore.kernel.org/linux-fsdevel/c8b7ecb8508895bf4adb62a748e2ea2c71854597.camel@bonedaddy.net/]
Suggested-by: Jakub Wilk <jwilk@jwilk.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes gcc '-Wunused-but-set-variable' warning:
fs/ocfs2/xattr.c: In function ocfs2_xattr_bucket_find:
fs/ocfs2/xattr.c:3828:6: warning: variable last_hash set but not used [-Wunused-but-set-variable]
It's never used and can be removed.
Link: http://lkml.kernel.org/r/20190716132110.34836-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1ERCMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpjr7D/0U8SMu1T9JOge91zXQQUc7XtCX9RvHYhhj
vbwwN9RwpIfrTwuLZUCvt2vEz8WPOVfZbwYGkfFcdI+N5I/dOfT8Swiwy7Zabpi2
KTedn2EdELTizEuWQ3QhaBHWuTGvE04aAzZTBRCQ0tCOYTPpXGRavxhG6UHcQi+z
lohB5Pr/cyX8/jWJj4kq7381QYUUH2bm9uY7qutBsQOt2CsN5prjWxX3JM6EO1wb
VyyI25fWLaS+bZW+crVutcARxccuav4e+LEJbb9Z7+19vjmkc2qE+22F3MBxYCzo
tOjU0RP0IvvVR9t0Hahw/3MnDTDfuSqlqrT12zNtn7FrzOKpkygMyRa+u8YygI6k
2iAp92HkNWpjBxUFNGoRCRfJpApG3vT6/VkI8tixFSw/Re3F1H9Bc9IRZxc3uU4H
5DMRmjZXGg+8Nw+93XzwWnD1paCJcDsHRHUpWFNJvRfJYQzDaziPUBV9a9TZ+HMF
BnCJBCW641tcA5yCRwBF6OpoowtmxOtWce7Lr9wAjU+cYHMEzOQoG+J6gPH3q8Jh
aD9U2FcnE6kReL+MsGj42q1U1n60xngcdzo8Ca4bWfWNpqb4lJatjumkDAiI6U4q
DFDs9bRbB4LLgwkRQ+n1biwAK626KJOp5lGXrEu7XHXSTlO/BiJytISwASjlzKsZ
4uGHc/uUdA==
=P5E/
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190802' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Here's a small collection of fixes that should go into this series.
This contains:
- io_uring potential use-after-free fix (Jackie)
- loop regression fix (Jan)
- O_DIRECT fragmented bio regression fix (Damien)
- Mark Denis as the new floppy maintainer (Denis)
- ataflop switch fall-through annotation (Gustavo)
- libata zpodd overflow fix (Kees)
- libata ahci deferred probe fix (Miquel)
- nbd invalidation BUG_ON() fix (Munehisa)
- dasd endless loop fix (Stefan)"
* tag 'for-linus-20190802' of git://git.kernel.dk/linux-block:
s390/dasd: fix endless loop after read unit address configuration
block: Fix __blkdev_direct_IO() for bio fragments
MAINTAINERS: floppy: take over maintainership
nbd: replace kill_bdev() with __invalidate_device() again
ata: libahci: do not complain in case of deferred probe
io_uring: fix KASAN use after free in io_sq_wq_submit_work
loop: Fix mount(2) failure due to race with LOOP_SET_FD
libata: zpodd: Fix small read overflow in zpodd_get_mech_type()
ataflop: Mark expected switch fall-through
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl1ETYkACgkQxWXV+ddt
WDupLA/+LXPJvxRG/Ob655sxOAqpG4+XUW7aaI5jSnlr958TtDlNFkpzrQ28NFzk
XlEe/HjgI6CH7DhNwe1IlHLqFt874fiCeIZ2jvuSVVbPAO+9T8a5BIbeUY6V4h5v
QUmbaUpVshsh2IDArU8Vc/UjycCuwrccfGkS5ydTVI6Eni4BWsOY2PMiB5ojdOTE
bpclws3ca/CMbNpGKnn3cH5aXJpENTXedXxBbp99DproZVY3YL1ZyDxc37R6+5Ua
kbZznSMZFoLjHt3vktHmiozF7W1Zdi6ExOp9NHApshlw/n08ICkY1oNfd9KSYgQs
8gTl9mZ8jbJnSfZTuHNL6LwfSZjDyT5LCTbXSj0UnaVOYwJVwopsDrrdO6AESd8S
1eFeEFQ62lEV+31xa0nsYBz5j6ejEfcvxiq14W8LawWpkxtAIsEyXockWtmXPods
29BYkstQADlCOu2WYdQ8ugseBtfEpEAvD+Rf+qqyZ6XlHMIjL1J1S+BIXQ868z1f
KZCVsepJ0dVm0siVNMFBUaM2z1PcHusxNgYd37YcJRX2t4Gql9Ku/00PJ+tRhsBa
80a5ue9p6nuZfR6XoM6Cpe3uG8qyFyPccC3ohwKEOAnHnmk2LEJYASudV4pk/P8G
Vy3o9Uv/NMMznop24r7UB3mklBQCIchKpaij3RT0Jx+3RGdR4Yo=
=jr4T
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- tiny race window during 2 transactions aborting at the same time can
accidentally lead to a commit
- regression fix, possible deadlock during fiemap
- fix for an old bug when incremental send can fail on a file that has
been deduplicated in a special way
* tag 'for-5.3-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix deadlock between fiemap and transaction commits
Btrfs: fix race leading to fs corruption after transaction abort
Btrfs: fix incremental send failure after deduplication
The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
(patch 6a43074e2f) introduced two problems with BIO fragment handling
for direct IOs:
1) The dio size processed is calculated by incrementing the ret variable
by the size of the bio fragment issued for the dio. However, this size
is obtained directly from bio->bi_iter.bi_size AFTER the bio submission
which may result in referencing the bi_size value after the bio
completed, resulting in an incorrect value use.
2) The ret variable is not incremented by the size of the last bio
fragment issued for the bio, leading to an invalid IO size being
returned to the user.
Fix both problem by using dio->size (which is incremented before the bio
submission) to update the value of ret after bio submissions, including
for the last bio fragment issued.
Fixes: 6a43074e2f ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the recent iomap write page reclaim deadlock fix, it turns out that the
GLF_DIRTY flag isn't always set when it needs to be anymore: previously, this
happened as a side effect of always adding the inode buffer head to the current
transaction with gfs2_trans_add_meta, but this isn't happening consistently
anymore. Fix by removing an additional unnecessary gfs2_trans_add_meta call
and by setting the GLF_DIRTY flag in gfs2_iomap_end.
(The GLF_DIRTY flag causes inode_go_sync to flush the transaction log when
syncing out the glock of that inode. When the flag isn't set, inode_go_sync
will skip inodes, including ones with an i_state of I_DIRTY_PAGES, which will
lead to cluster incoherency.)
In addition, in gfs2_iomap_page_done, if the metadata has changed, mark the
inode as I_DIRTY_DATASYNC to have the inode added to the current transaction:
we don't expect metadata to change here, but let's err on the safe side.
Fixes: d0a22a4b03 ("gfs2: Fix iomap write page reclaim deadlock");
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In "consolidate the capability checks in sget_{fc,userns}())" the
wrong argument had been passed to mount_capable() by sget_fc().
That mistake had been further obscured later, when switching
mount_capable() to fs_context has moved the calculation of
bogus argument from sget_fc() to mount_capable() itself. It
should've been fc->user_ns all along.
Screwed-up-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Christian Brauner <christian@brauner.io>
Tested-by: Christian Brauner <christian@brauner.io>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull dax fix from Dan Williams:
"Fix a botched manual patch update that got dropped between testing and
application"
* 'dax-fix-5.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix missed wakeup in put_unlocked_entry()
This set of patches adjust to follow recent setflags changes and fix two
regression introduced since 5.4-rc1.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl1AgaYACgkQQBSofoJI
UNJzPA/+NXLuNcjdP2HdiTj2fZc5saFDZkMKmjd6559rsQI9799qGIqZbhoynV15
sWNToo04LL1nNSTj/+/Kemwx3jnKUst4HCeUodb88Fey/phzbX4Di59aOv9cb5nT
g1Fozf7pt4TI67/zOxrP32s5hAUe1c7kkdp4XC8lK4Sv7KqXxvw6dz7QOWCaWd/z
DfEww1miNL0zRKSHUxj+74nLU6Q3Ot9/lEg4Af+8hJfURuCtkqck3yUXuyYnPmA0
4E15IeayqUEwqLHsGb3ysWsTQz1gRPRJncMz9FHsaW2pPIGiQtvJMzlhmo8vdrnP
8vRXGL9U/kuzod/MR0nOl16Wmq4u75IbK4FGEHF032An0bzYspwej5GDWbD7fiIW
gjGF6MoSmtYdngSxED2Cf81MM5oCKGMMxkMc9C+NFIMR9UNtagTMjzLtEICP/Q0h
7u6A5vqbND7Oh8O2b+Xk4kvVtJ2Czu0Jq7p6kdfCPTXV3oJR9Wq7fdXyGBtm6AWx
sjbt6d0RvuR8G+4jPI8Wa8gZ9cPJIQW+Iw917Q/mDqCfK7mXarrpy8tsNzNduLzl
80XjIA1BknNPFOHKPndSAhkXrpb269nPVfdtPTv0GAris3DANG71tKMp9pijFOo0
rn6SKzVVmYo41FWgYpQc2b5V8uQZXK/LMY4TF6bgRHAfmPPgOxk=
=TSfD
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs fixes from Jaegeuk Kim:
"This set of patches adjust to follow recent setflags changes and fix
two regressions"
* tag 'f2fs-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: use EINVAL for superblock with invalid magic
f2fs: fix to read source block before invalidating it
f2fs: remove redundant check from f2fs_setflags_common()
f2fs: use generic checking function for FS_IOC_FSSETXATTR
f2fs: use generic checking and prep function for FS_IOC_SETFLAGS
Commit 33ec3e53e7 ("loop: Don't change loop device under exclusive
opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
while it updates loop device binding. However this can make perfectly
valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
temporarily the exclusive bdev reference in cases like this:
for i in {a..z}{a..z}; do
dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
mkfs.ext2 $i.image
mkdir mnt$i
done
echo "Run"
for i in {a..z}{a..z}; do
mount -o loop -t ext2 $i.image mnt$i &
done
Fix the problem by not getting full exclusive bdev reference in
LOOP_SET_FD but instead just mark the bdev as being claimed while we
update the binding information. This just blocks new exclusive openers
instead of failing them with EBUSY thus fixing the problem.
Fixes: 33ec3e53e7 ("loop: Don't change loop device under exclusive opener")
Cc: stable@vger.kernel.org
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In xchk_da_btree_block_check_sibling(), there is an if statement on
line 274 to check whether ds->state->altpath.blk[level].bp is NULL:
if (ds->state->altpath.blk[level].bp)
When ds->state->altpath.blk[level].bp is NULL, it is used on line 281:
xfs_trans_brelse(..., ds->state->altpath.blk[level].bp);
struct xfs_buf_log_item *bip = bp->b_log_item;
ASSERT(bp->b_transp == tp);
Thus, possible null-pointer dereferences may occur.
To fix these bugs, ds->state->altpath.blk[level].bp is checked before
being used.
These bugs are found by a static analysis tool STCheck written by us.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
The fiemap handler locks a file range that can have unflushed delalloc,
and after locking the range, it tries to attach to a running transaction.
If the running transaction started its commit, that is, it is in state
TRANS_STATE_COMMIT_START, and either the filesystem was mounted with the
flushoncommit option or the transaction is creating a snapshot for the
subvolume that contains the file that fiemap is operating on, we end up
deadlocking. This happens because fiemap is blocked on the transaction,
waiting for it to complete, and the transaction is waiting for the flushed
dealloc to complete, which requires locking the file range that the fiemap
task already locked. The following stack traces serve as an example of
when this deadlock happens:
(...)
[404571.515510] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[404571.515956] Call Trace:
[404571.516360] ? __schedule+0x3ae/0x7b0
[404571.516730] schedule+0x3a/0xb0
[404571.517104] lock_extent_bits+0x1ec/0x2a0 [btrfs]
[404571.517465] ? remove_wait_queue+0x60/0x60
[404571.517832] btrfs_finish_ordered_io+0x292/0x800 [btrfs]
[404571.518202] normal_work_helper+0xea/0x530 [btrfs]
[404571.518566] process_one_work+0x21e/0x5c0
[404571.518990] worker_thread+0x4f/0x3b0
[404571.519413] ? process_one_work+0x5c0/0x5c0
[404571.519829] kthread+0x103/0x140
[404571.520191] ? kthread_create_worker_on_cpu+0x70/0x70
[404571.520565] ret_from_fork+0x3a/0x50
[404571.520915] kworker/u8:6 D 0 31651 2 0x80004000
[404571.521290] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
(...)
[404571.537000] fsstress D 0 13117 13115 0x00004000
[404571.537263] Call Trace:
[404571.537524] ? __schedule+0x3ae/0x7b0
[404571.537788] schedule+0x3a/0xb0
[404571.538066] wait_current_trans+0xc8/0x100 [btrfs]
[404571.538349] ? remove_wait_queue+0x60/0x60
[404571.538680] start_transaction+0x33c/0x500 [btrfs]
[404571.539076] btrfs_check_shared+0xa3/0x1f0 [btrfs]
[404571.539513] ? extent_fiemap+0x2ce/0x650 [btrfs]
[404571.539866] extent_fiemap+0x2ce/0x650 [btrfs]
[404571.540170] do_vfs_ioctl+0x526/0x6f0
[404571.540436] ksys_ioctl+0x70/0x80
[404571.540734] __x64_sys_ioctl+0x16/0x20
[404571.540997] do_syscall_64+0x60/0x1d0
[404571.541279] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
[404571.543729] btrfs D 0 14210 14208 0x00004000
[404571.544023] Call Trace:
[404571.544275] ? __schedule+0x3ae/0x7b0
[404571.544526] ? wait_for_completion+0x112/0x1a0
[404571.544795] schedule+0x3a/0xb0
[404571.545064] schedule_timeout+0x1ff/0x390
[404571.545351] ? lock_acquire+0xa6/0x190
[404571.545638] ? wait_for_completion+0x49/0x1a0
[404571.545890] ? wait_for_completion+0x112/0x1a0
[404571.546228] wait_for_completion+0x131/0x1a0
[404571.546503] ? wake_up_q+0x70/0x70
[404571.546775] btrfs_wait_ordered_extents+0x27c/0x400 [btrfs]
[404571.547159] btrfs_commit_transaction+0x3b0/0xae0 [btrfs]
[404571.547449] ? btrfs_mksubvol+0x4a4/0x640 [btrfs]
[404571.547703] ? remove_wait_queue+0x60/0x60
[404571.547969] btrfs_mksubvol+0x605/0x640 [btrfs]
[404571.548226] ? __sb_start_write+0xd4/0x1c0
[404571.548512] ? mnt_want_write_file+0x24/0x50
[404571.548789] btrfs_ioctl_snap_create_transid+0x169/0x1a0 [btrfs]
[404571.549048] btrfs_ioctl_snap_create_v2+0x11d/0x170 [btrfs]
[404571.549307] btrfs_ioctl+0x133f/0x3150 [btrfs]
[404571.549549] ? mem_cgroup_charge_statistics+0x4c/0xd0
[404571.549792] ? mem_cgroup_commit_charge+0x84/0x4b0
[404571.550064] ? __handle_mm_fault+0xe3e/0x11f0
[404571.550306] ? do_raw_spin_unlock+0x49/0xc0
[404571.550608] ? _raw_spin_unlock+0x24/0x30
[404571.550976] ? __handle_mm_fault+0xedf/0x11f0
[404571.551319] ? do_vfs_ioctl+0xa2/0x6f0
[404571.551659] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[404571.552087] do_vfs_ioctl+0xa2/0x6f0
[404571.552355] ksys_ioctl+0x70/0x80
[404571.552621] __x64_sys_ioctl+0x16/0x20
[404571.552864] do_syscall_64+0x60/0x1d0
[404571.553104] entry_SYSCALL_64_after_hwframe+0x49/0xbe
(...)
If we were joining the transaction instead of attaching to it, we would
not risk a deadlock because a join only blocks if the transaction is in a
state greater then or equals to TRANS_STATE_COMMIT_DOING, and the delalloc
flush performed by a transaction is done before it reaches that state,
when it is in the state TRANS_STATE_COMMIT_START. However a transaction
join is intended for use cases where we do modify the filesystem, and
fiemap only needs to peek at delayed references from the current
transaction in order to determine if extents are shared, and, besides
that, when there is no current transaction or when it blocks to wait for
a current committing transaction to complete, it creates a new transaction
without reserving any space. Such unnecessary transactions, besides doing
unnecessary IO, can cause transaction aborts (-ENOSPC) and unnecessary
rotation of the precious backup roots.
So fix this by adding a new transaction join variant, named join_nostart,
which behaves like the regular join, but it does not create a transaction
when none currently exists or after waiting for a committing transaction
to complete.
Fixes: 03628cdbc6 ("Btrfs: do not start a transaction during fiemap")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When one transaction is finishing its commit, it is possible for another
transaction to start and enter its initial commit phase as well. If the
first ends up getting aborted, we have a small time window where the second
transaction commit does not notice that the previous transaction aborted
and ends up committing, writing a superblock that points to btrees that
reference extent buffers (nodes and leafs) that were not persisted to disk.
The consequence is that after mounting the filesystem again, we will be
unable to load some btree nodes/leafs, either because the content on disk
is either garbage (or just zeroes) or corresponds to the old content of a
previouly COWed or deleted node/leaf, resulting in the well known error
messages "parent transid verify failed on ...".
The following sequence diagram illustrates how this can happen.
CPU 1 CPU 2
<at transaction N>
btrfs_commit_transaction()
(...)
--> sets transaction state to
TRANS_STATE_UNBLOCKED
--> sets fs_info->running_transaction
to NULL
(...)
btrfs_start_transaction()
start_transaction()
wait_current_trans()
--> returns immediately
because
fs_info->running_transaction
is NULL
join_transaction()
--> creates transaction N + 1
--> sets
fs_info->running_transaction
to transaction N + 1
--> adds transaction N + 1 to
the fs_info->trans_list list
--> returns transaction handle
pointing to the new
transaction N + 1
(...)
btrfs_sync_file()
btrfs_start_transaction()
--> returns handle to
transaction N + 1
(...)
btrfs_write_and_wait_transaction()
--> writeback of some extent
buffer fails, returns an
error
btrfs_handle_fs_error()
--> sets BTRFS_FS_STATE_ERROR in
fs_info->fs_state
--> jumps to label "scrub_continue"
cleanup_transaction()
btrfs_abort_transaction(N)
--> sets BTRFS_FS_STATE_TRANS_ABORTED
flag in fs_info->fs_state
--> sets aborted field in the
transaction and transaction
handle structures, for
transaction N only
--> removes transaction from the
list fs_info->trans_list
btrfs_commit_transaction(N + 1)
--> transaction N + 1 was not
aborted, so it proceeds
(...)
--> sets the transaction's state
to TRANS_STATE_COMMIT_START
--> does not find the previous
transaction (N) in the
fs_info->trans_list, so it
doesn't know that transaction
was aborted, and the commit
of transaction N + 1 proceeds
(...)
--> sets transaction N + 1 state
to TRANS_STATE_UNBLOCKED
btrfs_write_and_wait_transaction()
--> succeeds writing all extent
buffers created in the
transaction N + 1
write_all_supers()
--> succeeds
--> we now have a superblock on
disk that points to trees
that refer to at least one
extent buffer that was
never persisted
So fix this by updating the transaction commit path to check if the flag
BTRFS_FS_STATE_TRANS_ABORTED is set on fs_info->fs_state if after setting
the transaction to the TRANS_STATE_COMMIT_START we do not find any previous
transaction in the fs_info->trans_list. If the flag is set, just fail the
transaction commit with -EROFS, as we do in other places. The exact error
code for the previous transaction abort was already logged and reported.
Fixes: 49b25e0540 ("btrfs: enhance transaction abort infrastructure")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an incremental send operation we can fail if we previously did
deduplication operations against a file that exists in both snapshots. In
that case we will fail the send operation with -EIO and print a message
to dmesg/syslog like the following:
BTRFS error (device sdc): Send: inconsistent snapshot, found updated \
extent for inode 257 without updated inode item, send root is 258, \
parent root is 257
This requires that we deduplicate to the same file in both snapshots for
the same amount of times on each snapshot. The issue happens because a
deduplication only updates the iversion of an inode and does not update
any other field of the inode, therefore if we deduplicate the file on
each snapshot for the same amount of time, the inode will have the same
iversion value (stored as the "sequence" field on the inode item) on both
snapshots, therefore it will be seen as unchanged between in the send
snapshot while there are new/updated/deleted extent items when comparing
to the parent snapshot. This makes the send operation return -EIO and
print an error message.
Example reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
# Create our first file. The first half of the file has several 64Kb
# extents while the second half as a single 512Kb extent.
$ xfs_io -f -s -c "pwrite -S 0xb8 -b 64K 0 512K" /mnt/foo
$ xfs_io -c "pwrite -S 0xb8 512K 512K" /mnt/foo
# Create the base snapshot and the parent send stream from it.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
$ btrfs send -f /tmp/1.snap /mnt/mysnap1
# Create our second file, that has exactly the same data as the first
# file.
$ xfs_io -f -c "pwrite -S 0xb8 0 1M" /mnt/bar
# Create the second snapshot, used for the incremental send, before
# doing the file deduplication.
$ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
# Now before creating the incremental send stream:
#
# 1) Deduplicate into a subrange of file foo in snapshot mysnap1. This
# will drop several extent items and add a new one, also updating
# the inode's iversion (sequence field in inode item) by 1, but not
# any other field of the inode;
#
# 2) Deduplicate into a different subrange of file foo in snapshot
# mysnap2. This will replace an extent item with a new one, also
# updating the inode's iversion by 1 but not any other field of the
# inode.
#
# After these two deduplication operations, the inode items, for file
# foo, are identical in both snapshots, but we have different extent
# items for this inode in both snapshots. We want to check this doesn't
# cause send to fail with an error or produce an incorrect stream.
$ xfs_io -r -c "dedupe /mnt/bar 0 0 512K" /mnt/mysnap1/foo
$ xfs_io -r -c "dedupe /mnt/bar 512K 512K 512K" /mnt/mysnap2/foo
# Create the incremental send stream.
$ btrfs send -p /mnt/mysnap1 -f /tmp/2.snap /mnt/mysnap2
ERROR: send ioctl failed with -5: Input/output error
This issue started happening back in 2015 when deduplication was updated
to not update the inode's ctime and mtime and update only the iversion.
Back then we would hit a BUG_ON() in send, but later in 2016 send was
updated to return -EIO and print the error message instead of doing the
BUG_ON().
A test case for fstests follows soon.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203933
Fixes: 1c919a5e13 ("btrfs: don't update mtime/ctime on deduped inodes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The condition checking whether put_unlocked_entry() needs to wake up
following waiter got broken by commit 23c84eb783 ("dax: Fix missed
wakeup with PMD faults"). We need to wake the waiter whenever the passed
entry is valid (i.e., non-NULL and not special conflict entry). This
could lead to processes never being woken up when waiting for entry
lock. Fix the condition.
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/20190729120228.GC17833@quack2.suse.cz
Fixes: 23c84eb783 ("dax: Fix missed wakeup with PMD faults")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The kernel mount_block_root() function expects -EACESS or -EINVAL for a
unmountable filesystem when trying to mount the root with different
filesystem types.
However, in 5.3-rc1 the behavior when F2FS code cannot find valid block
changed to return -EFSCORRUPTED(-EUCLEAN), and this error code makes
mount_block_root() fail when trying to probe F2FS.
When the magic number of the superblock mismatches, it has a high
probability that it's just not a F2FS. In this case return -EINVAL seems
to be a better result, and this return value can make mount_block_root()
probing work again.
Return -EINVAL when the superblock has magic mismatch, -EFSCORRUPTED in
other cases (the magic matches but the superblock cannot be recognized).
Fixes: 10f966bbf5 ("f2fs: use generic EFSBADCRC/EFSCORRUPTED")
Signed-off-by: Icenowy Zheng <icenowy@aosc.io>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Explicitly initialize the onstack structures to zero so we don't leak
kernel memory into userspace when converting the in-core inumbers
structure to the v1 inogrp ioctl structure. Add a comment about why we
have to use memset to ensure that the padding holes in the structures
are set to zero.
Fixes: 5f19c7fc68 ("xfs: introduce v5 inode group structure")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Here are some small SPDX fixes for 5.3-rc2 for things that came in
during the 5.3-rc1 merge window that we previously missed.
Only 3 small patches here:
- 2 uapi patches to resolve some SPDX tags that were not correct
- fix an invalid SPDX tag in the iomap Makefile file
All have been properly reviewed on the public mailing lists.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXT2N9w8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylY9wCeJIYfs/eNf3tsjLQXxUBMYAJNqnsAn2IaMiTt
cv2mck7JZm5KyHpP3f5N
=RSZa
-----END PGP SIGNATURE-----
Merge tag 'spdx-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx
Pull SPDX fixes from Greg KH:
"Here are some small SPDX fixes for 5.3-rc2 for things that came in
during the 5.3-rc1 merge window that we previously missed.
Only three small patches here:
- two uapi patches to resolve some SPDX tags that were not correct
- fix an invalid SPDX tag in the iomap Makefile file
All have been properly reviewed on the public mailing lists"
* tag 'spdx-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
iomap: fix Invalid License ID
treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers again
treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers
Pull scheduler fixes from Thomas Gleixner:
"Two fixes for the fair scheduling class:
- Prevent freeing memory which is accessible by concurrent readers
- Make the RCU annotations for numa groups consistent"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Use RCU accessors consistently for ->numa_group
sched/fair: Don't free p->numa_faults with concurrent readers
Hi Linus,
Please, pull the following patches that mark switch cases where we are
expecting to fall through. These patches are part of the ongoing efforts
to enable -Wimplicit-fallthrough. Most of them have been baking in linux-next
for a whole development cycle.
Also, pull the Makefile patch that globally enables the
-Wimplicit-fallthrough option.
Finally, some missing-break fixes that have been tagged for -stable:
- drm/amdkfd: Fix missing break in switch statement
- drm/amdgpu/gfx10: Fix missing break in switch statement
Notice that with these changes, we completely get rid of all the
fall-through warnings in the kernel.
Thanks
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl06XP0ACgkQRwW0y0cG
2zHPXhAAsatJGNIg7vuSicVipJBIlwgRLwtbE2rV+MCneUAZj37O9NtBgbxHNoJ+
5TBB8sLpNCw3rEvPKwm6vicRgntMclEY1vaplPOHt3RiY4lqVSIkpvFSCw1hGur3
+34+O++n/6HbtY96T+DN5WGMXrU3JSe46xnBLIt0BOoUwyKMaNdntiyd79GrrnyB
BwPkHrmB+9FEVq+dHmPnRIhc9WUfIdily3+j1CIncaM4eXKWLjoUlOIw1VcSEc4X
SSdFh2co+3nm/6O54ZxUEC1PyuQZWedsgFmeiTrOanG3AEWQ/jX7GwXvPlgraa2E
F5MzGUOQwXYL/IzXl1vGjQc4+FV4nhIjEBTDdZseOp7FP5xkHyyOwzGDEmaMXpXT
XThb+k7Q6EbiSPLFcz8zkCwNB8ngNJMNOsGP3JPFD06MfquqdzP7T1BsHcNtiMVo
FNwQg9KtvbK9F7a9lXDqfcxfuEScbqSvnKWWwNLImCqIBATYirJzeTeLvTbVWpA2
iyXF3ylIclsSMOvCtaz41iuoqNh12eqw2UIFPBmkU2wpVkTt+ZIn0Apgom90xe1K
tSUqMBbBMoHVtsBsILdkdz/m6FJpK+YmmwpRrJSDvPxA5c9caD7cj+62amW/gP4z
Hi6mC4hw1H5bKMsC2PP/QgJcmS/pgo1zwA9J2a2NNJbYPATHfKI=
=J2sO
-----END PGP SIGNATURE-----
Merge tag 'Wimplicit-fallthrough-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull Wimplicit-fallthrough enablement from Gustavo A. R. Silva:
"This marks switch cases where we are expecting to fall through, and
globally enables the -Wimplicit-fallthrough option in the main
Makefile.
Finally, some missing-break fixes that have been tagged for -stable:
- drm/amdkfd: Fix missing break in switch statement
- drm/amdgpu/gfx10: Fix missing break in switch statement
With these changes, we completely get rid of all the fall-through
warnings in the kernel"
* tag 'Wimplicit-fallthrough-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
Makefile: Globally enable fall-through warning
drm/i915: Mark expected switch fall-throughs
drm/amd/display: Mark expected switch fall-throughs
drm/amdkfd/kfd_mqd_manager_v10: Avoid fall-through warning
drm/amdgpu/gfx10: Fix missing break in switch statement
drm/amdkfd: Fix missing break in switch statement
perf/x86/intel: Mark expected switch fall-throughs
mtd: onenand_base: Mark expected switch fall-through
afs: fsclient: Mark expected switch fall-throughs
afs: yfsclient: Mark expected switch fall-throughs
can: mark expected switch fall-throughs
firewire: mark expected switch fall-throughs
f2fs_allocate_data_block() invalidates old block address and enable new block
address. Then, if we try to read old block by f2fs_submit_page_bio(), it will
give WARN due to reading invalid blocks.
Let's make the order sanely back.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl07KzUACgkQxWXV+ddt
WDs6Ow/+LQ4He5jW+xpgZ+5fSnSjervfSsYWxKARc4hYZog5l8e2ONASzrniKR0H
aU2SWUyLF0Os1w++vp1FYYwa1xaPJJRTvlqTI6DSCFgEpu9DefXkCAFuP/+fXbhC
Fx+ZDAbqgY2gUCEWcs4jQusSgz1sO0m97SJR7K4Vz+ZaBa+eneE5Hkm4kZd8AXg6
ufKkjOD0+Vy+CUVBZdPg8RlU5KRJ+yAE7B2CRyuSeq6cF0LxTxoNeAtWz5KWd6Pn
aNWDqa07na4kQz/p5s7Bw6qPC6i4hTdnNKVcc8N79iyz2TAetbjN8VXfbFFVCLZ9
3+9O/kFCh+ZolgV1f9U8iLwSbg7V/8wu1uUdR3645cq2PQljluMJEHPSWOmm+LVw
nOhm8VQiGsILbEYLRJbM3wI8dT4B0PtgP9IqYuk7TLV3qho6Pg8J2nyfpo5jehs5
IP0gDanCBHYgsoCUwZwnJaKH3hIP2+5IJw6AqX0Lt6ge9aYA35+kWR4KE5459Pxw
n0Eoiu+5y9QcfpgUuzb5EQIxHdRNlCjik70+n1yXBwwPtV9b6wDN9TlP3eI7p/Vx
j+FUYcgrjwjfAP7Eh5a/ne/XEcq1R5UghM+2TWEWpqGh11NPnMDJS6d4OYyeCvA6
yPLAhVvvqdbYM+PpALaNjvLxspxd6MSuKNLNk3pkJmkd69IXnXw=
=RT/D
-----END PGP SIGNATURE-----
Merge tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two regression fixes:
- hangs caused by a missing barrier in the locking code
- memory leaks of extent_state due to bad handling of a cached
pointer"
* tag 'for-5.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix extent_state leak in btrfs_lock_and_flush_ordered_range
btrfs: Fix deadlock caused by missing memory barrier
Pull vfs umount_tree() leak fix from Al Viro:
"Fix braino introduced in 'switch the remnants of releasing the
mountpoint away from fs_pin'.
The most visible result is leaking struct mount when mounting btrfs,
making it impossible to shut down"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix the struct mount leak in umount_tree()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl07DGAQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplf5EADOOvOdsz9N/Iw8ZHHHJXCqKR26zZv75G1z
0h1PGC7p0JZQbYFo0Zo7mjiRBGlg6tlXc2d4Gyl94XJKDwjeYTcFDvbvERdYa+MH
d2RiFkAfR967Ri4fb+FP5L3mYOQdMJ/zk0xCDHLv/DcxeFLa5a9EJS1+vBSR+AcB
0JpJWuHypGqGmbTaL0z9q2pmx0mgA1ERlWQtkMLrsEr2Vqg/rrjGwe2bGFY00lXc
vKtFkpfugKc4zVAPSzC1YZgojfDDpGNEA4QMtxMsEH4hqyMpHhrtUedNY5QrjC0B
p9h6aPXXYr2KhGP0grrEytzaYUOzK2crK5h+q+1vu6nOgx2EgmnLM9tBu/LuRH1j
uUzKJOa3/AE+bU7uZEsaUerTBsHrgEBa1x8G92obYRnjgW3aCD2CaSbjjBhNxTZ4
1dXyr0DTHFXZmfcfWja5tO26JTPzjwVOrwiRyU0S727UsdVJupoHiYLr5fwaDfgn
/Du2I/XWvFtflm5i0ND0sdcX1yRlFiGZ9e45z1QFaFmcteKKWzRBDlC6mQzI/lw3
oc583mhDR3tRtJxow+wn6AuMUehFRh8wj0UhL/MEMjLW8GiqXU5aRtanT+22Xz4L
saNDQieeEnV7raMYXMP0qIhkJtrNASmJQos+MOJAEGOWcS2ePIUUio2kSXie+071
BphJd2RamQ==
=HIzH
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Several io_uring fixes/improvements:
- Blocking fix for O_DIRECT (me)
- Latter page slowness for registered buffers (me)
- Fix poll hang under certain conditions (me)
- Defer sequence check fix for wrapped rings (Zhengyuan)
- Mismatch in async inc/dec accounting (Zhengyuan)
- Memory ordering issue that could cause stall (Zhengyuan)
- Track sequential defer in bytes, not pages (Zhengyuan)
- NVMe pull request from Christoph
- Set of hang fixes for wbt (Josef)
- Redundant error message kill for libahci (Ding)
- Remove unused blk_mq_sched_started_request() and related ops (Marcos)
- drbd dynamic alloc shash descriptor to reduce stack use (Arnd)
- blkcg ->pd_stat() non-debug print (Tejun)
- bcache memory leak fix (Wei)
- Comment fix (Akinobu)
- BFQ perf regression fix (Paolo)
* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
io_uring: ensure ->list is initialized for poll commands
Revert "nvme-pci: don't create a read hctx mapping without read queues"
nvme: fix multipath crash when ANA is deactivated
nvme: fix memory leak caused by incorrect subsystem free
nvme: ignore subnqn for ADATA SX6000LNP
drbd: dynamically allocate shash descriptor
block: blk-mq: Remove blk_mq_sched_started_request and started_request
bcache: fix possible memory leak in bch_cached_dev_run()
io_uring: track io length in async_list based on bytes
io_uring: don't use iov_iter_advance() for fixed buffers
block: properly handle IOCB_NOWAIT for async O_DIRECT IO
blk-mq: allow REQ_NOWAIT to return an error inline
io_uring: add a memory barrier before atomic_read
rq-qos: use a mb for got_token
rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
rq-qos: don't reset has_sleepers on spurious wakeups
rq-qos: fix missed wake-ups in rq_qos_throttle
wait: add wq_has_single_sleeper helper
block, bfq: check also in-flight I/O in dispatch plugging
block: fix sysfs module parameters directory path in comment
...
We need to drop everything we remove from the tree, whether
mnt_has_parent() is true or not. Usually the bug manifests as a slow
memory leak (leaked struct mount for initramfs); it becomes much more
visible in mount_subtree() users, such as btrfs. There we leak
a struct mount for btrfs superblock being mounted, which prevents
fs shutdown on subsequent umount.
Fixes: 56cbb429d9 ("switch the remnants of releasing the mountpoint away from fs_pin")
Reported-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
btrfs_lock_and_flush_ordered_range() loads given "*cached_state" into
cachedp, which, in general, is NULL. Then, lock_extent_bits() updates
"cachedp", but it never goes backs to the caller. Thus the caller still
see its "cached_state" to be NULL and never free the state allocated
under btrfs_lock_and_flush_ordered_range(). As a result, we will
see massive state leak with e.g. fstests btrfs/005. Fix this bug by
properly handling the pointers.
Fixes: bd80d94efb ("btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.
This patch fixes the following warnings:
Warning level 3 was used: -Wimplicit-fallthrough=3
fs/afs/fsclient.c: In function ‘afs_deliver_fs_fetch_acl’:
fs/afs/fsclient.c:2199:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/fsclient.c:2202:2: note: here
case 1:
^~~~
fs/afs/fsclient.c:2216:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/fsclient.c:2219:2: note: here
case 2:
^~~~
fs/afs/fsclient.c:2225:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/fsclient.c:2228:2: note: here
case 3:
^~~~
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.
This patch fixes the following warnings:
fs/afs/yfsclient.c: In function ‘yfs_deliver_fs_fetch_opaque_acl’:
fs/afs/yfsclient.c:1984:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:1987:2: note: here
case 1:
^~~~
fs/afs/yfsclient.c:2005:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2008:2: note: here
case 2:
^~~~
fs/afs/yfsclient.c:2014:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2017:2: note: here
case 3:
^~~~
fs/afs/yfsclient.c:2035:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2038:2: note: here
case 4:
^~~~
fs/afs/yfsclient.c:2047:19: warning: this statement may fall through [-Wimplicit-fallthrough=]
call->unmarshall++;
~~~~~~~~~~~~~~~~^~
fs/afs/yfsclient.c:2050:2: note: here
case 5:
^~~~
Warning level 3 was used: -Wimplicit-fallthrough=3
Also, fix some commenting style issues.
This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Daniel reports that when testing an http server that uses io_uring
to poll for incoming connections, sometimes it hard crashes. This is
due to an uninitialized list member for the io_uring request. Normally
this doesn't trigger and none of the test cases caught it.
Reported-by: Daniel Kozak <kozzi11@gmail.com>
Tested-by: Daniel Kozak <kozzi11@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>