This patch adds a mount option, nobarrier, in f2fs.
The assumption in here is that file system keeps the IO ordering, but
doesn't care about cache flushes inside the storages.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should put root inode correctly in error path of fill_super, otherwise we
may encounter a leak case of inode resource.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now new interface ->rename2() is added to VFS, here are related description:
https://lkml.org/lkml/2014/2/7/873https://lkml.org/lkml/2014/2/7/758
This patch adds function f2fs_rename2() to support ->rename2() including
handling both RENAME_EXCHANGE and RENAME_NOREPLACE flag.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, if a large amount of direct IO writes were done, the
segment allocation may be failed because no enough segments are gced.
Changes:
v2: add f2fs_balance_fs into __get_data_block instead of f2fs_direct_IO.
Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __set_test_and_free we will check whether all segment are free in one section
When free one segment, in order to set section to free status.
But the searching region of segmap is from start segno to last segno of f2fs,
it's not necessary. So let's just only check all segment bitmap of target
section.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We assume that modification of some special application could result in zeroed
name_len, or it is consciously made by somebody. We will deadloop in
find_in_block when name_len of dir entry is zero.
This patch is added for preventing deadloop in above scenario.
change log from v1:
o use f2fs_bug_on rather than break out from searching dir entry suggested by
Jaegeuk Kim.
Jaegeuk describe:
"Well, IMO, it would be good to add f2fs_bug_on() here with a specific comment.
In the current phase of f2fs, it is more important to investigate the file
system bugs, rather than workarounds for any corrupted images.
And, definitely it needs to stop the kernel if any corrupted image was mounted,
so that we can figure out where the bugs are occurred."
Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this patch we use below inner macro and function to clean up codes.
1. ADDRS_PER_PAGE
2. SM_I
3. f2fs_readonly
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When we fail in ->write_begin()/->direct_IO(), our allocated node block in disk
and page cache are still kept, despite these may not be used again.
This patch introduce f2fs_write_failed() to handle the error case of these two
interfaces, it will truncate page cache and blocks of this file according to
i_size.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
kernel side(xx_init_acl), the acl is get/cloned from the parent dir's,
which is credible. So remove the redundant validation check of acl
here.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In our rename process, region of f2fs_lock_op covered is too big as some of the
code like f2fs_empty_dir/f2fs_find_entry are not needed to protect by this lock.
So in the extreme case like doing checkpoint when we rename old inode to exist
inode in a large directory could cause lower concurrency.
Let's reduce the region of f2fs_lock_op to fix this.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
journal is full, then flush the left dirty entries to disk without merge
journaled entries, so these journaled entries may be flushed to disk at next
checkpoint but lost chance to flushed last time.
In this patch we merge dirty entries located in same NAT block to nat entry set,
and linked all set to list, sorted ascending order by entries' count of set.
Later we flush entries in sparse set into journal as many as we can, and then
flush merged entries to disk. In this way we can not only gain in performance,
but also save lifetime of flash device.
In my testing environment, it shows this patch can help to reduce NAT block
writes obviously. In hard disk test case: cost time of fsstress is stablely
reduced by about 5%.
1. virtual machine + hard disk:
fsstress -p 20 -n 200 -l 5
node num cp count nodes/cp
based 4599.6 1803.0 2.551
patched 2714.6 1829.6 1.483
2. virtual machine + 32g micro SD card:
fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
-f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
-f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S
node num cp count nodes/cp
based 84.5 43.7 1.933
patched 49.2 40.0 1.23
Our latency of merging op shows not bad when handling extreme case like:
merging a great number of dirty nats:
latency(ns) dirty nat count
3089219 24922
5129423 27422
4000250 24523
change log from v1:
o fix wrong logic in add_nat_entry when grab a new nat entry set.
o swith to create slab cache in create_node_manager_caches.
o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.
change log from v2:
o make comment position more appropriate suggested by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_do_tmpfile to eliminate the redundant init_inode_metadata
flow.
Throught this, we can provide the consistent lock usage, e.g., fi->i_sem, and
this will enable better debugging stuffs.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add function f2fs_tmpfile() to support O_TMPFILE file creation, and modify logic
of init_inode_metadata to enable linkat temp file.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After we call find_data_page in truncate_partial_data_page, we could not
guarantee this page is updated or not as error may occurred in lower layer.
We'd better check status of the page to avoid this no updated page be
writebacked to device.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We have already set page update in ->write_begin, so we should remove redundant
SetPageUptodate in ->write_end.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
o fix normal and recovery path for fallocated regions
o fix error case mishandling
o recover renamed fsync inodes correctly
o fix to get out of infinite loops in balance_dirty_pages
o fix kernel NULL pointer error
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTvUA5AAoJEEAUqH6CSFDSSKgP/RQ6ryncwwSUilDswq95/VI1
qXwAlHLBgJkPquld6Klqw//4ot49sThCjBtusxdNqoyB5aSb/xqupJxRvCrJe1RQ
dRDYP1Mq63phd0cWsjAokfwXuiJQ2Ys/1bq2HguzAhL+7qNVNJEoy27ISUgvh71J
3v9pTfOqFY/qMxAa1Y91kIat3/27QTCtVQdS1sQM7s8UXlZHIIGyxrSmYWPUGNar
yVtMNtgMQcEtmekRAjstM0glj3IukosTP1jameXYumEw9bchfIeeLznvtDiEqxKA
maXtEPA+yrEk5y+RhOiBgaHuV/9uNmrHHvTwoqhMl9Wl+I4RzxpOhD2agRAUFbdn
rvPKU514tsjhkdelSYf0v2rXf0PxZcZ5XE27TZ+xyhCADKykBdN5ZzTH1OUWjEOA
TNdPVKv2btpvEdGdmdGzjKIQpPfjLgJLAKqDNNTSQ3u4XlVioMn6IyzEGddz41By
kSU0Hzj3iBHk+XlqBWSELOd34aCuvqXG/gcE7rWOj0qbJ5T6GKVRTQN5CbqMNutJ
Udw0JDhImgYxNI5fsy7Stg/5IqOwhp/pDIpLOHXRnYpLb2rJ1kzvgz4B/eJAZCcc
zmjxZBn1C2GLBJYFDbY1KeR5Tp6WZ9yok+wbXFiO1mpx5RsU7jIL64X/7+Zg0X84
p3LlN/vBn1nr2DiB3+n/
=pwxz
-----END PGP SIGNATURE-----
Merge tag 'f2fs-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs bugfixes from Jaegeuk Kim:
"This includes a couple of bug fixes found by xfstests. In addition,
one critical bug was reported by Brian Chadwick, which is falling into
the infinite loop in balance_dirty_pages. And it turned out due to
the IO merging policy in f2fs, which was newly merged in 3.16.
- fix normal and recovery path for fallocated regions
- fix error case mishandling
- recover renamed fsync inodes correctly
- fix to get out of infinite loops in balance_dirty_pages
- fix kernel NULL pointer error"
* tag 'f2fs-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
f2fs: avoid to access NULL pointer in issue_flush_thread
f2fs: check bdi->dirty_exceeded when trying to skip data writes
f2fs: do checkpoint for the renamed inode
f2fs: release new entry page correctly in error path of f2fs_rename
f2fs: fix error path in init_inode_metadata
f2fs: check lower bound nid value in check_nid_range
f2fs: remove unused variables in f2fs_sm_info
f2fs: fix not to allocate unnecessary blocks during fallocate
f2fs: recover fallocated data and its i_size together
f2fs: fix to report newly allocate region as extent
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=75861
Denis 2014-05-10 11:28:59 UTC reported:
"F2FS-fs (mmcblk0p28): mounting..
Unable to handle kernel NULL pointer dereference at virtual address 00000018
...
[<c0a2f678>] (_raw_spin_lock+0x3c/0x70) from [<c03a0330>] (issue_flush_thread+0x50/0x17c)
[<c03a0330>] (issue_flush_thread+0x50/0x17c) from [<c01b4064>] (kthread+0x98/0xa4)
[<c01b4064>] (kthread+0x98/0xa4) from [<c0108060>] (kernel_thread_exit+0x0/0x8)"
This patch assign cmd_control_info in sm_info before issue_flush_thread is being
created, so this make sure that issue flush thread will have no chance to access
invalid info in fcc.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we don't check the current backing device status, balance_dirty_pages can
fall into infinite pausing routine.
This can be occurred when a lot of directories make a small number of dirty
dentry pages including files.
Reported-by: Brian Chadwick <brianchad@westnet.com.au>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If an inode is renamed, it should be registered as file_lost_pino to conduct
checkpoint at f2fs_sync_file.
Otherwise, the inode cannot be recovered due to no dent_mark in the following
scenario.
Note that, this scenario is from xfstests/322.
1. create "a"
2. fsync "a"
3. rename "a" to "b"
4. fsync "b"
5. Sudden power-cut
After recovery is done, "b" should be seen.
However, the result shows "a", since the recovery procedure does not enter
recover_dentry due to no dent_mark.
The reason is like below.
- The nid of "a" is checkpointed during #2, f2fs_sync_file.
- The inode page for "b" produced by #3 is written without dent_mark by
sync_node_pages.
So, this patch fixes this bug by assinging file_lost_pino to the "a"'s inode.
If the pino is lost, f2fs_sync_file conducts checkpoint, and then recovers
the latest pino and its dentry information for further recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch correct releasing code of new_page to avoid BUG_ON in error patch of
f2fs_rename.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we fail in this path:
->init_inode_metadata
->make_empty_dir
->get_new_data_page
->grab_cache_page return -ENOMEM
We will bug on in error path of init_inode_metadata when call remove_inode_page
because i_block = 2 (one inode block will be released later & one dentry block).
We should release the dentry block in init_inode_metadata to avoid this BUG_ON,
and avoid leak of dentry block resource, because we never have second chance to
release that block in ->evict_inode as in upper error path we make this inode
'bad'.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch add lower bound verification for nid in check_nid_range, so nids
reserved like 0, node, meta passed by caller could be checked there.
And then check_nid_range could be used in f2fs_nfs_get_inode for simplifying
code.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull btrfs fixes from Chris Mason:
"We've queued up a few fixes in my for-linus branch"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix crash when starting transaction
Btrfs: fix btrfs_print_leaf for skinny metadata
Btrfs: fix race of using total_bytes_pinned
btrfs: use E2BIG instead of EIO if compression does not help
btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
Btrfs: fix use-after-free when cloning a trailing file hole
btrfs: fix null pointer dereference in btrfs_show_devname when name is null
btrfs: fix null pointer dereference in clone_fs_devices when name is null
btrfs: fix nossd and ssd_spread mount option regression
Btrfs: fix race between balance recovery and root deletion
Btrfs: atomically set inode->i_flags in btrfs_update_iflags
btrfs: only unlock block in verify_parent_transid if we locked it
Btrfs: assert send doesn't attempt to start transactions
btrfs compression: reuse recently used workspace
Btrfs: fix crash when mounting raid5 btrfs with missing disks
btrfs: create sprout should rename fsid on the sysfs as well
btrfs: dev replace should replace the sysfs entry
btrfs: dev add should add its sysfs entry
btrfs: dev delete should remove sysfs entry
btrfs: rename add_device_membership to btrfs_kobj_add_device
Well, one drivercore fix for kernfs to resolve a reported issue with
sysfs files being updated from atomic contexts, and another lz4 bugfix
for testing potential buffer overflows.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlO1/FEACgkQMUfUDdst+ynRPACfWcssJKICc2N7g9/0XXGVTjVT
PwwAnjQ8bjOfu6i2z/lViLtZGjOnzKor
=qtjB
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Well, one drivercore fix for kernfs to resolve a reported issue with
sysfs files being updated from atomic contexts, and another lz4 bugfix
for testing potential buffer overflows"
* tag 'driver-core-3.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
lz4: add overrun checks to lz4_uncompress_unknownoutputsize()
kernfs: kernfs_notify() must be useable from non-sleepable contexts
Pull nfsd bugfixes from Bruce Fields:
"By coincidence, two NFSv4 symlink bugs, one introduced in the 3.16 xdr
encoding rewrite, the other a decoding bug that I think we've had
since the start but that just doesn't trigger very often"
* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
nfs: fix nfs4d readlink truncated packet
nfsd: fix rare symlink decoding bug
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.
E.g. for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation. In such
situations reading /proc/stat is not possible anymore.
Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.
For reference a call trace where reading from /proc/stat failed:
sadc: page allocation failure: order:4, mode:0x1040d0
CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[...]
Call Trace:
show_stack+0x6c/0xe8
warn_alloc_failed+0xd6/0x138
__alloc_pages_nodemask+0x9da/0xb68
__get_free_pages+0x2e/0x58
kmalloc_order_trace+0x44/0xc0
stat_open+0x5a/0xd8
proc_reg_open+0x8a/0x140
do_dentry_open+0x1bc/0x2c8
finish_open+0x46/0x60
do_last+0x382/0x10d0
path_openat+0xc8/0x4f8
do_filp_open+0x46/0xa8
do_sys_open+0x114/0x1f0
sysc_tracego+0x14/0x1a
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two patches are supposed to "fix" failed order-4 memory
allocations which have been observed when reading /proc/stat. The
problem has been observed on s390 as well as on x86.
To address the problem change the seq_file memory allocations to
fallback to use vmalloc, so that allocations also work if memory is
fragmented.
This approach seems to be simpler and less intrusive than changing
/proc/stat to use an interator. Also it "fixes" other users as well,
which use seq_file's single_open() interface.
This patch (of 2):
Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open coding it. Also
calculate the requested size using the number of online cpus instead of
possible cpus, since the size of the output only depends on the number
of online cpus.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On strict build environments we can see:
fs/autofs4/inode.c: In function 'autofs4_fill_super':
fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this function
make[2]: *** [fs/autofs4/inode.o] Error 1
make[1]: *** [fs/autofs4] Error 2
make: *** [fs] Error 2
make: *** Waiting for unfinished jobs....
This is due to the use of pgrp_set being used to indicate pgrp has has
been set rather than initializing pgrp itself.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We wouldn't actuall print the extent information if we had a skinny metadata
item, this fixes that. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
This percpu counter @total_bytes_pinned is introduced to skip unnecessary
operations of 'commit transaction', it accounts for those space we may free
but are stuck in delayed refs.
And we zero out @space_info->total_bytes_pinned every transaction period so
we have a better idea of how much space we'll actually free up by committing
this transaction. However, we do the 'zero out' part a little earlier, before
we actually unpin space, so we end up returning ENOSPC when we actually have
free space that's just unpinned from committing transaction.
xfstests/generic/074 complained then.
This fixes it by actually accounting the percpu pinned number when 'unpin',
and since it's protected by space_info->lock, the race is gone now.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Return codes got updated in 60e1975acb
(btrfs: return errno instead of -1 from compression)
lzo wrapper returns E2BIG in this case, do the same for zlib.
Signed-off-by: David Sterba <dsterba@suse.cz>
The transaction handle was being used after being freed.
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
dev->name is null but missing flag is not set.
Strictly speaking the missing flag should have been set, but there
are more places where code just checks if name is null. For now this
patch does the same.
stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
IP: [<ffffffffa0228908>] btrfs_show_devname+0x58/0xf0 [btrfs]
[<ffffffff81198879>] show_vfsmnt+0x39/0x130
[<ffffffff81178056>] m_show+0x16/0x20
[<ffffffff8117d706>] seq_read+0x296/0x390
[<ffffffff8115aa7d>] vfs_read+0x9d/0x160
[<ffffffff8115b549>] SyS_read+0x49/0x90
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b
reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs
Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
The commit
0780253 btrfs: Cleanup the btrfs_parse_options for remount.
broke ssd options quite badly; it stopped making ssd_spread
imply ssd, and it made "nossd" unsettable.
Put things back at least as well as they were before
(though ssd mount option handling is still pretty odd:
# mount -o "nossd,ssd_spread" works?)
Reported-by: Roman Mamedov <rm@romanrm.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Balance recovery is called when RW mounting or remounting from
RO to RW, it is called to finish roots merging.
When doing balance recovery, relocation root's corresponding
fs root(whose root refs is 0) might be destroyed by cleaner
thread, this will make btrfs fail to mount.
Fix this problem by holding @cleaner_mutex when doing balance
recovery.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This change is based on the corresponding recent change for ext4:
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
That has the following commit message that applies to btrfs as well:
"Use cmpxchg() to atomically set i_flags instead of clearing out the
S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
where an immutable file has the immutable flag cleared for a brief
window of time."
Replacing EXT4_IMMUTABLE_FL and EXT4_APPEND_FL with BTRFS_INODE_IMMUTABLE
and BTRFS_INODE_APPEND, respectively.
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
XDR requires 4-byte alignment; nfs4d READLINK reply writes out the padding,
but truncates the packet to the padding-less size.
Fix by taking the padding into consideration when truncating the packet.
Symptoms:
# ll /mnt/
ls: cannot read symbolic link /mnt/test: Input/output error
total 4
-rw-r--r--. 1 root root 0 Jun 14 01:21 123456
lrwxrwxrwx. 1 root root 6 Jul 2 03:33 test
drwxr-xr-x. 1 root root 0 Jul 2 23:50 tmp
drwxr-xr-x. 1 root root 60 Jul 2 23:44 tree
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Fixes: 476a7b1f4b (nfsd4: don't treat readlink like a zero-copy operation)
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
d911d98748 ("kernfs: make kernfs_notify() trigger inotify events
too") added fsnotify triggering to kernfs_notify() which requires a
sleepable context. There are already existing users of
kernfs_notify() which invoke it from an atomic context and in general
it's silly to require a sleepable context for triggering a
notification.
The following is an invalid context bug triggerd by md invoking
sysfs_notify() from IO completion path.
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
2 locks held by swapper/1/0:
#0: (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk]
#1: (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240
irq event stamp: 33518
hardirqs last enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230
hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72
softirqs last enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50
softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
Call Trace:
<IRQ> [<ffffffff81807b4c>] dump_stack+0x4d/0x66
[<ffffffff810d4f14>] __might_sleep+0x184/0x240
[<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440
[<ffffffff812d76a0>] kernfs_notify+0x90/0x150
[<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240
[<ffffffffa00de863>] close_write+0x93/0xb0 [raid1]
[<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1]
[<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1]
[<ffffffff813acb8b>] bio_endio+0x6b/0xa0
[<ffffffff813b46c4>] blk_update_request+0x94/0x420
[<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70
[<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk]
[<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120
[<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30
[<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk]
[<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring]
[<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340
[<ffffffff8111647d>] handle_irq_event+0x3d/0x60
[<ffffffff81119436>] handle_edge_irq+0x66/0x130
[<ffffffff8101c3e4>] handle_irq+0x84/0x150
[<ffffffff818146ad>] do_IRQ+0x4d/0xe0
[<ffffffff818122f2>] common_interrupt+0x72/0x72
<EOI> [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10
[<ffffffff81025454>] default_idle+0x24/0x230
[<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20
[<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0
[<ffffffff8104df1b>] start_secondary+0x25b/0x300
This patch fixes it by punting the notification delivery through a
work item. This ends up adding an extra pointer to kernfs_elem_attr
enlarging kernfs_node by a pointer, which is not ideal but not a very
big deal either. If this turns out to be an actual issue, we can move
kernfs_elem_attr->size to kernfs_node->iattr later.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix a number of miscellaneous bugs for punch hole as well as a
long-standing potential double buffer head release when failing a
block allocation for an indirect-mapped file.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTsKRpAAoJENNvdpvBGATwf3YQAJoGNvrxd6BPpB0bSTLRdh+k
0IDShwB9PpSJulIHx8xIOpFIvpk5T0E1Ho+UqUnpMImRiqueYbFfOPca4lYsRBqW
r9TNmm0O8Bf8K0j8YUaV2P8BGNJuiCzv7YcCbSHt1/eP6/InyfJWA17hYD6oxBPF
ZrrcZemDH863KhYIF55sbx9UvBLz515ifd4kHN9pSVbV3pJT5/zPiRk/wujQZTrX
v0z5pe5i8GfYoBMmdCNak3a1YoXdf+FUdID+pvGWtfzs8AG8nS7jb8C1zkfgUWtC
zau9yBnFHKlBdmoPrFJIpvDT/2rBRks6g6uM2Jsc+YS+zHCXi0xJR1EaPIqkRJbf
vWHNSogzOKetyFZFML1Dg1cjXVEnPusOsyhvcXTFwb3n/YY1NBFdLiVgNbTzQ7aU
X7P4M+ca2yib2rQos5Ltipk4ju9eT4d5qsf+QSaaryYeUHg+8sNaqd82y42eita1
Dgg7IpyKWdNo+lBHEtDSV5Gli4oRrXddj7Yvrib7nobf+FTi4cpbbu0PfY5qXfDV
vKrsvIiJcPJYsvi+USH3fvv2m6UaZd+gDo/RObV06tSNTZWqGzalSeviaJZ+cfEP
v05ODGLUNvg+thhvBsYAGNeaV/SOYztmNA6v0qfI+OwscH8ycGeb6R98Kil/hd97
P3i12fhP7tkffNTNPM/E
=X4vY
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Fix a regression when trying to compile ext4 on older versions gcc.
Fix a number of miscellaneous bugs for punch hole as well as a
long-standing potential double buffer head release when failing a
block allocation for an indirect-mapped file"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix hole punching for files with indirect blocks
ext4: Fix block zeroing when punching holes in indirect block files
ext4: decrement free clusters/inodes counters when block group declared bad
fs/mbcache: replace __builtin_log2() with ilog2()
ext4: Fix buffer double free in ext4_alloc_branch()