The mount manpage says of the max_batch_time option,
This optimization can be turned off entirely
by setting max_batch_time to 0.
But the code doesn't do that. So fix the code to do
that.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
We are spending a lot of time explaining to users what this error
means. Let's try to improve the message to avoid this problem.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
The first time that we allocate from an uninitialized inode allocation
bitmap, if the block allocation bitmap is also uninitalized, we need
to get write access to the block group descriptor before we start
modifying the block group descriptor flags and updating the free block
count, etc. Otherwise, there is the potential of a bad journal
checksum (if journal checksums are enabled), and of the file system
becoming inconsistent if we crash at exactly the wrong time.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Fix a number of miscellaneous bugs for punch hole as well as a
long-standing potential double buffer head release when failing a
block allocation for an indirect-mapped file.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTsKRpAAoJENNvdpvBGATwf3YQAJoGNvrxd6BPpB0bSTLRdh+k
0IDShwB9PpSJulIHx8xIOpFIvpk5T0E1Ho+UqUnpMImRiqueYbFfOPca4lYsRBqW
r9TNmm0O8Bf8K0j8YUaV2P8BGNJuiCzv7YcCbSHt1/eP6/InyfJWA17hYD6oxBPF
ZrrcZemDH863KhYIF55sbx9UvBLz515ifd4kHN9pSVbV3pJT5/zPiRk/wujQZTrX
v0z5pe5i8GfYoBMmdCNak3a1YoXdf+FUdID+pvGWtfzs8AG8nS7jb8C1zkfgUWtC
zau9yBnFHKlBdmoPrFJIpvDT/2rBRks6g6uM2Jsc+YS+zHCXi0xJR1EaPIqkRJbf
vWHNSogzOKetyFZFML1Dg1cjXVEnPusOsyhvcXTFwb3n/YY1NBFdLiVgNbTzQ7aU
X7P4M+ca2yib2rQos5Ltipk4ju9eT4d5qsf+QSaaryYeUHg+8sNaqd82y42eita1
Dgg7IpyKWdNo+lBHEtDSV5Gli4oRrXddj7Yvrib7nobf+FTi4cpbbu0PfY5qXfDV
vKrsvIiJcPJYsvi+USH3fvv2m6UaZd+gDo/RObV06tSNTZWqGzalSeviaJZ+cfEP
v05ODGLUNvg+thhvBsYAGNeaV/SOYztmNA6v0qfI+OwscH8ycGeb6R98Kil/hd97
P3i12fhP7tkffNTNPM/E
=X4vY
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Fix a regression when trying to compile ext4 on older versions gcc.
Fix a number of miscellaneous bugs for punch hole as well as a
long-standing potential double buffer head release when failing a
block allocation for an indirect-mapped file"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix hole punching for files with indirect blocks
ext4: Fix block zeroing when punching holes in indirect block files
ext4: decrement free clusters/inodes counters when block group declared bad
fs/mbcache: replace __builtin_log2() with ilog2()
ext4: Fix buffer double free in ext4_alloc_branch()
Hole punching code for files with indirect blocks wrongly computed
number of blocks which need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause a data loss. For example:
fallocate -n -p 10240000 4096
will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096. Fix the calculation.
CC: stable@vger.kernel.org
Fixes: 8bad6fc813
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
free_holes_block() passed local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of proper place in inode / indirect block. We later
zero out proper place in inode / indirect block but don't dirty the
inode / buffer again which can lead to subtle issues (some changes e.g.
to inode can be lost).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We should decrement free clusters counter when block bitmap is marked
as corrupt and free inodes counter when the allocation bitmap is
marked as corrupt to avoid misunderstanding due to incorrect available
size in statfs result. User can get immediately ENOSPC error from
write begin without reaching for the writepages.
Cc: Darrick J. Wong<darrick.wong@oracle.com>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Pull CIFS fixes from Steve French:
"Small set of misc cifs/smb3 fixes"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
[CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option
cifs: revalidate mapping prior to satisfying read_iter request with cache=loose
fs/cifs: fix regression in cifs_create_mf_symlink()
Highlights include:
- Stable fix for a data corruption case due to incorrect cache validation
- Fix a couple of false positive cache invalidations
- Fix NFSv4 security negotiation issues
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTq1GIAAoJEGcL54qWCgDyCfUP/3S7Py5Gocdqvb7FBPpCWtsb
PJlv1RjC4ngT+BJpBeDSOFEcZerfeQAwguL5kEgIjdyKmsAjVjIF7ThagNQK/0yr
qpeKh2EtbAipjjXVmul7saG3Ucuv/PggEhqGl9iJK0QyPdmnr30cHGHHt3kCIPGE
e4AkaCN4ZuXBdDOO4YpKzIl6wQPb0Gjwps1boW4INCvnBvK6Yno26Q6ilDf92gJE
hisEn0l8l09C6t2jZKP7daCyGForTYYlMxIbmjmQhsMEwnh1kmfpr/xuAQP2bflr
14OFrNbrZg3p4ucp8g7EzgS1Z5m/Ism0xNKfO4LgNwUobSgbvvvScAC3/LP2HIIk
RXuRhgb8u6pbWQRqq4XznB+csh6DGR/ui2PhonK4lJDaJxcU3bnFlhTgoC0GSyCa
Wbbdv+nhXhw5Xi9jsma6PW/CnHJH6sk/8KviRPOpC+RsCg+X41vTHzC4XvWbentw
aZGkNuWAnBKMyswu08E4+ScFQxToSB6ju4RjOsTTMleC0ewWXD3Y6FL+B5p4crPO
L05KCLkP+SeRxpakOM3e/x/bkVOa+DBna7foXUZ9snWybYoOmuxOkJgJT7bxrYaA
/3N0e/WUUgPR/bhdydMJSRo6DchKj+5GRSpx8FB9eMqqp8mNE+I61/Kq0dFEbtPQ
1IQCFT4w1PEegDpwjb0L
=o+QR
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
"Highlights include:
- Stable fix for a data corruption case due to incorrect cache
validation
- Fix a couple of false positive cache invalidations
- Fix NFSv4 security negotiation issues"
* tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for support
NFS Return -EPERM if no supported or matching SECINFO flavor
NFS check the return of nfs4_negotiate_security in nfs4_submount
NFS: Don't mark the data cache as invalid if it has been flushed
NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file size
nfs: Fix cache_validity check in nfs_write_pageuptodate()
Fix compiler error with some gcc version(s) that do not
support __builtin_log2() by replacing __builtin_log2() with
ilog2().
Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
Fix nfs4_negotiate_security to create an rpc_clnt used to test each SECINFO
returned pseudoflavor. Check credential creation (and gss_context creation)
which is important for RPC_AUTH_GSS pseudoflavors which can fail for multiple
reasons including mis-configuration.
Don't call nfs4_negotiate in nfs4_submount as it was just called by
nfs4_proc_lookup_mountpoint (nfs4_proc_lookup_common)
Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: fix corrupt return value from nfs_find_best_sec()]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Do not return RPC_AUTH_UNIX if SEINFO reply tests fail. This
prevents an infinite loop of NFS4ERR_WRONGSEC for non RPC_AUTH_UNIX mounts.
Without this patch, a mount with no sec= option to a server
that does not include RPC_AUTH_UNIX in the
SECINFO return can be presented with an attemtp to use RPC_AUTH_UNIX
which will result in an NFS4ERR_WRONG_SEC which will prompt the SECINFO
call which will again try RPC_AUTH_UNIX....
Signed-off-by: Andy Adamson <andros@netapp.com>
Tested-By: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Now that we have functions such as nfs_write_pageuptodate() that use
the cache_validity flags to check if the data cache is valid or not,
it is a little more important to keep the flags in sync with the
state of the data cache.
In particular, we'd like to ensure that if the data cache is empty, we
don't start marking it as needing revalidation.
Reported-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In nfs_update_inode(), if the change attribute is seen to change on
the server, then we set NFS_INO_REVAL_PAGECACHE in order to make
sure that we check the file size.
However, if we also update the file size in the same function, we
don't need to check it again. So make sure that we clear the
NFS_INO_REVAL_PAGECACHE that was set earlier.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
NFS_INO_INVALID_DATA cannot be ignored, even if we have a delegation.
We're still having some problems with data corruption when multiple
clients are appending to a file and those clients are being granted
write delegations on open.
To reproduce:
Client A:
vi /mnt/`hostname -s`
while :; do echo "XXXXXXXXXXXXXXX" >>/mnt/file; sleep $(( $RANDOM % 5 )); done
Client B:
vi /mnt/`hostname -s`
while :; do echo "YYYYYYYYYYYYYYY" >>/mnt/file; sleep $(( $RANDOM % 5 )); done
What's happening is that in nfs_update_inode() we're recognizing that
the file size has changed and we're setting NFS_INO_INVALID_DATA
accordingly, but then we ignore the cache_validity flags in
nfs_write_pageuptodate() because we have a delegation. As a result,
in nfs_updatepage() we're extending the write to cover the full page
even though we've not read in the data to begin with.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: <stable@vger.kernel.org> # v3.11+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380be. The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and
Petr for disclosing this issue.
This patch applies to v3.12+. A separate backport is needed for 3.10/3.11.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping. Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+. A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
When we SMB3 mounted with mapchars (to allow reserved characters : \ / > < * ?
via the Unicode Windows to POSIX remap range) empty paths
(eg when we open "" to query the root of the SMB3 directory on mount) were not
null terminated so we sent garbarge as a path name on empty paths which caused
SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified. mapchars is
particularly important since Unix Extensions for SMB3 are not supported (yet)
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
When workqueue is delayed, it may occur that a lockres is purged while it
is still queued for master assert. it may trigger BUG() as follows.
N1 N2
dlm_get_lockres()
->dlm_do_master_requery
is the master of lockres,
so queue assert_master work
dlm_thread() start running
and purge the lockres
dlm_assert_master_worker()
send assert master message
to other nodes
receiving the assert_master
message, set master to N2
dlmlock_remote() send create_lock message to N2, but receive DLM_IVLOCKID,
if it is RECOVERY lockres, it triggers the BUG().
Another BUG() is triggered when N3 become the new master and send
assert_master to N1, N1 will trigger the BUG() because owner doesn't
match. So we should not purge lockres when it is queued for assert
master.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following case may lead to endless loop during umount.
node A node B node C node D
umount volume,
migrate lockres1
to B
want to lock lockres1,
send
MASTER_REQUEST_MSG
to C
init block mle
send
MIGRATE_REQUEST_MSG
to C
find a block
mle, and then
return
DLM_MIGRATE_RESPONSE_MASTERY_REF
to B
set C in refmap
umount successfully
try to umount, endless
loop occurs when migrate
lockres1 since C is in
refmap
So we can fix this endless loop case by only returning
DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving
MIGRATE_REQUEST_MSG.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the call to ocfs2_add_entry() failed in ocfs2_symlink() and
ocfs2_mknod(), iput() will not be called during dput(dentry) because no
d_instantiate(), and this will lead to umount hung.
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When running dirop_fileop_racer we found a dead lock case.
2 nodes, say Node A and Node B, mount the same ocfs2 volume. Create
/race/16/1 in the filesystem, and let the inode number of dir 16 is less
than the inode number of dir race.
Node A Node B
mv /race/16/1 /race/
right after Node A has got the
EX mode of /race/16/, and tries to
get EX mode of /race
ls /race/16/
In this case, Node A has got the EX mode of /race/16/, and wants to get EX
mode of /race/. Node B has got the PR mode of /race/, and wants to get
the PR mode of /race/16/. Since EX and PR are mutually exclusive, dead
lock happens.
This patch fixes this case by locking in ancestor order before trying
inode number order.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a lockres in purge list but is still in use, it should be moved to
the tail of purge list. dlm_thread will continue to check next lockres in
purge list. However, code list_move_tail(&dlm->purge_list,
&lockres->purge) will do *no* movements, so dlm_thread will purge the same
lockres in this loop again and again. If it is in use for a long time,
other lockres will not be processed.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch tries to fix this crash:
#5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
#6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
[exception RIP: ocfs2_direct_IO_get_blocks+359]
RIP: ffffffffa05dfa27 RSP: ffff88003c1cd7e8 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88003c1cdaa8 RCX: 0000000000000000
RDX: 000000000000000c RSI: ffff880027a95000 RDI: ffff88003c79b540
RBP: ffff88003c1cd858 R8: 0000000000000000 R9: ffffffff815f6ba0
R10: 00000000000001c9 R11: 00000000000001c9 R12: ffff88002d271500
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000001000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
#8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
#9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
#10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
#11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
#12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
#13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
#14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
#15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
#16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
#17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
#18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
#19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
#20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159
It crashes at
BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
in ocfs2_direct_IO_get_blocks.
ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in
ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken
during the time before (or inside) ocfs2_prepare_inode_for_write() and after
ocfs2_direct_IO_get_blocks().
It can happen in this case:
Node A(which crashes) Node B
------------------------ ---------------------------
ocfs2_file_aio_write
ocfs2_prepare_inode_for_write
ocfs2_inode_lock
...
ocfs2_inode_unlock
#no refcount found
.... ocfs2_reflink
ocfs2_inode_lock
...
ocfs2_inode_unlock
#now, refcount flag set on extent
...
flush change to disk
ocfs2_direct_IO_get_blocks
ocfs2_get_clusters
#extent map miss
#buffer_head miss
read extents from disk
found refcount flag on extent
crash..
Fix:
Take rw_lock in ocfs2_reflink path
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
75f82eaa50 ("ocfs2: fix NULL pointer dereference when dismount and
ocfs2rec simultaneously") may cause umount hang while shutting down
truncate log.
The situation is as followes:
ocfs2_dismout_volume
-> ocfs2_recovery_exit
-> free osb->recovery_map
-> ocfs2_truncate_shutdown
-> lock global bitmap inode
-> ocfs2_wait_for_recovery
-> check whether osb->recovery_map->rm_used is zero
Because osb->recovery_map is already freed, rm_used can be any other
values, so it may yield umount hang.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 18639535
Two node cluster and both nodes hold a lock at PR level and both want to
convert to EX at the same time. Master node 1 has sent BAST and then
closes the connection due to idletime out. Node 0 receives BAST, sends
unlock req with cancel flag but gets error -ENOTCONN. The problem is
this error is ignored in dlm_send_remote_unlock_request() on the
**incorrect** assumption that the master is dead. See NOTE in comment
why it returns DLM_NORMAL. Upon getting DLM_NORMAL, node 0 proceeds to
sends convert (without cancel flg) which fails with -ENOTCONN. waits 5
sec and resends.
This time gets DLM_IVLOCKID from the master since lock not found in
grant, it had been moved to converting queue in response to conv PR->EX
req. No way out.
Node 1 (master) Node 0
============== ======
lock mode PR PR
convert PR -> EX
mv grant -> convert and que BAST
...
<-------- convert PR -> EX
convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX))
...
BAST (want PR -> NL)
------------------>
...
idle timout, conn closed
...
In response to BAST,
sends unlock with cancel convert flag
gets -ENOTCONN. Ignores and
sends remote convert request
gets -ENOTCONN, waits 5 Sec, retries
...
reconnects
<----------------- convert req goes through on next try
does not find lock on grant que
status DLM_IVLOCKID
------------------>
...
No way out. Fix is to keep retrying unlock with cancel flag until it
succeeds or the master dies.
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two files a and b in dir /mnt/ocfs2.
node A node B
mv a b
In ocfs2_rename(), after calling
ocfs2_orphan_add(), the inode of
file b will be added into orphan
dir.
If ocfs2_update_entry() fails,
ocfs2_rename return error and mv
operation fails. But file b still
exists in the parent dir.
ocfs2_queue_orphan_scan
-> ocfs2_queue_recovery_completion
-> ocfs2_complete_recovery
-> ocfs2_recover_orphans
The inode of the file b will be
put with iput().
ocfs2_evict_inode
-> ocfs2_delete_inode
-> ocfs2_wipe_inode
-> ocfs2_remove_inode
OCFS2_VALID_FL in the inode
i_flags will be cleared.
The file b still can be accessed
on node B.
ls /mnt/ocfs2
When first read the file b with
ocfs2_read_inode_block(). It will
validate the inode using
ocfs2_validate_inode_block().
Because OCFS2_VALID_FL not set in
the inode i_flags, so the file
system will be readonly.
So we should add inode into orphan dir after updating entry in
ocfs2_rename().
Signed-off-by: alex.chen <alex.chen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTpiwZAAoJEAAOaEEZVoIVplsP/383a9q3eXonbsi+Ea8CGbRl
tdjjVhM1OY4NZYFAoulILDt3HqPTC6MBnqKlHz+BuMziwd/1+3w8S4E7IEwm/KtM
ghNYX8ct5Bf1nc5QEdDmwf4PX48QbRwTuT1uIcXaJ+KtxTzI9qN7mnjRN91TUtq4
WRGvOl0AsGCVq8YqxjztgD3TYbu7AG/72Em+DE9f81PTArAPTo2ySc3gxPuJJAsg
G1x46Gx46sfqFX2FY4SPsXen+J/67Og67y6eBawxnT2Bp6ZGDuW+jyPRmkhf0yth
pWAtkUi3XmEe6kk6GHiICsS0Yn0RG4jbz39+Ja+X7jibQVJ8Iz6b+Optw9RNQwYt
jDWHKFS2AaL/CDejHYOQ1shHcozpRojtIbDLIZ9vTNTQ2r5cdaBvkXMmQzdoktmN
wQtQ9AzBl8fHOFOQeCAwd/ZfCLIotvLoLds3K/CSqmpsyK2+9IyriQLKKZ2xm6Iu
8+UUspGQcNVwcMP6YWtI6G+u58/mVanmK6dtpiyXrncZLAfU4H7ETL2IPu8jJTbv
kTFCOJtXQzNZa5Xqur1hIewOG9/RlvAZAnii0Ghc3nXTWCCeNeI8re3jw4g9KdRv
33t4sYfJld8LQ1NSMqIDyAs+fvytmGurYt+uhVpb58G/4CqBLNpmmIGIQ6LFLMfs
75FQnbAezrD0H/JyAHUk
=AJD3
-----END PGP SIGNATURE-----
Merge tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux
Pull file locking fixes from Jeff Layton:
"File locking related bugfixes
Nothing too earth-shattering here. A fix for a potential regression
due to a patch in pile #1, and the addition of a memory barrier to
prevent a race condition between break_deleg and generic_add_lease"
* tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux:
locks: set fl_owner for leases back to current->files
locks: add missing memory barrier in break_deleg
Pull btrfs fixes from Chris Mason:
"This fixes some lockups in btrfs reported with rc1. It probably has
some performance impact because it is backing off our spinning locks
more often and switching to a blocking lock. I'll be able to nail
that down next week, but for now I want to get the lockups taken care
of.
Otherwise some more stack reduction and assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix wrong error handle when the device is missing or is not writeable
Btrfs: fix deadlock when mounting a degraded fs
Btrfs: use bio_endio_nodec instead of open code
Btrfs: fix NULL pointer crash when running balance and scrub concurrently
btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
Btrfs: fix broken free space cache after the system crashed
Btrfs: make free space cache write out functions more readable
Btrfs: remove unused wait queue in struct extent_buffer
Btrfs: fix deadlocks with trylock on tree nodes
Pull nfsd bugfixes from Bruce Fields:
"Fixes for a new regression from the xdr encoding rewrite, and a
delegation problem we've had for a while (made somewhat more annoying
by the vfs delegation support added in 3.13)"
* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
NFSD: fix bug for readdir of pseudofs
NFSD: Don't hand out delegations for 30 seconds after recalling them.
The original bio might be submitted, so we shoud increase bi_remaining to
account for it when we deal with the error that the device is missing or
is not writeable, or we would skip the endio handle.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The deadlock happened when we mount degraded filesystem, the reproduced
steps are following:
# mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
# echo 1 > /sys/block/`basename <dev0>`/device/delete
# mount -o degraded <dev1> <mnt>
The reason was that the counter -- bi_remaining was wrong. If the missing
or unwriteable device was the last device in the mapping array, we would
not submit the original bio, so we shouldn't increase bi_remaining of it
in btrfs_end_bio(), or we would skip the final endio handle.
Fix this problem by adding a flag into btrfs bio structure. If we submit
the original bio, we will set the flag, and we increase bi_remaining counter,
or we don't.
Though there is another way to fix it -- decrease bi_remaining counter of the
original bio when we make sure the original bio is not submitted, this method
need add more check and is easy to make mistake.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
While running balance, scrub, fsstress concurrently we hit the
following kernel crash:
[56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
[56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
[56561.524325] Oops: 0000 [#1] SMP
[....]
[56561.527237] Call Trace:
[56561.527309] [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
[56561.527392] [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
[56561.527476] [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
[56561.527561] [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
[56561.527639] [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
[56561.527712] [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
[56561.527788] [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
[56561.527870] [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
[56561.527941] [<ffffffff815707f7>] tracesys+0xdd/0xe2
[...]
[56561.528304] RIP [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.528395] RSP <ffff88004c0f5be8>
[56561.528454] CR2: 0000000000000078
This is because in btrfs_relocate_chunk(), we will free @bdev directly while
scrub may still hold extent mapping, and may access freed memory.
Fix this problem by wrapping freeing @bdev work into free_extent_map() which
is based on reference count.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When run scrub with balance, sometimes -ENOENT will be returned, since
in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
chunk is removed but not committed, -ENOENT will be returned.
However, there is no need to stop scrubbing since other chunks may be
scrubbed without problem.
So this patch changes the behavior to skip removed chunks and continue
to scrub the rest.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
When we mounted the filesystem after the crash, we got the following
message:
BTRFS error (device xxx): block group xxxx has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group xxx
It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.
There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
space cache
- account the size of the allocated space that is used to store the file
data, if the size is not zero, don't write out the free space cache.
The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch makes the free space cache write out functions more readable,
and beisdes that, it also reduces the stack space that the function --
__btrfs_write_out_cache uses from 194bytes to 144bytes.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
The lock_wq wait queue is not used anywhere, therefore just remove it.
On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
bytes down to 296 bytes, which means a 4Kb page can now be used for
13 extent buffers instead of 12.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
The Btrfs tree trylock function is poorly named. It always takes
the spinlock and backs off if the blocking lock is held. This
can lead to surprising lockups because people expect it to really be a
trylock.
This commit makes it a pure trylock, both for the spinlock and the
blocking lock. It also reworks the nested lock handling slightly to
avoid taking the read lock while a spinning write lock might be held.
Signed-off-by: Chris Mason <clm@fb.com>
Before satisfying a read with cache=loose, we should always check
that the pagecache is valid before allowing a read to be satisfied
out of it.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Steve French <smfrench@gmail.com>
Commit 561f0ed498 (nfsd4: allow large readdirs) introduces a bug
about readdir the root of pseudofs.
Call xdr_truncate_encode() revert encoded name when skipping.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If nfsd needs to recall a delegation for some reason it implies that there is
contention on the file, so further delegations should not be handed out.
The current code fails to do so, and the result is effectively a
live-lock under some workloads: a client attempting a conflicting
operation on a read-delegated file receives NFS4ERR_DELAY and retries
the operation, but by the time it retries the server may already have
given out another delegation.
We could simply avoid delegations for (say) 30 seconds after any recall, but
this is probably too heavy handed.
We could keep a list of inodes (or inode numbers or filehandles) for recalled
delegations, but that requires memory allocation and searching.
The approach taken here is to use a bloom filter to record the filehandles
which are currently blocked from delegation, and to accept the cost of a few
false positives.
We have 2 bloom filters, each of which is valid for 30 seconds. When a
delegation is recalled the filehandle is added to one filter and will remain
disabled for between 30 and 60 seconds.
We keep a count of the number of filehandles that have been added, so when
that count is zero we can bypass all other tests.
The bloom filters have 256 bits and 3 hash functions. This should allow a
couple of dozen blocked filehandles with minimal false positives. If many
more filehandles are all blocked at once, behaviour will degrade towards
rejecting all delegations for between 30 and 60 seconds, then resetting and
allowing new delegations.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This fixes use-after-free of epi->fllink.next inside list loop macro.
This loop actually releases elements in the body. The list is
rcu-protected but here we cannot hold rcu_read_lock because we need to
lock mutex inside.
The obvious solution is to use list_for_each_entry_safe(). RCU-ness
isn't essential because nobody can change this list under us, it's final
fput for this file.
The bug was introduced by ae10b2b4eb ("epoll: optimize EPOLL_CTL_DEL
using rcu")
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Stable <stable@vger.kernel.org> # 3.13+
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit d81b8a40e2
("CIFS: Cleanup cifs open codepath")
changed disposition to FILE_OPEN.
Signed-off-by: Björn Baumbach <bb@sernet.de>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Pavel Shilovsky <piastry@etersoft.ru>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Error recovery in ext4_alloc_branch() calls ext4_forget() even for
buffer corresponding to indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from cleanup in ext4_ind_map_blocks()) leading to buffer use
count misaccounting. Eventually (but often much later because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer
Another manifestation of this problem is an error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk
The fix is easy - don't forget buffer we did not allocate. Also add an
explanatory comment because the indexing at ext4_alloc_branch() is
somewhat subtle.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull more btrfs updates from Chris Mason:
"This has a few fixes since our last pull and a new ioctl for doing
btree searches from userland. It's very similar to the existing
ioctl, but lets us return larger items back down to the app"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: fix error handling in create_pending_snapshot
btrfs: fix use of uninit "ret" in end_extent_writepage()
btrfs: free ulist in qgroup_shared_accounting() error path
Btrfs: fix qgroups sanity test crash or hang
btrfs: prevent RCU warning when dereferencing radix tree slot
Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
btrfs: new ioctl TREE_SEARCH_V2
btrfs: tree_search, search_ioctl: direct copy to userspace
btrfs: new function read_extent_buffer_to_user
btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
btrfs: tree_search, search_ioctl: accept varying buffer
btrfs: tree_search: eliminate redundant nr_items check
Pull aio fix and cleanups from Ben LaHaise:
"This consists of a couple of code cleanups plus a minor bug fix"
* git://git.kvack.org/~bcrl/aio-next:
aio: cleanup: flatten kill_ioctx()
aio: report error from io_destroy() when threads race in io_destroy()
fs/aio.c: Remove ctx parameter in kiocb_cancel
fcebe456 cut and pasted some code to a later point
in create_pending_snapshot(), but didn't switch
to the appropriate error handling for this stage
of the function.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>