linux/fs/ocfs2
Junxiao Bi c43c363def ocfs2: o2net: don't shutdown connection when idle timeout
This patch series is to fix a possible message lost bug in ocfs2 when
network go bad.  This bug will cause ocfs2 hung forever even network
become good again.

The messages may lost in this case.  After the tcp connection is
established between two nodes, an idle timer will be set to check its
state periodically, if no messages are received during this time, idle
timer will timeout, it will shutdown the connection and try to
reconnect, so pending messages in tcp queues will be lost.  This
messages may be from dlm.  Dlm may get hung in this case.  This may
cause the whole ocfs2 cluster hung.

This is very possible to happen when network state goes bad.  Do the
reconnect is useless, it will fail if network state is still bad.  Just
waiting there for network recovering may be a good idea, it will not
lost messages and some node will be fenced until cluster goes into
split-brain state, for this case, Tcp user timeout is used to override
the tcp retransmit timeout.  It will timeout after 25 days, user should
have notice this through the provided log and fix the network, if they
don't, ocfs2 will fall back to original reconnect way.

This patch (of 3):

Some messages in the tcp queue maybe lost if we shutdown the connection
and reconnect when idle timeout.  If packets lost and reconnect success,
then the ocfs2 cluster maybe hung.

To fix this, we can leave the connection there and do the fence decision
when idle timeout, if network recover before fence dicision is made, the
connection survive without lost any messages.

This bug can be saw when network state go bad.  It may cause ocfs2 hung
forever if some packets lost.  With this fix, ocfs2 will recover from
hung if network becomes good again.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29 16:28:16 -07:00
..
cluster ocfs2: o2net: don't shutdown connection when idle timeout 2014-08-29 16:28:16 -07:00
dlm ocfs2: race between umount and unfinished remastering during recovery 2014-08-06 18:01:13 -07:00
dlmfs ocfs2: remove versioning information 2014-01-21 16:19:41 -08:00
acl.c ocfs2: call ocfs2_update_inode_fsync_trans when updating any inode 2014-04-03 16:20:56 -07:00
acl.h ocfs2: use generic posix ACL infrastructure 2014-01-25 23:58:21 -05:00
alloc.c ocfs2: correctly check the return value of ocfs2_search_extent_list 2014-08-06 18:01:13 -07:00
alloc.h
aops.c switch {__,}blockdev_direct_IO() to iov_iter 2014-05-06 17:32:46 -04:00
aops.h ocfs2: change ip_unaligned_aio to of type mutex from atomit_t 2014-04-03 16:20:53 -07:00
blockcheck.c
blockcheck.h
buffer_head_io.c ocfs2: do not put bh when buffer_uptodate failed 2014-04-03 16:20:56 -07:00
buffer_head_io.h
dcache.c ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock 2014-04-03 16:20:55 -07:00
dcache.h ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock 2014-04-03 16:20:55 -07:00
dir.c ocfs2: call ocfs2_update_inode_fsync_trans when updating any inode 2014-04-03 16:20:56 -07:00
dir.h [readdir] convert ocfs2 2013-06-29 12:57:02 +04:00
dlmglue.c ocfs2: remove some unused code 2014-06-04 16:53:55 -07:00
dlmglue.h ocfs2: avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread 2014-04-03 16:20:55 -07:00
export.c
export.h
extent_map.c ocfs2: fix the end cluster offset of FIEMAP 2013-09-11 15:56:53 -07:00
extent_map.h
file.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
file.h
heartbeat.c
heartbeat.h
inode.c mm + fs: store shadow entries in page cache 2014-04-03 16:21:01 -07:00
inode.h ocfs2: remove OCFS2_INODE_SKIP_DELETE flag 2014-04-03 16:20:54 -07:00
ioctl.c ocfs2: do not write error flag to user structure we cannot copy from/to 2014-08-29 16:28:16 -07:00
ioctl.h
journal.c ocfs2: limit printk when journal is aborted 2014-06-04 16:53:54 -07:00
journal.h ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_file 2014-04-03 16:20:53 -07:00
Kconfig
localalloc.c ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters 2014-02-06 13:48:51 -08:00
localalloc.h ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters 2014-02-06 13:48:51 -08:00
locks.c ocfs2: flock: drop cross-node lock when failed locally 2014-04-03 16:20:56 -07:00
locks.h
Makefile ocfs2: remove versioning information 2014-01-21 16:19:41 -08:00
mmap.c
mmap.h
move_extents.c ocfs2: correctly check the return value of ocfs2_search_extent_list 2014-08-06 18:01:13 -07:00
move_extents.h
namei.c ocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknod 2014-06-23 16:47:45 -07:00
namei.h
ocfs1_fs_compat.h
ocfs2_fs.h
ocfs2_ioctl.h
ocfs2_lockid.h
ocfs2_lockingver.h
ocfs2_trace.h ocfs2: fix a tiny race when running dirop_fileop_racer 2014-06-23 16:47:45 -07:00
ocfs2.h ocfs2: fix umount hang while shutting down truncate log 2014-06-04 16:53:54 -07:00
quota_global.c ocfs2: implement delayed dropping of last dquot reference 2014-04-03 16:20:54 -07:00
quota_local.c ocfs2: fix quota file corruption 2014-03-04 07:55:48 -08:00
quota.h ocfs2: implement delayed dropping of last dquot reference 2014-04-03 16:20:54 -07:00
refcounttree.c ocfs2: correctly check the return value of ocfs2_search_extent_list 2014-08-06 18:01:13 -07:00
refcounttree.h ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page 2013-08-13 17:57:49 -07:00
reservations.c
reservations.h
resize.c ocfs2: fix incorrect i_size of global bitmap inode after resize 2014-06-04 16:53:54 -07:00
resize.h
slot_map.c fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc 2014-08-06 18:01:13 -07:00
slot_map.h
stack_o2cb.c ocfs2: pass ocfs2_cluster_connection to ocfs2_this_node 2014-01-21 16:19:41 -08:00
stack_user.c ocfs2: fix sparse non static symbol warning 2014-01-21 16:19:42 -08:00
stackglue.c ocfs2: remove NULL assignments on static 2014-06-04 16:53:53 -07:00
stackglue.h ocfs2: pass ocfs2_cluster_connection to ocfs2_this_node 2014-01-21 16:19:41 -08:00
suballoc.c ocfs2: iput inode alloc when failed locally 2014-04-03 16:20:57 -07:00
suballoc.h ocfs2: rollback alloc_dinode counts when ocfs2_block_group_set_bits() failed 2014-04-03 16:20:56 -07:00
super.c ocfs2: revert "ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously" 2014-06-23 16:47:45 -07:00
super.h
symlink.c
symlink.h
sysfile.c ocfs2: avoid system inode ref confusion by adding mutex lock 2014-04-03 16:20:57 -07:00
sysfile.h
uptodate.c ocfs2: remove NULL assignments on static 2014-06-04 16:53:53 -07:00
uptodate.h
xattr.c ocfs2: pass "new" parameter to ocfs2_init_xattr_bucket 2014-04-03 16:20:57 -07:00
xattr.h ocfs2: use generic posix ACL infrastructure 2014-01-25 23:58:21 -05:00