linux/fs
Liu Bo 9e0af23764 Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work->func() <---- (we name it work X)
    for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() <--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker->current_work = arg;  <-- arg is A->normal_work
	worker->current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A->func()
		     A->ordered_func()
		     A->ordered_free()  <-- A gets freed

		     B->ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   <-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-24 07:17:02 -07:00
..
9p Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
adfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
affs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
afs AFS: Correctly assemble the client UUID 2014-07-29 10:14:36 -07:00
autofs4 autofs4: fix false positive compile error 2014-07-03 09:21:53 -07:00
befs fs/befs: kernel-doc fixes 2014-06-06 16:08:09 -07:00
bfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
btrfs Btrfs: fix task hang under heavy compressed write 2014-08-24 07:17:02 -07:00
cachefiles fs/cachefiles: replace kerror by pr_err 2014-06-06 16:08:14 -07:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2014-06-12 23:06:23 -07:00
cifs [CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option 2014-06-24 08:10:24 -05:00
coda coda: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
configfs fs/configfs: use pr_fmt 2014-06-04 16:53:53 -07:00
cramfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
debugfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
devpts fs/devpts/inode.c: convert printk to pr_foo() 2014-06-06 16:08:14 -07:00
dlm dlm: keep listening connection alive with sctp mode 2014-06-12 10:26:14 -05:00
ecryptfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
efivarfs fs/efivarfs/super.c: use static const for dentry_operations 2014-06-04 16:54:14 -07:00
efs fs/efs: convert printk(KERN_DEBUG to pr_debug 2014-06-04 16:54:21 -07:00
exofs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
exportfs fs/exportfs/expfs.c: kernel-doc warning fixes 2014-06-04 16:54:14 -07:00
ext2 ->splice_write() via ->write_iter() 2014-06-12 00:18:51 -04:00
ext3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
ext4 ext4: fix potential null pointer dereference in ext4_free_inode 2014-07-12 16:11:42 -04:00
f2fs f2fs: avoid to access NULL pointer in issue_flush_thread 2014-07-09 05:59:55 -07:00
fat Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
freevxfs Major changes for 3.14 include support for the newly added ZERO_RANGE 2014-04-04 15:39:39 -07:00
fscache fscache: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
fuse fuse: add FUSE_NO_OPEN_SUPPORT flag to INIT 2014-07-22 16:37:43 +02:00
gfs2 GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixes 2014-07-18 11:15:14 +01:00
hfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
hfsplus Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
hostfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
hpfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
hppfs
hugetlbfs fs/hugetlbfs/inode.c: remove null test before kfree 2014-06-04 16:54:11 -07:00
isofs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2014-04-07 17:59:17 -07:00
jbd fs/jbd/revoke.c: replace shift loop by ilog2 2014-05-21 10:26:13 +02:00
jbd2 ext4: disable synchronous transaction batching if max_batch_time==0 2014-07-05 19:18:22 -04:00
jffs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
jfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
kernfs Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-07-10 11:38:23 -07:00
lockd Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux 2014-06-10 11:50:57 -07:00
logfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
minix write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
ncpfs fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul 2014-06-04 16:54:21 -07:00
nfs NFS: Don't reset pg_moreio in __nfs_pageio_add_request 2014-07-13 15:18:44 -04:00
nfs_common
nfsd NFSD: Fix crash encoding lock reply on 32-bit 2014-07-23 10:31:56 -04:00
nilfs2 write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
nls
notify inotify: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
ntfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
ocfs2 ocfs2/dlm: do not purge lockres that is queued for assert master 2014-06-23 16:47:45 -07:00
omfs write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
openpromfs fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
proc /proc/stat: convert to single_open_size() 2014-07-03 09:21:54 -07:00
pstore fs/pstore: logging clean-up 2014-06-06 16:08:13 -07:00
qnx4 fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
qnx6 fs: push sync_filesystem() down to the file system's remount_fs() 2014-03-13 10:14:33 -04:00
quota quota: missing lock in dqcache_shrink_scan() 2014-07-15 22:36:18 +02:00
ramfs ->splice_write() via ->write_iter() 2014-06-12 00:18:51 -04:00
reiserfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
romfs switch simple generic_file_aio_read() users to ->read_iter() 2014-05-06 17:37:55 -04:00
squashfs fs/squashfs/squashfs.h: replace pr_warning by pr_warn 2014-06-04 16:53:52 -07:00
sysfs kernfs: move the last knowledge of sysfs out from kernfs 2014-06-03 08:11:18 -07:00
sysv write_iter variants of {__,}generic_file_aio_write() 2014-05-06 17:38:00 -04:00
ubifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
udf udf: switch to ->write_iter() 2014-05-06 17:39:36 -04:00
ufs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
xfs xfs: null unused quota inodes when quota is on 2014-07-15 07:28:41 +10:00
aio.c aio: protect reqs_available updates from changes in interrupt handlers 2014-07-14 13:05:26 -04:00
anon_inodes.c vfs: Allocate anon_inode_inode in anon_inode_init() 2014-03-27 09:52:54 -07:00
attr.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c Merge branch 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-05 08:05:29 -07:00
binfmt_em86.c
binfmt_flat.c fs/binfmt_flat.c: make old_reloc() static 2014-06-04 16:54:21 -07:00
binfmt_misc.c binfmt_misc: add missing 'break' statement 2014-04-03 16:21:16 -07:00
binfmt_script.c
binfmt_som.c
block_dev.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
buffer.c mm: non-atomically mark page accessed during page cache allocation where possible 2014-06-04 16:54:10 -07:00
char_dev.c
compat_binfmt_elf.c binfmt_elf: add ELF_HWCAP2 to compat auxv entries 2014-03-04 08:05:21 +00:00
compat_ioctl.c fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types 2014-03-06 16:30:44 +01:00
compat.c locks: rename file-private locks to "open file description locks" 2014-04-22 08:23:58 -04:00
coredump.c coredump: fix the setting of PF_DUMPCORE 2014-07-23 15:10:54 -07:00
dcache.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
dcookies.c fs/compat: fix lookup_dcookie() parameter handling 2014-01-29 16:22:40 -08:00
direct-io.c direct-io: fix AIO regression 2014-08-01 02:35:51 -04:00
drop_caches.c fs: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
eventfd.c
eventpoll.c epoll: fix use-after-free in eventpoll_release_file 2014-06-16 17:21:59 -10:00
exec.c perf: Differentiate exec() and non-exec() comm events 2014-06-06 07:56:22 +02:00
fcntl.c locks: rename file-private locks to "open file description locks" 2014-04-22 08:23:58 -04:00
fhandle.c
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-06-12 10:30:18 -07:00
file.c fs/file.c: don't open-code kvfree() 2014-05-06 17:31:10 -04:00
filesystems.c sys_sysfs: Add CONFIG_SYSFS_SYSCALL 2014-04-03 16:21:05 -07:00
fs_struct.c
fs-writeback.c One of the main highlights this time, is not the patches themselves 2014-04-04 14:49:16 -07:00
inode.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-10 13:57:22 -07:00
internal.h
ioctl.c
Kconfig kernfs: add CONFIG_KERNFS 2014-02-07 16:08:57 -08:00
Kconfig.binfmt
libfs.c fs/libfs.c: add generic data flush to fsync 2014-06-04 16:53:55 -07:00
locks.c locks: set fl_owner for leases back to current->files 2014-06-10 12:29:05 -04:00
Makefile block: move ioprio.c from fs/ to block/ 2014-05-19 11:02:18 -06:00
mbcache.c fs/mbcache: replace __builtin_log2() with ilog2() 2014-06-25 22:08:29 -04:00
mount.h reduce m_start() cost... 2014-04-01 23:19:09 -04:00
mpage.c fs/block_dev.c: add bdev_read_page() and bdev_write_page() 2014-06-04 16:54:02 -07:00
namei.c fs: umount on symlink leaks mnt count 2014-07-24 06:18:12 -04:00
namespace.c VFS: Make delayed_free() call free_vfsmnt() 2014-04-01 23:19:18 -04:00
no-block.c
open.c vfs: fix check for fallocate on active swapfile 2014-08-01 02:36:04 -04:00
pipe.c new helper: copy_page_from_iter() 2014-05-06 17:39:42 -04:00
pnode.c smarter propagate_mnt() 2014-04-01 23:19:08 -04:00
pnode.h smarter propagate_mnt() 2014-04-01 23:19:08 -04:00
posix_acl.c posix_acl: handle NULL ACL in posix_acl_equiv_mode 2014-05-06 13:58:42 -04:00
proc_namespace.c reduce m_start() cost... 2014-04-01 23:19:09 -04:00
read_write.c switch simple generic_file_aio_read() users to ->read_iter() 2014-05-06 17:37:55 -04:00
readdir.c fanotify: create FAN_ACCESS event for readdir 2014-06-04 16:53:52 -07:00
select.c
seq_file.c fs/seq_file: fallback to vmalloc allocation 2014-07-03 09:21:54 -07:00
signalfd.c
splice.c Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus 2014-06-12 00:28:09 -04:00
stack.c
stat.c
statfs.c
super.c fs/superblock: avoid locking counting inodes and dentries before reclaiming them 2014-06-04 16:54:11 -07:00
sync.c Revert "writeback: do not sync data dirtied after sync start" 2014-02-22 02:02:28 +01:00
timerfd.c
utimes.c
xattr.c simple_xattr: permit 0-size extended attributes 2014-07-23 15:10:55 -07:00