linux/fs
Jan Kara 5accdf82ba fs: Improve filesystem freezing handling
vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though the file system is already frozen. This creates
problems with flusher thread hanging on frozen filesystem.

Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.

This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are three such semaphores because of lock ranking reasons - one
for page fault handlers (->page_mkwrite), one for all other writers, and one of
internal filesystem purposes (used e.g. to track running transactions).  Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
takes the writer side.

Only that we don't really want to bounce cachelines of the semaphores between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_writers.frozen
superblock field.

[AV: microoptimize sb_start_write(); we want it fast in normal case]

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 09:30:13 +04:00
..
9p 9p: Push file_update_time() into v9fs_vm_page_mkwrite() 2012-07-31 01:02:46 +04:00
adfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
affs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
afs VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
autofs4 switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
befs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
bfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
btrfs btrfs: Push mnt_want_write() outside of i_mutex 2012-07-31 01:02:51 +04:00
cachefiles switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
ceph ceph: Push file_update_time() into ceph_page_mkwrite() 2012-07-31 01:02:45 +04:00
cifs VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
coda don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
configfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
cramfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
debugfs debugfs: get rid of useless arguments to debugfs_{mkdir,symlink} 2012-07-14 16:35:30 +04:00
devpts VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
dlm dlm: NULL dereference on failure in kmem_cache_create() 2012-05-15 10:39:28 -05:00
ecryptfs ecryptfs_lookup_interpose(): allocate dentry_info first 2012-07-29 21:24:17 +04:00
efs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
exofs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
exportfs switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
ext2 don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
ext3 don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
ext4 ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file() 2012-07-23 00:01:55 +04:00
fat fat: Push mnt_want_write() outside of i_mutex 2012-07-31 01:02:50 +04:00
freevxfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
fscache
fuse don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
gfs2 gfs2: Push file_update_time() into gfs2_page_mkwrite() 2012-07-31 01:02:46 +04:00
hfs hfs: get rid of hfs_sync_super 2012-07-22 23:58:09 +04:00
hfsplus hfsplus: get rid of write_super 2012-07-22 23:58:04 +04:00
hostfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
hpfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
hppfs switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
hugetlbfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
isofs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
jbd jbd: Write journal superblock with WRITE_FUA after checkpointing 2012-05-15 23:34:37 +02:00
jbd2 jbd2: use kmem_cache_zalloc wrapper instead of flag 2012-06-01 00:10:32 -04:00
jffs2 don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
jfs don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
lockd lockd: handle lockowner allocation failure in nlmclnt_proc() 2012-07-29 23:17:39 +04:00
logfs VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
minix don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
ncpfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
nfs VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
nfs_common
nfsd nfsd: Push mnt_want_write() outside of i_mutex 2012-07-31 01:02:51 +04:00
nilfs2 VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
nls nls: fix (and rename) mac NLS table files and config options 2012-06-01 19:51:22 -07:00
notify switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
ntfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
ocfs2 pull mnt_want_write()/mnt_drop_write() into kern_path_create()/done_path_create() resp. 2012-07-29 21:24:15 +04:00
omfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
openpromfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
proc VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
pstore staging tree fixes for 3.5-rc4 2012-06-20 15:15:03 -07:00
qnx4 stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
qnx6 stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
quota quota: Split dquot_quota_sync() to writeback and cache flushing part 2012-07-22 23:58:19 +04:00
ramfs don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
reiserfs don't expose I_NEW inodes via dentry->d_inode 2012-07-23 00:00:58 +04:00
romfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
squashfs stop passing nameidata to ->lookup() 2012-07-14 16:34:32 +04:00
sysfs sysfs: Push file_update_time() into bin_page_mkwrite() 2012-07-31 01:02:47 +04:00
sysv fs/sysv: stop using write_super and s_dirt 2012-07-22 23:58:12 +04:00
ubifs VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
udf don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
ufs fs/ufs: get rid of write_super 2012-07-22 23:58:16 +04:00
xfs switch dentry_open() to struct path, make it grab references itself 2012-07-23 00:01:29 +04:00
aio.c aio: now fput() is OK from interrupt context; get rid of manual delayed __fput() 2012-07-22 23:57:59 +04:00
anon_inodes.c
attr.c notify_change(): check that i_mutex is held 2012-07-14 16:35:42 +04:00
bad_inode.c don't pass nameidata to ->create() 2012-07-14 16:34:47 +04:00
binfmt_aout.c VM: add "vm_mmap()" helper function 2012-04-20 17:29:13 -07:00
binfmt_elf_fdpic.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-05-23 17:42:39 -07:00
binfmt_elf.c binfmt_elf: switch elf_map() to vm_mmap/vm_munmap 2012-05-30 21:04:55 -04:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: use vm_munmap, we are missing ->mmap_sem there 2012-05-30 21:04:56 -04:00
binfmt_misc.c vfs: Rename end_writeback() to clear_inode() 2012-05-06 13:43:41 +08:00
binfmt_script.c
binfmt_som.c VM: add "vm_mmap()" helper function 2012-04-20 17:29:13 -07:00
bio-integrity.c
bio.c Merge branch 'for-3.5/core' of git://git.kernel.dk/linux-block 2012-05-30 08:52:42 -07:00
block_dev.c vfs: Create function for iterating over block devices 2012-07-22 23:58:45 +04:00
buffer.c fs: Push file_update_time() into __block_page_mkwrite() 2012-07-31 01:02:44 +04:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2012-06-01 11:53:44 -07:00
dcache.c __d_unalias() should refuse to move mountpoints 2012-07-14 16:35:15 +04:00
dcookies.c
direct-io.c fs/direct-io.c: adjust suspicious bit operation 2012-07-14 16:32:46 +04:00
drop_caches.c
eventfd.c eventfd: change int to __u64 in eventfd_signal() 2012-05-31 17:49:32 -07:00
eventpoll.c HAVE_RESTORE_SIGMASK is defined on all architectures now 2012-06-01 12:58:46 -04:00
exec.c consolidate pipe file creation 2012-07-29 21:24:19 +04:00
fcntl.c switch fcntl to fget_raw_light/fput_light 2012-05-29 23:28:30 -04:00
fhandle.c
fifo.c
file_table.c uninline file_free_rcu() 2012-07-29 21:24:17 +04:00
file.c Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
filesystems.c
fs_struct.c get rid of ->mnt_longterm 2012-07-14 16:32:47 +04:00
fs-writeback.c vfs: Move noop_backing_dev_info check from sync into writeback 2012-07-22 23:58:18 +04:00
generic_acl.c
inode.c vfs: switch i_dentry/d_alias to hlist 2012-07-14 16:32:55 +04:00
internal.h VFS: Split inode_permission() 2012-07-14 16:38:36 +04:00
ioctl.c
ioprio.c Merge branch 'for-3.5/core' of git://git.kernel.dk/linux-block 2012-05-30 08:52:42 -07:00
Kconfig
Kconfig.binfmt C6X: add support to build with BINFMT_ELF_FDPIC 2012-05-15 09:17:34 -04:00
libfs.c VFS: Pass mount flags to sget() 2012-07-14 16:38:34 +04:00
locks.c Remove easily user-triggerable BUG from generic_setlease 2012-07-13 10:50:23 -07:00
Makefile
mbcache.c
mount.h get rid of magic in proc_namespace.c 2012-07-14 16:32:48 +04:00
mpage.c
namei.c fs: Push mnt_want_write() outside of i_mutex 2012-07-31 01:02:49 +04:00
namespace.c VFS: Comment mount following code 2012-07-14 16:38:32 +04:00
no-block.c
open.c take grabbing f->f_path to do_dentry_open() 2012-07-29 21:24:18 +04:00
pipe.c consolidate pipe file creation 2012-07-29 21:24:19 +04:00
pnode.c VFS: Make clone_mnt()/copy_tree()/collect_mounts() return errors 2012-07-14 16:37:27 +04:00
pnode.h
posix_acl.c
proc_namespace.c get rid of magic in proc_namespace.c 2012-07-14 16:32:48 +04:00
read_write.c vfs: allow custom EOF in generic_file_llseek code 2012-07-23 00:00:15 +04:00
read_write.h
readdir.c switch readdir/getdents to fget_light/fput_light 2012-05-29 23:28:29 -04:00
select.c HAVE_RESTORE_SIGMASK is defined on all architectures now 2012-06-01 12:58:46 -04:00
seq_file.c
signalfd.c switch signalfd4() to fget_light/fput_light 2012-05-29 23:28:30 -04:00
splice.c splice: fix racy pipe->buffers uses 2012-06-13 21:16:42 +02:00
stack.c
stat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-05-23 17:42:39 -07:00
statfs.c switch statfs to fget_light/fput_light 2012-05-29 23:28:31 -04:00
super.c fs: Improve filesystem freezing handling 2012-07-31 09:30:13 +04:00
sync.c vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes 2012-07-22 23:59:01 +04:00
timerfd.c
utimes.c switch utimes() to fget_light/fput_light 2012-05-29 23:28:32 -04:00
xattr_acl.c
xattr.c switch xattr syscalls to fget_light/fput_light 2012-05-29 23:28:30 -04:00