linux/fs
Christoph Hellwig 2daea67e96 fsync: wait for data writeout completion before calling ->fsync
Currently vfs_fsync(_range) first calls filemap_fdatawrite to write out
the data, then calls into ->fsync to write out the metadata and then finally
calls filemap_fdatawait to wait for the data I/O to complete.  What sounds
like a clever micro-optimization actually is a nasty trap for many filesystems.
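
The two call orders can be sketched in plain userspace C.  This is only a
model: the model_* functions are hypothetical stand-ins that record a tag in
a trace, not the kernel helpers themselves, and the tags W, M and C are
made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the helpers named in the text above; each one
 * only appends a tag to a trace so the call order can be inspected. */
static char trace[16];

static void step(const char *tag)
{
	assert(strlen(trace) + strlen(tag) < sizeof(trace));
	strcat(trace, tag);
}

static void model_fdatawrite(void) { step("W"); } /* start data writeout     */
static void model_fsync_op(void)   { step("M"); } /* ->fsync: write metadata */
static void model_fdatawait(void)  { step("C"); } /* wait for data I/O       */

/* Ordering described as the current behaviour: metadata is written while
 * the data I/O may still be in flight. */
const char *old_order(void)
{
	trace[0] = '\0';
	model_fdatawrite();
	model_fsync_op();
	model_fdatawait();
	return trace;
}

/* Ordering after this patch: data I/O completes before ->fsync runs. */
const char *new_order(void)
{
	trace[0] = '\0';
	model_fdatawrite();
	model_fdatawait();
	model_fsync_op();
	return trace;
}
```

In this model old_order() yields the trace "WMC" and new_order() yields
"WCM"; the only difference is whether the metadata step runs before or after
the wait.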

For many modern filesystems i_size or other inode information is only
updated on I/O completion and we need to wait for I/O to finish before
we can write out the metadata.  For old-fashioned filesystems that
instantiate blocks during the actual write and also update the metadata
at that point it opens up a large window where we could expose uninitialized
blocks after a crash.  While a few filesystems that need it already wait
for the I/O to finish inside their ->fsync methods it is rather suboptimal
as it is done under the i_mutex and also always for the whole file instead
of just a part as we could do for O_SYNC handling.
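
A toy model of the late i_size update may make the problem concrete.  It
assumes a simplified inode whose in-memory size only changes when the data
I/O completes; struct toy_inode and all helper names are invented for this
sketch and do not correspond to kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode: all fields and helpers here are illustrative, not kernel API. */
struct toy_inode {
	long i_size;        /* in-memory size, updated at I/O completion */
	long pending_size;  /* size the in-flight write will establish   */
	bool io_in_flight;
	long disk_size;     /* size recorded in the on-disk metadata     */
};

static void start_writeout(struct toy_inode *ino, long new_size)
{
	assert(new_size >= ino->i_size); /* model only handles extending writes */
	ino->pending_size = new_size;
	ino->io_in_flight = true;
}

/* Models I/O completion: only now does i_size reflect the new data. */
static void wait_for_writeout(struct toy_inode *ino)
{
	if (ino->io_in_flight) {
		ino->i_size = ino->pending_size;
		ino->io_in_flight = false;
	}
}

/* Stand-in for a ->fsync method that only writes out metadata. */
static void write_metadata(struct toy_inode *ino)
{
	ino->disk_size = ino->i_size;
}

/* Returns the size recorded on disk once the modelled fsync returns. */
long fsync_old_order(long old_size, long new_size)
{
	struct toy_inode ino = { .i_size = old_size, .disk_size = old_size };

	start_writeout(&ino, new_size);
	write_metadata(&ino);     /* sees the stale i_size */
	wait_for_writeout(&ino);
	return ino.disk_size;
}

long fsync_new_order(long old_size, long new_size)
{
	struct toy_inode ino = { .i_size = old_size, .disk_size = old_size };

	start_writeout(&ino, new_size);
	wait_for_writeout(&ino);  /* i_size is updated here */
	write_metadata(&ino);
	return ino.disk_size;
}
```

With a file growing from 4096 to 8192 bytes, fsync_old_order(4096, 8192)
leaves the old size of 4096 in the modelled on-disk metadata, while
fsync_new_order(4096, 8192) records 8192.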

Here is a small audit of all fsync instances in the tree:

 - spufs_mfc_fsync:
 - ps3flash_fsync:
 - vol_cdev_fsync:
 - printer_fsync:
 - fb_deferred_io_fsync:
 - bad_file_fsync:
 - simple_sync_file:

	don't care - filesystems/drivers don't use the page cache or are
	purely in-memory.

 - simple_fsync:
 - file_fsync:
 - affs_file_fsync:
 - fat_file_fsync:
 - jfs_fsync:
 - ubifs_fsync:
 - reiserfs_dir_fsync:
 - reiserfs_sync_file:

	never touch pagecache themselves.  We need to wait first if we do
	not want to expose stale data after an allocation.

 - afs_fsync:
 - fuse_fsync_common:

	do the writeback and waiting themselves in awkward ways, would
	benefit from proper semantics

 - block_fsync:

	Does a filemap_write_and_wait on the block device inode.  Because we
	now have f_mapping, that is the same inode vfs_fsync operates on.
	So just removing it and letting the VFS do the work in one go would
	be an improvement.

 - btrfs_sync_file:
 - cifs_fsync:
 - xfs_file_fsync:

	need the wait first and currently do it themselves. would benefit from
	doing it outside i_mutex.

 - coda_fsync:
 - ecryptfs_fsync:
 - exofs_file_fsync:
 - shm_fsync:

	only pass the fsync through to the lower layer

 - ext3_sync_file:

	doesn't seem to care, comments are confusing.

 - ext4_sync_file:

	would need the wait to work correctly for delalloc mode with late
	i_size updates.  Otherwise the ext3 comment applies.

	currently implements its own writeback and wait in an odd way,
	could benefit from doing it properly.

 - gfs2_fsync:

	not needed for journaled data mode, but probably harmless there.
	Currently writes back data asynchronously itself.  Needs some
	major audit.

 - hostfs_fsync:

	just calls fsync/datasync on the host FD.  Without the wait before
	data might not even be inflight yet if we're unlucky.

 - hpfs_file_fsync:
 - ncp_fsync:

	no-ops.  Dangerous before and after.

 - jffs2_fsync:

	just calls jffs2_flush_wbuf_gc, not sure how this relates to data.

 - nfs_fsync_dir:

	just increments stats, claims all directory operations are synchronous

 - nfs_file_fsync:

	only writes out data???  Looks very odd.

 - nilfs_sync_file:

	looks like it expects all data done, but not sure from the code

 - ntfs_dir_fsync:
 - ntfs_file_fsync:

	appear to do their own data writeback.  Very convoluted code.

 - ocfs2_sync_file:

	does its own data writeback, but no wait.  probably needs the wait.

 - smb_fsync:

	according to a comment expects all pages written already, probably needs
	the wait before.

This patch only changes vfs_fsync_range, removal of the wait in the methods
that have it is left to the filesystem maintainers.  Note that most
filesystems really do need an audit for their fsync methods given the
gems found in this very brief audit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-14 17:08:17 +02:00
..
9p 9p: remove unnecessary v9fses->options which duplicates the mount string 2009-08-17 16:42:28 -05:00
adfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
affs affs: add ->sync_fs 2009-06-11 21:36:14 -04:00
afs AFS: Stop readlink() on AFS crashing due to NULL 'file' ptr 2009-08-27 12:22:08 -07:00
autofs switch follow_down() 2009-06-11 21:36:01 -04:00
autofs4 autofs4 - fix missed case when changing to use struct path 2009-08-31 17:44:05 -10:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-06-17 08:46:57 -07:00
bfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
btrfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
cachefiles enforce ->sync_fs is only called for rw superblock 2009-06-11 21:36:06 -04:00
cifs [CIFS] Update readme to reflect forceuid mount parms 2009-08-04 03:53:28 +00:00
coda splice: implement default splice_read method 2009-05-11 14:13:10 +02:00
configfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
cramfs fs/cramfs: return f_fsid for statfs(2) 2009-04-02 19:05:08 -07:00
debugfs debugfs: use specified mode to possibly mark files read/write only 2009-06-15 21:30:28 -07:00
devpts devpts: remove module-related code 2009-06-24 08:15:24 -04:00
dlm dlm: free socket in error exit path 2009-07-14 12:28:43 -05:00
ecryptfs eCryptfs: parse_tag_3_packet check tag 3 packet encrypted key size 2009-07-28 14:26:06 -07:00
efs get rid of BKL in fs/efs 2009-06-17 00:36:36 -04:00
exofs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
exportfs
ext2 ext2: Update comment about generic_osync_inode 2009-09-14 17:08:16 +02:00
ext3 ext3: Remove syncing logic from ext3_file_write 2009-09-14 17:08:16 +02:00
ext4 ext4: Remove syncing logic from ext4_file_write 2009-09-14 17:08:16 +02:00
fat fat: Opencode sync_page_range_nolock() 2009-09-14 17:08:17 +02:00
freevxfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
fscache FS-Cache: Fixup renamed filenames in comments in internal.h 2009-05-27 10:20:13 -07:00
fuse writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
gfs2 GFS2: Fix permissions on "recover" file 2009-08-14 14:04:46 +01:00
hfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hfsplus headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hostfs hostfs: set maximum filesize in superblock for proper LFS support 2009-06-30 18:56:03 -07:00
hpfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hppfs hppfs: hppfs_read_file() may return -ERROR 2009-04-02 19:04:53 -07:00
hugetlbfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
isofs isofs: fix Joliet regression 2009-07-10 19:18:59 -07:00
jbd jbd: fix race between write_metadata_buffer and get_write_access 2009-07-21 11:54:42 +02:00
jbd2 jbd2: fix race between write_metadata_buffer and get_write_access 2009-07-13 17:55:35 -04:00
jffs2 jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()' 2009-09-08 11:09:04 -07:00
jfs jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()' 2009-09-08 11:09:04 -07:00
lockd lockd: Replace nsm_display_address() with rpc_ntop() 2009-08-09 15:09:39 -04:00
minix Making fs/minix/minix.h double including safe 2009-06-22 11:34:42 -07:00
ncpfs NLS: update handling of Unicode 2009-06-15 21:44:43 -07:00
nfs Merge branch 'nfs-for-2.6.32' 2009-09-11 14:59:37 -04:00
nfs_common
nfsd Merge branch 'nfs-for-2.6.32' 2009-09-11 14:59:37 -04:00
nilfs2 nilfs2: fix preempt count underflow in nilfs_btnode_prepare_change_key 2009-08-31 12:03:06 +09:00
nls NLS: update handling of Unicode 2009-06-15 21:44:43 -07:00
notify inotify: update the group mask on mark addition 2009-08-28 12:51:14 -04:00
ntfs ntfs: Use new syncing helpers and update comments 2009-09-14 17:08:16 +02:00
ocfs2 ocfs2: Update syncing after splicing to match generic version 2009-09-14 17:08:16 +02:00
omfs switch omfs to simple_fsync() 2009-06-11 21:36:13 -04:00
openpromfs
partitions partitions: fix broken uevent_suppress conversion 2009-07-12 13:02:09 -07:00
proc mm: revert "oom: move oom_adj value" 2009-08-18 16:31:13 -07:00
qnx4 fs/qnx4: sanitize includes 2009-06-11 21:36:12 -04:00
quota quota: Silence lockdep on quota_on 2009-07-30 17:31:23 +02:00
ramfs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
reiserfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
romfs ROMFS: romfs_dev_read() error ignored 2009-05-09 10:49:41 -04:00
smbfs push BKL down into ->put_super 2009-06-11 21:36:07 -04:00
squashfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
sysfs Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block 2009-09-11 09:17:05 -07:00
sysv get rid of BKL in fs/sysv 2009-06-17 00:36:37 -04:00
ubifs writeback: add name to backing_dev_info 2009-09-11 09:20:26 +02:00
udf udf: Fix loading of VAT inode when drive wrongly reports number of recorded blocks 2009-07-30 17:28:26 +02:00
ufs ufs: sector_t cannot be negative 2009-06-18 13:03:46 -07:00
xfs xfs: Convert sync_page_range() to simple filemap_write_and_wait_range() 2009-09-14 17:08:17 +02:00
aio.c eventfd: revised interface and cleanups 2009-06-30 18:55:58 -07:00
anon_inodes.c fs: Provide empty .set_page_dirty() aop for anon inodes 2009-06-18 14:46:10 +02:00
attr.c vfs: Use lowercase names of quota functions 2009-03-26 02:18:35 +01:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c elf_core_dump: use rcu_read_lock() to access ->real_parent 2009-06-18 13:03:52 -07:00
binfmt_elf.c binfmt_elf: fix PT_INTERP bss handling 2009-09-10 20:11:12 +10:00
binfmt_em86.c
binfmt_flat.c flat: fix uninitialized ptr with shared libs 2009-08-07 10:39:57 -07:00
binfmt_misc.c
binfmt_script.c
binfmt_som.c Don't crap into descriptor table in binfmt_som 2009-03-31 23:00:28 -04:00
bio-integrity.c block: Create bip slabs with embedded integrity vectors 2009-07-01 10:56:25 +02:00
bio.c block: fix sg SG_DXFER_TO_FROM_DEV regression 2009-07-10 20:31:53 +02:00
block_dev.c vfs: Rename generic_file_aio_write_nolock 2009-09-14 17:08:15 +02:00
buffer.c writeback: switch to per-bdi threads for flushing data 2009-09-11 09:20:25 +02:00
char_dev.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6 2009-09-11 09:19:35 -07:00
compat_binfmt_elf.c
compat_ioctl.c compat_ioctl: hook up compat handler for FIEMAP ioctl 2009-08-07 10:39:56 -07:00
compat.c exec: do not sleep in TASK_TRACED under ->cred_guard_mutex 2009-09-05 11:30:42 -07:00
dcache.c sched: Pull up the might_sleep() check into cond_resched() 2009-07-18 15:51:44 +02:00
dcookies.c
direct-io.c block: Do away with the notion of hardsect_size 2009-05-22 23:22:54 +02:00
drop_caches.c mm: remove __invalidate_mapping_pages variant 2009-06-16 19:47:43 -07:00
eventfd.c eventfd: revised interface and cleanups 2009-06-30 18:55:58 -07:00
eventpoll.c epoll: fix nested calls support 2009-06-18 13:03:41 -07:00
exec.c exec: do not sleep in TASK_TRACED under ->cred_guard_mutex 2009-09-05 11:30:42 -07:00
fcntl.c headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
fifo.c
file_table.c fs: move mark_files_ro into file_table.c 2009-06-11 21:36:02 -04:00
file.c
filesystems.c fs: Mark get_filesystem_list() as __init function. 2009-04-20 23:02:52 -04:00
fs_struct.c Get rid of indirect include of fs_struct.h 2009-03-31 23:00:27 -04:00
fs-writeback.c vfs: Remove generic_osync_inode() and sync_page_range{_nolock}() 2009-09-14 17:08:17 +02:00
generic_acl.c New helper - current_umask() 2009-03-31 23:00:26 -04:00
inode.c vfs: add __destroy_inode 2009-08-07 14:38:29 -03:00
internal.h Trim a bit of crap from fs.h 2009-06-11 21:36:07 -04:00
ioctl.c fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls 2009-06-24 08:15:27 -04:00
ioprio.c
Kconfig fs/Kconfig: move nilfs2 out 2009-07-14 12:34:17 +09:00
Kconfig.binfmt
libfs.c vfs: make get_sb_pseudo set s_maxbytes to value that can be cast to signed 2009-08-18 16:31:12 -07:00
locks.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-11 13:23:18 -07:00
Makefile nilfs2: update makefile and Kconfig 2009-04-07 08:31:16 -07:00
mbcache.c
mpage.c ext4: Properly initialize the buffer_head state 2009-05-13 15:13:42 -04:00
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 2009-09-11 08:55:49 -07:00
namespace.c vfs: mnt_want_write_file(): fix special file handling 2009-08-07 10:39:56 -07:00
nfsctl.c
no-block.c
open.c CRED: Add some configurable debugging [try #6] 2009-09-02 21:29:01 +10:00
pipe.c lockdep: Fix lockdep annotation for pipe_double_lock() 2009-07-22 21:14:14 +02:00
pnode.c
pnode.h
posix_acl.c
read_write.c splice: implement default splice_read method 2009-05-11 14:13:10 +02:00
read_write.h
readdir.c
select.c poll/select: initialize triggered field of struct poll_wqueues 2009-08-15 18:40:11 -07:00
seq_file.c seq_file: add function to write binary data 2009-06-18 13:03:57 -07:00
signalfd.c
splice.c vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode 2009-09-14 17:08:15 +02:00
stack.c
stat.c kill vfs_stat_fd / vfs_lstat_fd 2009-04-20 23:02:52 -04:00
super.c writeback: switch to per-bdi threads for flushing data 2009-09-11 09:20:25 +02:00
sync.c fsync: wait for data writeout completion before calling ->fsync 2009-09-14 17:08:17 +02:00
timerfd.c
utimes.c
xattr_acl.c
xattr.c VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx. 2009-09-10 10:11:22 +10:00