linux


Miklos Szeredi 3be5a52b30 fuse: support writable mmap	2008-04-30 08:29:50 -07:00

Quoting Linus (3 years ago, FUSE inclusion discussions):

  "User-space filesystems are hard to get right. I'd claim that they are almost impossible, unless you limit them somehow (shared writable mappings are the nastiest part - if you don't have those, you can reasonably limit your problems by limiting the number of dirty pages you accept through normal "write()" calls)."

Instead of attempting the impossible, I've just waited for the dirty page accounting infrastructure to materialize (thanks to Peter Zijlstra and others). This nicely solved the biggest problem: limiting the number of pages used for write caching.

Some small details remained, however, which this largish patch attempts to address. It provides a page writeback implementation for fuse, which is completely safe against VM related deadlocks. Performance may not be very good for certain usage patterns, but generally it should be acceptable.

It has been tested extensively with fsx-linux and bash-shared-mapping.

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with GFP_NOFS | __GFP_HIGHMEM. It copies the contents of the original page, and queues a WRITE request to the userspace filesystem using this temp page.

The writeback is finished instantly from the MM's point of view: the page is removed from the radix trees, and the PageDirty and PageWriteback flags are cleared.

For the duration of the actual write, the NR_WRITEBACK_TEMP counter is incremented. The per-bdi writeback count is not decremented until the actual write completes.

On dirtying the page, fuse waits for a previous write to finish before proceeding. This makes sure there can only be one temporary page used at a time for one cached page.

This approach is wasteful in both memory and CPU bandwidth, so why is this complication needed?

The basic problem is that there can be no guarantee about the time in which the userspace filesystem will complete a write. It may be buggy or even malicious, and fail to complete WRITE requests. We don't want unrelated parts of the system to grind to a halt in such cases.

Also a filesystem may need additional resources (particularly memory) to complete a WRITE request. There's a great danger of a deadlock if that allocation may wait for the writepage to finish.

Currently there are several cases where the kernel can block on page writeback:

  - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
  - page migration
  - throttle_vm_writeout (through NR_WRITEBACK)
  - sync(2)

Of course in some cases (fsync, msync) we explicitly want to allow blocking. So for these cases new code has to be added to fuse, since the VM is not tracking writeback pages for us any more.

As an extra safety measure, the maximum dirty ratio allocated to a single fuse filesystem is set to 1% by default. This way one (or several) buggy or malicious fuse filesystems cannot slow down the rest of the system by hogging dirty memory.

With appropriate privileges, this limit can be raised through '/sys/class/bdi/<bdi>/max_ratio'.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
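
The temporary-page scheme described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `struct toy_page`, `toy_writepage()`, `toy_write_done()`, `toy_try_dirty()` and `toy_nr_writeback_temp` are hypothetical names invented here, the page size is fixed at 4096 bytes, the per-bdi writeback count is not modelled, and where fuse would wait for the previous write to finish, the model simply refuses the operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096 /* assumed page size for this model */

/* Hypothetical stand-in for one cached page; not a kernel structure. */
struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    bool dirty;          /* models the PageDirty flag */
    bool writeback;      /* models the PageWriteback flag */
    unsigned char *temp; /* in-flight temporary copy, or NULL */
};

/* Models the NR_WRITEBACK_TEMP counter. */
static long toy_nr_writeback_temp;

/*
 * Models the writepage step: copy the page into a temporary buffer and
 * treat writeback as finished from the MM's point of view.
 * Returns 0 on success, -1 if a write is still in flight or the
 * allocation fails.
 */
static int toy_writepage(struct toy_page *page)
{
    if (page->temp != NULL)
        return -1; /* only one temporary copy per cached page */

    unsigned char *tmp = malloc(TOY_PAGE_SIZE);
    if (tmp == NULL)
        return -1;

    memcpy(tmp, page->data, TOY_PAGE_SIZE);
    page->temp = tmp;        /* the WRITE request would use this copy */
    page->dirty = false;     /* flags are cleared immediately */
    page->writeback = false;
    toy_nr_writeback_temp++; /* counted for the duration of the write */
    return 0;
}

/* Models the userspace filesystem completing the WRITE request. */
static void toy_write_done(struct toy_page *page)
{
    if (page->temp == NULL)
        return;
    assert(toy_nr_writeback_temp > 0);
    free(page->temp);
    page->temp = NULL;
    toy_nr_writeback_temp--;
}

/*
 * Models dirtying a page. The real code waits for a previous write to
 * finish; this model returns false instead of blocking.
 */
static bool toy_try_dirty(struct toy_page *page, unsigned char value)
{
    if (page->temp != NULL)
        return false;
    memset(page->data, value, TOY_PAGE_SIZE);
    page->dirty = true;
    return true;
}
```

In this model a caller dirties a page with `toy_try_dirty()`, hands it off with `toy_writepage()`, and can only dirty it again after `toy_write_done()` has released the temporary copy. The original page is never tied up by a slow or stuck write, which is the property the commit message is after, at the cost of one extra copy per write.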
9p	[PATCH] restore sane ->umount_begin() API	2008-04-25 09:23:25 -04:00
adfs	adfs: work around bogus sparse warning	2008-04-29 08:05:59 -07:00
affs	fs/affs/file.c: use BUG_ON	2008-04-29 08:06:02 -07:00
afs	afs: support the CB.ProbeUuid RPC op	2008-04-29 08:06:26 -07:00
autofs	mount options: fix autofs	2008-02-08 09:22:40 -08:00
autofs4	autofs4: fix sparse warning in root.c	2008-04-29 08:06:01 -07:00
befs	befs: fix sparse warning in linuxvfs.c	2008-04-29 08:05:59 -07:00
bfs	iget: stop BFS from using iget() and read_inode()	2008-02-07 08:42:27 -08:00
cifs	proc: remove proc_root_fs	2008-04-29 08:06:18 -07:00
coda	codafs: fix build warning	2008-04-29 08:06:04 -07:00
configfs	mm: bdi: add separate writeback accounting capability	2008-04-30 08:29:50 -07:00
cramfs	fs: Remove unnecessary inclusions of asm/semaphore.h	2008-04-18 22:16:44 -04:00
debugfs	debugfs: fix sparse warnings	2008-03-04 14:47:06 -08:00
devpts	devpts: factor out PTY index allocation	2008-04-30 08:29:48 -07:00
dlm	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm	2008-04-22 13:44:23 -07:00
ecryptfs	Remove duplicated unlikely() in IS_ERR()	2008-04-29 08:06:25 -07:00
efs	efs: update error msg to not refer to deleted read_inode()	2008-04-02 15:28:19 -07:00
exportfs
ext2	ext2: retry block allocation if new blocks are allocated from system zone	2008-04-28 08:58:43 -07:00
ext3	ext3: fix test ext_generic_write_end() copied return value	2008-04-29 22:01:27 -04:00
ext4	ext4: fix test ext_generic_write_end() copied return value	2008-04-29 22:01:18 -04:00
fat	fat: use get/put_unaligned_* helpers	2008-04-29 08:06:28 -07:00
freevxfs	fs/freevxfs/: proper externs	2008-04-29 08:06:00 -07:00
fuse	fuse: support writable mmap	2008-04-30 08:29:50 -07:00
gfs2	mm: remove nopage	2008-04-28 08:58:18 -07:00
hfs	hfs: handle match_strdup failure	2008-04-29 08:06:01 -07:00
hfsplus	trivial: fix user-visible typo in hfsplus	2008-04-29 13:23:21 -07:00
hostfs	uml: fix hostfs tv_usec calculations	2008-02-05 09:44:30 -08:00
hpfs	mount options: fix hpfs	2008-02-08 09:22:40 -08:00
hppfs	[PATCH] sanitize hppfs	2008-03-19 06:42:18 -04:00
hugetlbfs	mm: bdi: add separate writeback accounting capability	2008-04-30 08:29:50 -07:00
isofs	isofs: fix access to unallocated memory when reading corrupted filesystem	2008-04-30 08:29:33 -07:00
jbd	jbd: replace remaining __FUNCTION__ occurrences	2008-04-28 08:58:45 -07:00
jbd2	jbd2: use non-racy method for proc entries creation	2008-04-29 08:06:20 -07:00
jffs2	[JFFS2] Introduce dbg_readinode2 log level, use it to shut read_dnode() up	2008-04-23 16:43:15 +01:00
jfs	proc: remove proc_root_fs	2008-04-29 08:06:18 -07:00
lockd	locks: don't call ->copy_lock methods on return of conflicting locks	2008-04-25 13:00:11 -04:00
minix	iget: stop the MINIX filesystem from using iget() and read_inode()	2008-02-07 08:42:28 -08:00
msdos	fat: fat_notify_change() and check_mode() cleanup	2008-04-28 08:58:47 -07:00
ncpfs	ncpfs: use get/put_unaligned_* helpers	2008-04-29 08:06:28 -07:00
nfs	mm: bdi: expose the BDI object in sysfs for NFS	2008-04-30 08:29:49 -07:00
nfs_common
nfsd	nfsd: use proc_create to setup de->proc_fops	2008-04-29 08:06:20 -07:00
nls
ntfs	Remove duplicated unlikely() in IS_ERR()	2008-04-29 08:06:25 -07:00
ocfs2	mm: bdi: add separate writeback accounting capability	2008-04-30 08:29:50 -07:00
openpromfs	iget: stop OPENPROMFS from using iget() and read_inode()	2008-02-07 08:42:29 -08:00
partitions	fat: detect media without partition table correctly	2008-04-28 08:58:47 -07:00
proc	mm: Add NR_WRITEBACK_TEMP counter	2008-04-30 08:29:50 -07:00
qnx4	iget: stop QNX4 from using iget() and read_inode()	2008-02-07 08:42:28 -08:00
ramfs	mm: bdi: add separate writeback accounting capability	2008-04-30 08:29:50 -07:00
reiserfs	reiserfs: use non-racy method for proc entries creation	2008-04-29 08:06:20 -07:00
romfs	ROMFS: Fix up an error in iget removal	2008-03-19 18:53:36 -07:00
smbfs	NULL noise: fs/, mm/, kernel/*	2008-03-30 14:18:41 -07:00
sysfs	mm: bdi: add separate writeback accounting capability	2008-04-30 08:29:50 -07:00
sysv	iget: stop the SYSV filesystem from using iget() and read_inode()	2008-02-07 08:42:29 -08:00
udf	udf: fix sparse warning in namei.c	2008-04-28 08:58:46 -07:00
ufs	ufs: replace __inline with inline	2008-04-28 08:58:45 -07:00
vfat	fat: use __getname()	2008-04-28 08:58:47 -07:00
xfs	[XFS] Include linux/random.h in all builds, not just debug.	2008-04-30 07:53:50 -07:00
aio.c	aio: fix misleading comments	2008-04-29 08:06:29 -07:00
anon_inodes.c	[PATCH] fix up new filp allocators	2008-03-19 06:54:05 -04:00
attr.c
bad_inode.c	iget: introduce a function to register iget failure	2008-02-07 08:42:26 -08:00
binfmt_aout.c	fs/binfmt_aout.c: use printk_ratelimit()	2008-04-29 08:06:04 -07:00
binfmt_elf_fdpic.c	fdpic: check that the size returned by kernel_read() is what we asked for	2008-04-29 08:06:05 -07:00
binfmt_elf.c	elf: fix shadowed variables in fs/binfmt_elf.c	2008-04-29 08:06:16 -07:00
binfmt_em86.c	binfmt_misc.c: avoid potential kernel stack overflow	2008-04-29 08:06:04 -07:00
binfmt_flat.c	procfs task exe symlink	2008-04-29 08:06:17 -07:00
binfmt_misc.c	binfmt_misc.c: avoid potential kernel stack overflow	2008-04-29 08:06:04 -07:00
binfmt_script.c	binfmt_misc.c: avoid potential kernel stack overflow	2008-04-29 08:06:04 -07:00
binfmt_som.c	[PATCH] sanitize handling of shared descriptor tables in failing execve()	2008-04-25 09:23:53 -04:00
bio.c	block: add dma alignment and padding support to blk_rq_map_kern	2008-04-29 09:50:34 +02:00
block_dev.c	fs/block_dev.c: remove #if 0'ed code	2008-02-19 10:04:00 +01:00
buffer.c	make fs/buffer.c:cont_expand_zero() static	2008-04-29 08:06:01 -07:00
char_dev.c	fs: remove unused fops from struct char_device_struct	2008-04-29 08:06:01 -07:00
compat_binfmt_elf.c
compat_ioctl.c	tty: The big operations rework	2008-04-30 08:29:47 -07:00
compat.c	signals: use HAVE_SET_RESTORE_SIGMASK	2008-04-30 08:29:37 -07:00
dcache.c	[patch 2/7] vfs: mountinfo: add seq_file_root()	2008-04-23 00:04:38 -04:00
dcookies.c	d_path: Make d_path() use a struct path	2008-02-14 21:17:09 -08:00
direct-io.c	Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user	2008-02-05 09:44:13 -08:00
dnotify.c
dquot.c	quota: quota core changes for quotaon on remount	2008-04-28 08:58:33 -07:00
drop_caches.c	vfs: skip inodes without pages to free in drop_pagecache_sb()	2008-04-29 08:06:05 -07:00
eventfd.c	fs/eventfd.c should #include <linux/syscalls.h>	2008-02-06 10:41:03 -08:00
eventpoll.c	signals: use HAVE_SET_RESTORE_SIGMASK	2008-04-30 08:29:37 -07:00
exec.c	document de_thread() with exit_notify() connection	2008-04-30 08:29:38 -07:00
fcntl.c	[PATCH] sanitize locate_fd()	2008-04-25 09:24:05 -04:00
fifo.c
file_table.c	[PATCH] r/o bind mounts: debugging for missed calls	2008-04-19 00:29:28 -04:00
file.c	get rid of NR_OPEN and introduce a sysctl_nr_open	2008-02-06 10:41:06 -08:00
filesystems.c
fs-writeback.c	fs/fs-writeback.c: make 2 functions static	2008-04-29 08:06:00 -07:00
generic_acl.c
inode.c	fs/inode.c: use hlist_for_each_entry()	2008-04-29 08:06:06 -07:00
inotify_user.c	Remove duplicated unlikely() in IS_ERR()	2008-04-29 08:06:25 -07:00
inotify.c	inotify: remove debug code	2008-02-06 10:41:07 -08:00
internal.h	[PATCH] move a bunch of declarations to fs/internal.h	2008-04-21 23:11:01 -04:00
ioctl.c	make vfs_ioctl() static	2008-04-29 08:06:00 -07:00
ioprio.c
Kconfig	Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6	2008-04-24 11:46:16 -07:00
Kconfig.binfmt	make BINFMT_FLAT a bool	2008-04-29 08:06:01 -07:00
libfs.c	Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user	2008-02-05 09:44:13 -08:00
locks.c	Export __locks_copy_lock() so modular lockd builds	2008-04-25 15:49:46 -07:00
Makefile
mbcache.c	vfs: fix possible deadlock in ext2, ext3, ext4 when using xattrs	2008-04-15 19:35:41 -07:00
mpage.c	docbook: fix filesystems.tmpl source files	2008-03-03 10:47:13 -08:00
namei.c	cgroups: implement device whitelist	2008-04-29 08:06:09 -07:00
namespace.c	vfs: remove lives_below_in_same_fs()	2008-04-29 08:06:06 -07:00
nfsctl.c	Introduce path_put()	2008-02-14 21:13:33 -08:00
no-block.c
open.c	xip: support non-struct page backed memory	2008-04-28 08:58:23 -07:00
pipe.c	[PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe()	2008-04-22 19:54:57 -04:00
pnode.c	[patch 7/7] vfs: mountinfo: show dominating group id	2008-04-23 00:05:09 -04:00
pnode.h	[patch 7/7] vfs: mountinfo: show dominating group id	2008-04-23 00:05:09 -04:00
posix_acl.c
quota_v1.c	quota: do not allow setting of quota limits to too high values	2008-04-28 08:58:32 -07:00
quota_v2.c	quota: do not allow setting of quota limits to too high values	2008-04-28 08:58:32 -07:00
quota.c	quota: quota core changes for quotaon on remount	2008-04-28 08:58:33 -07:00
read_write.c	fs: use loff_t type instead of long long	2008-04-22 15:17:11 -07:00
read_write.h
readdir.c
select.c	signals: use HAVE_SET_RESTORE_SIGMASK	2008-04-30 08:29:37 -07:00
seq_file.c	[patch 2/7] vfs: mountinfo: add seq_file_root()	2008-04-23 00:04:38 -04:00
signalfd.c	signalfd: fix for incorrect SI_QUEUE user data reporting	2008-04-11 08:06:44 -07:00
splice.c	relay: fix splice problem	2008-04-29 09:48:15 +02:00
stack.c
stat.c	Introduce path_put()	2008-02-14 21:13:33 -08:00
super.c	make __put_super() static	2008-04-29 08:06:00 -07:00
sync.c	vfs: fix unconditional write_super() call in file_fsync()	2008-04-29 08:06:06 -07:00
timerfd.c	fs/timerfd.c should #include <linux/syscalls.h>	2008-04-29 08:06:01 -07:00
utimes.c	[PATCH] r/o bind mounts: elevate write count for do_utimes()	2008-04-19 00:29:24 -04:00
xattr_acl.c
xattr.c	xattr: add missing consts to function arguments	2008-04-29 08:06:06 -07:00