linux/fs
Dave Chinner 133eeb1747 xfs: don't use speculative prealloc for small files
Dedicated small file workloads have been seeing significant free
space fragmentation causing premature inode allocation failure
when large inode sizes are in use. A particular test case showed
that a workload that runs to a real ENOSPC on 256 byte inodes would
fail inode allocation with ENOSPC about about 80% full with 512 byte
inodes, and at about 50% full with 1024 byte inodes.

The same workload, when run with -o allocsize=4096 on 1024 byte
inodes would run to being 100% full before giving ENOSPC. That is,
no freespace fragmentation at all.

The issue was caused by the specific IO pattern the application had
- the framework it was using did not support direct IO, and so it
was emulating it by using fadvise(DONT_NEED). The result was that
the data was getting written back before the speculative prealloc
had been trimmed from memory by the close(), and so small single
block files were being allocated with 2 blocks, and then having one
truncated away. The result was lots of small 4k free space extents,
and hence each new 8k allocation would take another 8k from
contiguous free space and turn it into 4k of allocated space and 4k
of free space.

Hence inode allocation, which requires contiguous, aligned
allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
(1024 byte inodes) can fail to find sufficiently large freespace and
hence fail while there is still lots of free space available.

There's a simple fix for this, and one that has precendence in the
allocator code already - don't do speculative allocation unless the
size of the file is larger than a certain size. In this case, that
size is the minimum default preallocation size:
mp->m_writeio_blocks. And to keep with the concept of being nice to
people when the files are still relatively small, cap the prealloc
to mp->m_writeio_blocks until the file goes over a stripe unit is
size, at which point we'll fall back to the current behaviour based
on the last extent size.

This will effectively turn off speculative prealloc for very small
files, keep preallocation low for small files, and behave as it
currently does for any file larger than a stripe unit. This
completely avoids the freespace fragmentation problem this
particular IO pattern was causing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-06-27 13:27:37 -05:00
..
9p aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
adfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
affs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
afs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
autofs4 autofs - remove autofs dentry mount check 2013-05-06 13:06:59 -07:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-04-30 09:36:50 -07:00
bfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
btrfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2013-05-09 13:07:40 -07:00
cachefiles lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
ceph aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
cifs cifs: small variable name cleanup 2013-05-04 22:18:10 -05:00
coda lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
configfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
cramfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
debugfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
devpts fs: Limit sys_mount to only request filesystem modules (Part 2). 2013-03-07 01:08:55 -08:00
dlm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-05-01 14:08:52 -07:00
ecryptfs Improve performance when AES-NI (and most likely other crypto accelerators) is 2013-05-10 09:20:01 -07:00
efivarfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
efs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
exofs block: Add bio_for_each_segment_all() 2013-03-23 14:26:28 -07:00
exportfs hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
ext2 aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ext3 Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
ext4 Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
f2fs f2fs updates for v3.10 2013-05-08 15:11:48 -07:00
fat aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
freevxfs fs: Readd the fs module aliases. 2013-03-12 18:55:21 -07:00
fscache fs/fscache/stats.c: fix memory leak 2013-04-29 15:54:27 -07:00
fuse Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
gfs2 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
hfs Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
hfsplus aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
hostfs hostfs: use kmalloc instead of kzalloc 2013-05-04 15:48:45 -04:00
hpfs hpfs: move setting hpfs-private i_dirty to ->write_end() 2013-04-09 14:12:55 -04:00
hppfs hppfs: get rid of ->fsync() 2013-04-29 15:41:42 -04:00
hugetlbfs hugetlbfs: fix mmap failure in unaligned size request 2013-05-07 18:38:27 -07:00
isofs fs: Readd the fs module aliases. 2013-03-12 18:55:21 -07:00
jbd Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-05-03 09:56:25 -07:00
jbd2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
jffs2 fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
jfs Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
lockd LOCKD: Ensure that nlmclnt_block resets block->b_status after a server reboot 2013-04-21 18:08:42 -04:00
logfs block: Remove bi_idx references 2013-03-23 14:15:31 -07:00
minix fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
ncpfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
nfs More NFS client bugfixes for 3.10 2013-05-09 10:24:54 -07:00
nfs_common nfs_common: Update the translation between nfsv3 acls linux posix acls 2013-02-13 06:15:14 -08:00
nfsd Merge branch 'for-3.10' of git://linux-nfs.org/~bfields/linux 2013-05-10 09:28:55 -07:00
nilfs2 aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
nls
notify unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE 2013-05-09 13:46:38 -04:00
ntfs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ocfs2 aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
omfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
openpromfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
proc Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2013-05-07 08:42:20 -07:00
pstore Couple of pstore cleanups 2013-05-09 16:42:10 -07:00
qnx4 fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
qnx6 fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
quota quota: add missing use of dq_data_lock in __dquot_initialize 2013-03-11 22:05:56 +01:00
ramfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
reiserfs Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
romfs romfs: fix nommu map length to keep inside filesystem 2013-04-29 09:17:57 +10:00
squashfs fs: Limit sys_mount to only request filesystem modules. (Part 3) 2013-03-11 07:09:48 -07:00
sysfs sysfs: check if one entry has been removed before freeing 2013-04-05 15:35:52 -07:00
sysv fs: Readd the fs module aliases. 2013-03-12 18:55:21 -07:00
ubifs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
udf aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ufs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-04-30 09:36:50 -07:00
xfs xfs: don't use speculative prealloc for small files 2013-06-27 13:27:37 -05:00
aio.c aio: kill ki_retry 2013-05-07 19:46:02 -07:00
anon_inodes.c get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero 2013-02-26 02:46:11 -05:00
attr.c userns: Allow chown and setgid preservation 2012-11-20 04:17:24 -08:00
bad_inode.c lseek: the "whence" argument is called "whence" 2012-12-17 17:15:12 -08:00
binfmt_aout.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
binfmt_elf_fdpic.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2013-05-02 10:16:16 -07:00
binfmt_elf.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2013-05-02 10:16:16 -07:00
binfmt_em86.c exec: use -ELOOP for max recursion depth 2012-12-17 17:15:23 -08:00
binfmt_flat.c new helper: read_code() 2013-04-29 15:40:23 -04:00
binfmt_misc.c binfmt_misc: reuse string_unescape_inplace() 2013-04-30 17:04:03 -07:00
binfmt_script.c exec: do not leave bprm->interp on stack 2012-12-20 17:40:19 -08:00
binfmt_som.c get rid of pt_regs argument of ->load_binary() 2012-11-28 21:53:38 -05:00
bio-integrity.c bio-integrity: Add explicit field for owner of bip_buf 2013-03-23 14:26:34 -07:00
bio.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
block_dev.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
buffer.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c Removed unused typedef to avoid "unused local typedef" warnings. 2013-05-04 15:03:05 -04:00
compat.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
coredump.c do_coredump(): don't wait for thaw if coredump has already been interrupted 2013-05-04 14:45:54 -04:00
coredump.h
dcache.c vfs: use list_move instead of list_del/list_add 2013-05-04 15:43:02 -04:00
dcookies.c consolidate compat lookup_dcookie() 2013-03-03 23:00:23 -05:00
direct-io.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
drop_caches.c
eventfd.c fs, eventfd: add procfs fdinfo helper 2012-12-17 17:15:27 -08:00
eventpoll.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
exec.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
fcntl.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
fhandle.c Merge branch 'for-3.8' of git://linux-nfs.org/~bfields/linux 2012-12-20 14:04:11 -08:00
file_table.c cache the value of file_inode() in struct file 2013-03-01 19:48:30 -05:00
file.c don't bother with deferred freeing of fdtables 2013-05-01 17:31:42 -04:00
filesystems.c fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
fs_struct.c constify path_get/path_put and fs_struct.c stuff 2013-03-01 23:51:07 -05:00
fs-writeback.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
generic_acl.c
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
internal.h pipe: fold file_operations instances in one 2013-04-09 14:12:58 -04:00
ioctl.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
ioprio.c
Kconfig efivarfs: Move to fs/efivarfs 2013-04-17 13:25:09 +01:00
Kconfig.binfmt fs: make binfmt support for #! scripts modular and removable 2013-04-30 17:04:04 -07:00
libfs.c vfs: drop vmtruncate 2012-12-20 18:46:29 -05:00
locks.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
mbcache.c
mount.h get rid of full-hash scan on detaching vfsmounts 2013-04-09 14:12:52 -04:00
mpage.c
namei.c Merge git://git.infradead.org/users/eparis/audit 2013-05-11 14:29:11 -07:00
namespace.c create_mnt_ns: unidiomatic use of list_add() 2013-05-04 15:18:53 -04:00
no-block.c
open.c make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect 2013-03-03 22:58:33 -05:00
pipe.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
pnode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
pnode.h Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
posix_acl.c
proc_namespace.c
read_write.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
readdir.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
select.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
seq_file.c new helper: single_open_size() 2013-04-09 14:13:29 -04:00
signalfd.c switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE 2013-03-03 22:58:46 -05:00
splice.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
stack.c
stat.c switch vfs_getattr() to struct path 2013-02-26 02:46:08 -05:00
statfs.c vfs: fix user_statfs to retry once on ESTALE errors 2012-12-20 18:50:07 -05:00
super.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
sync.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
timerfd.c compat: restore timerfd settime and gettime compat syscalls 2013-03-02 09:35:13 -05:00
utimes.c vfs: allow utimensat() calls to retry once on an ESTALE error 2012-12-20 18:50:08 -05:00
xattr_acl.c
xattr.c vfs: make lremovexattr retry once on ESTALE error 2012-12-20 18:50:11 -05:00