linux/fs
Ross Zwisler fffa281b48 dax: fix deadlock due to misaligned PMD faults
In DAX there are two separate places where the 2MiB range of a PMD is
defined.

The first is in the page tables, where a PMD mapping inserted for a
given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
address to the 2MiB boundary above the address.

So, for example, a fault at address 3MiB (0x30 0000) falls within the
PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).

The second PMD range is in the mapping->page_tree, where a given file
offset is covered by a radix tree entry that spans from one 2MiB aligned
file offset to another 2MiB aligned file offset.

So, for example, the file offset for 3MiB (pgoff 768) falls within the
PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
512) to 4MiB (pgoff 1024).

This system works so long as the addresses and file offsets for a given
mapping both have the same offsets relative to the start of each PMD.

Consider the case where the starting address for a given file isn't 2MiB
aligned - say our faulting address is 3 MiB (0x30 0000), but that
corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
the mapping are misaligned so that the 2MiB range defined in the page
tables never matches up with the 2MiB range defined in the radix tree.

The current code notices this case for DAX faults to storage with the
following test in dax_pmd_insert_mapping():

	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
		goto unlock_fallback;

This test makes sure that the pfn we get from the driver is 2MiB
aligned, and relies on the assumption that the 2MiB alignment of the pfn
we get back from the driver matches the 2MiB alignment of the faulting
address.

However, faults to holes were not checked and we could hit the problem
described above.

This was reported in response to the NVML nvml/src/test/pmempool_sync
TEST5:

	$ cd nvml/src/test/pmempool_sync
	$ make TEST5

You can grab NVML here:

	https://github.com/pmem/nvml/

The dmesg warning you see when you hit this error is:

  WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310

Where we notice in dax_insert_mapping_entry() that the radix tree entry
we are about to replace doesn't match the locked entry that we had
previously inserted into the tree.  This happens because the initial
insertion was done in grab_mapping_entry() using a pgoff calculated from
the faulting address (vmf->address), and the replacement in
dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
vmf->pgoff.

In our failure case those two page offsets (one calculated from
vmf->address, one using vmf->pgoff) point to different order 9 radix
tree entries.

This failure case can result in a deadlock because the radix tree unlock
also happens on the pgoff calculated from vmf->address.  This means that
the locked radix tree entry that we swapped in to the tree in
dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
future faults to that 2MiB range will block forever.

Fix this by validating that the faulting address's PMD offset matches
the PMD offset from the start of the file.  This check is done at the
very beginning of the fault and covers faults that would have mapped to
storage as well as faults to holes.  I left the COLOUR check in
dax_pmd_insert_mapping() in place in case we ever hit the insanity
condition where the alignment of the pfn we get from the driver doesn't
match the alignment of the userspace address.

Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-25 16:12:46 -07:00
..
9p 9p: Implement show_options 2017-07-11 06:08:58 -04:00
adfs
affs affs: Implement show_options 2017-07-11 06:06:17 -04:00
afs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
autofs4 Fix up over-eager 'wait_queue_t' renaming 2017-07-10 11:40:19 -07:00
befs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
bfs bfs: fix sanity checks for empty files 2017-07-12 16:26:00 -07:00
btrfs Btrfs: fix blk_status_t/errno confusion 2017-08-24 17:19:02 +02:00
cachefiles sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming 2017-06-20 12:19:14 +02:00
ceph ceph: fix race in concurrent readdir 2017-07-17 14:54:59 +02:00
cifs Add wait_for_random_bytes() and get_random_*_wait() functions so that 2017-07-15 12:44:02 -07:00
coda fs: implement vfs_iter_write using do_iter_write 2017-06-29 17:49:23 -04:00
configfs configfs: Introduce config_item_get_unless_zero() 2017-06-12 13:20:20 +02:00
cramfs
crypto The first major feature for ext4 this merge window is the largedir 2017-07-09 09:31:22 -07:00
debugfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
devpts pty: Repair TIOCGPTPEER 2017-08-24 13:23:03 -07:00
dlm
ecryptfs ecryptfs: Convert to separately allocated bdi 2017-04-20 12:09:55 -06:00
efivarfs VFS: Kill off s_options and helpers 2017-07-11 06:09:21 -04:00
efs
exofs mm: drop "wait" parameter from write_one_page() 2017-07-05 18:44:22 -04:00
exportfs
ext2 ext2: preserve i_mode if ext2_set_acl() fails 2017-07-18 11:23:56 +02:00
ext4 ext4: add missing xattr hash update 2017-08-14 08:30:06 -04:00
f2fs f2fs: avoid cpu lockup 2017-07-17 19:23:18 -07:00
fat
freevxfs
fscache
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse 2017-08-11 11:20:48 -07:00
gfs2 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
hfs fs: semove set but not checked AOP_FLAG_UNINTERRUPTIBLE flag 2017-05-08 17:15:14 -07:00
hfsplus hfsplus: Don't clear SGID when inheriting ACLs 2017-07-18 18:23:39 +02:00
hostfs
hpfs
hugetlbfs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
isofs isofs: Fix off-by-one in 'session' mount option parsing 2017-07-18 12:33:16 +02:00
jbd2 Writeback error handling fixes (pile #2) 2017-07-07 19:38:17 -07:00
jffs2 jffs2: fix spelling mistake: "requestied" -> "requested" 2017-04-19 11:35:55 -07:00
jfs JFS fixes for 4.13 2017-07-25 08:51:57 -07:00
kernfs
lockd sunrpc: mark all struct svc_version instances as const 2017-07-13 15:58:03 -04:00
minix Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:50:54 -07:00
ncpfs mm: per-cgroup memory reclaim stats 2017-07-06 16:24:35 -07:00
nfs Some more NFS client bugfixes for 4.13 2017-08-11 13:54:09 -07:00
nfs_common
nfsd nfsd: Fix a memory scribble in the callback channel 2017-07-17 13:15:06 -04:00
nilfs2 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-07-03 13:08:04 -07:00
nls
notify dentry name snapshots 2017-07-07 20:09:10 -04:00
ntfs ntfs: Use ERR_CAST() to avoid cross-structure cast 2017-05-28 10:11:48 -07:00
ocfs2 ocfs2: don't clear SGID when inheriting ACLs 2017-08-02 17:16:13 -07:00
omfs omfs: Implement show_options 2017-07-06 03:31:46 -04:00
openpromfs
orangefs Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
overlayfs ovl: check for bad and whiteout index on lookup 2017-07-20 11:08:21 +02:00
proc mm: fix KSM data corruption 2017-08-10 15:54:07 -07:00
pstore Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
qnx4
qnx6
quota quota: correct space limit check 2017-08-07 16:51:28 +02:00
ramfs ramfs: Implement show_options 2017-07-06 03:31:46 -04:00
reiserfs reiserfs: preserve i_mode if __reiserfs_set_acl() fails 2017-07-18 11:24:08 +02:00
romfs
squashfs
sysfs
sysv mm: drop "wait" parameter from write_one_page() 2017-07-05 18:44:22 -04:00
tracefs VFS: Don't use save/replace_mount_options if not using generic_show_options 2017-07-06 03:31:46 -04:00
ubifs ubifs: Set double hash cookie also for RENAME_EXCHANGE 2017-07-14 22:50:57 +02:00
udf udf: Convert udf_disk_stamp_to_time() to use mktime64() 2017-06-14 11:21:02 +02:00
ufs Writeback error handling fixes (pile #1) 2017-07-07 18:39:15 -07:00
xfs xfs: don't leak quotacheck dquots when cow recovery 2017-08-17 12:40:33 -07:00
aio.c fs: add O_DIRECT and aio support for sending down write life time hints 2017-06-27 12:05:36 -06:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks 2017-08-16 20:32:02 +02:00
binfmt_em86.c
binfmt_flat.c binfmt_flat: Use %u to format u32 2017-07-16 09:24:05 -07:00
binfmt_misc.c fs: constify tree_descr arrays passed to simple_fill_super() 2017-04-26 23:54:06 -04:00
binfmt_script.c
block_dev.c Writeback error handling fixes (pile #2) 2017-07-07 19:38:17 -07:00
buffer.c fs/buffer.c: make bh_lru_install() more efficient 2017-07-10 16:32:30 -07:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c Merge branch 'work.__copy_in_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:15:02 -07:00
compat.c
coredump.c
dax.c dax: fix deadlock due to misaligned PMD faults 2017-08-25 16:12:46 -07:00
dcache.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
dcookies.c
direct-io.c fs: add O_DIRECT and aio support for sending down write life time hints 2017-06-27 12:05:36 -06:00
drop_caches.c
eventfd.c There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
eventpoll.c kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE 2017-07-12 16:26:01 -07:00
exec.c exec: Limit arg stack to at most 75% of _STK_LIM 2017-07-07 20:05:08 -07:00
fcntl.c vfs: fix flock compat thinko 2017-07-07 13:48:18 -07:00
fhandle.c
file_table.c fs: new infrastructure for writeback error handling and reporting 2017-07-06 07:02:25 -04:00
file.c fs/file.c: replace alloc_fdmem() with kvmalloc() alternative 2017-07-06 16:24:30 -07:00
filesystems.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
fs_pin.c sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming 2017-06-20 12:19:14 +02:00
fs_struct.c
fs-writeback.c writeback: rework wb_[dec|inc]_stat family of functions 2017-07-12 16:26:05 -07:00
inode.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:50:54 -07:00
internal.h Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-05-09 09:12:53 -07:00
ioctl.c
iomap.c iomap: fix integer truncation issues in the zeroing and dirtying helpers 2017-08-11 16:56:33 -07:00
Kconfig fs/Kconfig: kill CONFIG_PERCPU_RWSEM some more 2017-07-12 16:26:00 -07:00
Kconfig.binfmt
libfs.c fs: convert __generic_file_fsync to use errseq_t based reporting 2017-07-06 07:02:29 -04:00
locks.c fs/locks: pass kernel struct flock to fcntl_getlk/setlk 2017-05-27 06:07:19 -04:00
Makefile
mbcache.c ext4: xattr inode deduplication 2017-06-22 11:44:55 -04:00
mount.h Now that IPC and other changes have landed, enable manual markings for 2017-07-19 08:55:18 -07:00
mpage.c There has been a fair amount of activity in the docs tree this time 2017-07-03 21:13:25 -07:00
namei.c Now that IPC and other changes have landed, enable manual markings for 2017-07-19 08:55:18 -07:00
namespace.c Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-15 12:00:42 -07:00
no-block.c
nsfs.c VFS: Provide empty name qstr 2017-07-06 03:27:09 -04:00
open.c Writeback error handling fixes (pile #2) 2017-07-07 19:38:17 -07:00
pipe.c VFS: Provide empty name qstr 2017-07-06 03:27:09 -04:00
pnode.c mnt: Make propagate_umount less slow for overlapping mount propagation trees 2017-05-23 08:41:17 -05:00
pnode.h
posix_acl.c
proc_namespace.c
read_write.c Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-07 21:48:15 -07:00
readdir.c
select.c Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-06 20:57:13 -07:00
seq_file.c mm: introduce kv[mz]alloc helpers 2017-05-08 17:15:12 -07:00
signalfd.c sched/wait: Rename wait_queue_t => wait_queue_entry_t 2017-06-20 12:18:27 +02:00
splice.c fs: implement vfs_iter_write using do_iter_write 2017-06-29 17:49:23 -04:00
stack.c
stat.c ufs: restore maintaining ->i_blocks 2017-06-09 16:28:01 -04:00
statfs.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:50:54 -07:00
super.c VFS: Kill off s_options and helpers 2017-07-11 06:09:21 -04:00
sync.c fs: remove call_fsync helper function 2017-07-05 18:44:23 -04:00
timerfd.c timerfd: Use get_itimerspec64() and put_itimerspec64() 2017-06-30 04:14:38 -04:00
userfaultfd.c userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage 2017-08-10 15:54:07 -07:00
utimes.c
xattr.c treewide: use kv[mz]alloc* rather than opencoded variants 2017-05-08 17:15:13 -07:00