linux/fs
Mike Kravetz cb6acd01e2 hugetlbfs: fix races and page leaks during migration
hugetlb pages should only be migrated if they are 'active'.  The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.

When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked.  Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code.  This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.

To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.

Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available.  A program mmaps a 2G file in a hugetlbfs filesystem.  It
then migrates the pages associated with the file from one node to
another.  When the program exits, huge page counts are as follows:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  0       free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

That is as expected.  2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem.  If the file is then removed, the counts become:

  node0
  1024    free_hugepages
  1024    nr_hugepages

  node1
  1024    free_hugepages
  1024    nr_hugepages

  Filesystem                         Size  Used Avail Use% Mounted on
  nodev                              4.0G  2.0G  2.0G  50% /var/opt/hugepool

Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use.  The only way to 'fix' the filesystem
accounting is to unmount the filesystem

If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field.  At migration
time, this information is not preserved.  To fix, simply transfer
page_private from old to new page at migration time if necessary.

There is a related race with removing a huge page from a file and
migration.  When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page().  A page could be migrated
while in this state.  However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.

To fix that, check for this condition before migrating a huge page.  If
the condition is detected, return EBUSY for the page.

Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes: bcc5422230 ("mm: hugetlb: introduce page_huge_active")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: <stable@vger.kernel.org>
[mike.kravetz@oracle.com: v2]
  Link: http://lkml.kernel.org/r/7534d322-d782-8ac6-1c8d-a8dc380eb3ab@oracle.com
[mike.kravetz@oracle.com: update comment and changelog]
  Link: http://lkml.kernel.org/r/420bcfd6-158b-38e4-98da-26d0cd85bd01@oracle.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-01 09:02:33 -08:00
..
9p Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-11-01 19:58:52 -07:00
adfs adfs: use timespec64 for time conversion 2018-08-22 10:52:51 -07:00
affs affs: fix potential memory leak when parsing option 'prefix' 2018-05-28 12:36:41 +02:00
afs afs: Fix manually set volume location server list 2019-02-25 11:59:07 -08:00
autofs autofs: fix error return in autofs_fill_super() 2019-02-01 15:46:24 -08:00
befs fix a series of Documentation/ broken file name references 2018-06-15 18:10:01 -03:00
bfs bfs: extra sanity checking and static inode bitmap 2019-01-04 13:13:47 -08:00
btrfs for-5.0-rc4-tag 2019-02-03 08:48:33 -08:00
cachefiles fscache, cachefiles: remove redundant variable 'cache' 2018-11-30 16:00:58 +00:00
ceph ceph: avoid repeatedly adding inode to mdsc->snap_flush_list 2019-02-18 18:08:29 +01:00
cifs cifs: update internal module version number 2019-01-31 07:05:06 -06:00
coda vfs: change inode times to use struct timespec64 2018-06-05 16:57:31 -07:00
configfs configfs: fix registered group removal 2018-07-17 06:14:07 -07:00
cramfs Make the Cramfs code more robust against filesystem corruptions, 2018-10-30 12:46:25 -07:00
crypto fscrypt: add Adiantum support 2019-01-06 08:36:21 -05:00
debugfs debugfs: debugfs_lookup() should return NULL if not found 2019-01-30 12:39:49 +01:00
devpts devpts: Convert to new IDA API 2018-08-21 23:54:17 -04:00
dlm dlm: fix invalid cluster name warning 2018-12-03 15:30:24 -06:00
ecryptfs ecryptfs_rename(): verify that lower dentries are still OK after lock_rename() 2018-10-09 23:33:17 -04:00
efivarfs efivars: Call guid_parse() against guid_t type of variable 2018-07-22 14:13:44 +02:00
efs
exofs exofs_mount(): fix leaks on failure exits 2018-12-17 18:36:33 -05:00
exportfs exportfs: do not read dentry after free 2018-11-23 09:08:17 -05:00
ext2 \n 2018-12-27 17:00:35 -08:00
ext4 Revert "ext4: use ext4_write_inode() when fsyncing w/o a journal" 2019-01-31 23:41:11 -05:00
f2fs f2fs-for-4.21-rc1 2018-12-31 09:41:37 -08:00
fat Merge branch 'akpm' (patches from Andrew) 2019-01-05 09:16:18 -08:00
freevxfs freevxfs_lookup(): use d_splice_alias() 2018-05-22 14:27:51 -04:00
fscache fscache: fix race between enablement and dropping of object 2018-11-30 15:57:31 +00:00
fuse fuse: decrement NR_WRITEBACK_TEMP on the right page 2019-01-16 10:27:59 +01:00
gfs2 Revert "gfs2: read journal in large chunks to locate the head" 2019-02-14 09:52:51 -08:00
hfs hfs: do not free node before using 2018-11-30 14:56:14 -08:00
hfsplus hfsplus: return file attributes on statx 2019-01-04 13:13:47 -08:00
hostfs vfs: discard ATTR_ATTR_FLAG 2018-08-17 16:20:28 -07:00
hpfs hpfs: remove unnecessary checks on the value of r when assigning error code 2018-08-25 12:42:33 -07:00
hugetlbfs hugetlbfs: fix races and page leaks during migration 2019-03-01 09:02:33 -08:00
isofs Update email address 2018-09-29 22:47:48 -04:00
jbd2 jbd2: clean up indentation issue, replace spaces with tab 2018-12-04 00:20:10 -05:00
jffs2 jffs2: Fix use of uninitialized delayed_work, lockdep breakage 2018-12-02 09:20:34 +01:00
jfs jfs: remove redundant dquot_initialize() in jfs_evict_inode() 2018-09-20 09:28:49 -05:00
kernfs kernfs: Improve kernfs_notify() poll notification latency 2018-11-27 11:59:33 +01:00
lockd NFS client updates for Linux 4.21 2019-01-02 16:35:23 -08:00
minix minix_lookup: use d_splice_alias() 2018-05-22 14:27:52 -04:00
nfs Merge branch 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2019-02-20 09:09:33 -08:00
nfs_common
nfsd Revert "nfsd4: return default lease period" 2019-02-14 12:33:19 -05:00
nilfs2 nilfs2: Use xa_erase_irq 2018-11-05 14:57:05 -05:00
nls
notify inotify: Fix fd refcount leak in inotify_add_watch(). 2019-01-02 18:28:37 +01:00
ntfs mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
ocfs2 Merge branch 'akpm' (patches from Andrew) 2019-01-05 09:16:18 -08:00
omfs omfs_lookup(): report IO errors, use d_splice_alias() 2018-05-22 14:27:58 -04:00
openpromfs fs/openpromfs: Use of_node_name_eq for node name comparisons 2018-11-18 13:35:19 -08:00
orangefs fs: don't open code lru_to_page() 2019-01-04 13:13:48 -08:00
overlayfs Revert "ovl: relax permission checking on underlying layers" 2018-12-04 11:31:30 +01:00
proc proc, oom: do not report alien mms when setting oom_score_adj 2019-02-21 09:01:00 -08:00
pstore pstore/ram: Avoid allocation and leak of platform data 2019-01-20 14:44:52 -08:00
qnx4 qnx4_lookup: use d_splice_alias() 2018-05-22 14:27:52 -04:00
qnx6 qnx6_lookup: switch to d_splice_alias() 2018-05-22 14:27:54 -04:00
quota quota: Lock s_umount in exclusive mode for Q_XQUOTA{ON,OFF} quotactls. 2018-12-18 18:29:15 +01:00
ramfs
reiserfs reiserfs: remove workaround code for GCC 3.x 2018-10-31 08:54:14 -07:00
romfs romfs_lookup: switch to d_splice_alias() 2018-05-22 14:27:55 -04:00
squashfs Squashfs: Compute expected length from inode size rather than block length 2018-08-02 09:34:02 -07:00
sysfs sysfs: convert BUG_ON to WARN_ON 2019-01-07 08:53:32 +01:00
sysv sysv: return 'err' instead of 0 in __sysv_write_inode 2018-11-10 08:02:40 -05:00
tracefs tracefs: Annotate tracefs_ops with __ro_after_init 2018-07-31 11:32:44 -04:00
ubifs mm: migrate: drop unused argument of migrate_page_move_mapping() 2018-12-28 12:11:51 -08:00
udf \n 2018-12-27 17:00:35 -08:00
ufs fs/ufs: use ktime_get_real_seconds for sb and cg timestamps 2018-08-17 16:20:27 -07:00
xfs xfs: set buffer ops when repair probes for btree type 2019-02-03 14:03:59 -08:00
aio.c aio: initialize kiocb private in case any filesystems expect it. 2019-02-06 08:04:22 -07:00
anon_inodes.c anon_inode_getfile(): switch to alloc_file_pseudo() 2018-07-12 10:04:27 -04:00
attr.c fs: Fix attr.c kernel-doc 2018-07-03 16:44:45 -04:00
bad_inode.c get rid of 'opened' argument of ->atomic_open() - part 3 2018-07-12 10:04:20 -04:00
binfmt_aout.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
binfmt_elf_fdpic.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
binfmt_elf.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c turn filp_clone_open() into inline wrapper for dentry_open() 2018-07-10 23:29:03 -04:00
binfmt_script.c exec: load_script: Do not exec truncated interpreter path 2019-02-18 16:49:36 -08:00
block_dev.c blockdev: Fix livelocks on loop device 2019-01-15 07:30:56 -07:00
buffer.c fs: ratelimit __find_get_block_slow() failure message. 2019-02-06 12:58:56 -07:00
char_dev.c
compat_binfmt_elf.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
compat_ioctl.c media updates for v4.20-rc1 2018-10-29 14:29:58 -07:00
compat.c ncpfs: remove compat functionality 2018-06-05 19:23:26 +02:00
coredump.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
d_path.c
dax.c dax fix 4.21 2018-12-31 09:46:39 -08:00
dcache.c fs/dcache: Track & report number of negative dentries 2019-01-30 11:02:11 -08:00
dcookies.c
direct-io.c direct-io: allow direct writes to empty inodes 2019-01-22 08:26:44 -07:00
drop_caches.c fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() 2019-02-01 15:46:24 -08:00
eventfd.c Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL 2018-06-28 10:40:47 -07:00
eventpoll.c Merge branch 'akpm' (patches from Andrew) 2019-01-05 09:16:18 -08:00
exec.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-01-05 13:18:59 -08:00
fcntl.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
fhandle.c
file_table.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
file.c Char/Misc driver patches for 4.21-rc1 2018-12-28 20:54:57 -08:00
filesystems.c
fs_pin.c
fs_struct.c
fs-writeback.c writeback: synchronize sync(2) against cgroup writeback membership switches 2019-01-22 14:39:38 -07:00
inode.c Revert "mm: don't reclaim inodes with many attached pages" 2019-02-12 16:33:18 -08:00
internal.h overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
ioctl.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
iomap.c iomap: fix a use after free in iomap_dio_rw 2019-01-27 08:47:42 -08:00
Kconfig autofs: remove left-over autofs4 stubs 2018-06-11 08:22:34 -07:00
Kconfig.binfmt kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt 2018-08-02 08:06:55 +09:00
libfs.c
locks.c locks: fix error in locks_move_blocks() 2019-01-02 20:14:50 -05:00
Makefile autofs: remove left-over autofs4 stubs 2018-06-11 08:22:34 -07:00
mbcache.c treewide: kmalloc() -> kmalloc_array() 2018-06-12 16:19:22 -07:00
mount.h
mpage.c mpage: mpage_readpages() should submit IO as read-ahead 2018-08-17 16:20:29 -07:00
namei.c Revert "vfs: Allow userns root to call mknod on owned filesystems." 2018-12-22 14:18:34 -08:00
namespace.c Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" 2019-02-25 09:10:51 -08:00
no-block.c
nsfs.c
open.c overlayfs update for 4.19 2018-08-21 18:19:09 -07:00
pipe.c Merge branch 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-08-13 19:58:36 -07:00
pnode.c vfs: Suppress MS_* flag defs within the kernel unless explicitly enabled 2018-12-20 16:32:56 +00:00
pnode.h
posix_acl.c
proc_namespace.c
read_write.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
readdir.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
select.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
seq_file.c fs/seq_file.c: simplify seq_file iteration code and interface 2018-08-17 16:20:28 -07:00
signalfd.c signal: Distinguish between kernel_siginfo and siginfo 2018-10-03 16:47:43 +02:00
splice.c splice: don't read more than available pipe space 2018-12-04 08:50:49 -08:00
stack.c
stat.c y2038: Remove newstat family from default syscall set 2018-08-29 15:42:20 +02:00
statfs.c kernel: add kcompat_sys_{f,}statfs64() 2018-07-12 14:49:48 +01:00
super.c mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNT 2018-12-21 11:51:23 -05:00
sync.c
timerfd.c y2038: globally rename compat_time to old_time32 2018-08-27 14:48:48 +02:00
userfaultfd.c userfaultfd: clear flag if remap event not enabled 2018-12-28 12:11:51 -08:00
utimes.c y2038: utimes: Rework #ifdef guards for compat syscalls 2018-08-29 15:42:23 +02:00
xattr.c sysfs: Do not return POSIX ACL xattrs via listxattr 2018-09-18 07:30:48 -04:00