linux

Author	SHA1	Message	Date
J. Bruce Fields	8838203667	nfsd: update workqueue creation No real change in functionality, but the old interface seems to be deprecated. We don't actually care about ordering necessarily, but we do depend on running at most one work item at a time: nfsd4_process_cb_update() assumes that no other thread is running it, and that no new callbacks are starting while it's running. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-14 11:13:43 -05:00
Theodore Ts'o	1566a48aaa	ext4: don't lock buffer in ext4_commit_super if holding spinlock If there is an error reported in mballoc via ext4_grp_locked_error(), the code is holding a spinlock, so ext4_commit_super() must not try to lock the buffer head, or else it will trigger a BUG: BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:358 in_atomic(): 1, irqs_disabled(): 0, pid: 993, name: mount CPU: 0 PID: 993 Comm: mount Not tainted 4.9.0-rc1-clouder1 #62 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 ffff880006423548 ffffffff81318c89 ffffffff819ecdd0 0000000000000166 ffff880006423558 ffffffff810810b0 ffff880006423580 ffffffff81081153 ffff880006e5a1a0 ffff88000690e400 0000000000000000 ffff8800064235c0 Call Trace: [<ffffffff81318c89>] dump_stack+0x67/0x9e [<ffffffff810810b0>] ___might_sleep+0xf0/0x140 [<ffffffff81081153>] __might_sleep+0x53/0xb0 [<ffffffff8126c1dc>] ext4_commit_super+0x19c/0x290 [<ffffffff8126e61a>] __ext4_grp_locked_error+0x14a/0x230 [<ffffffff81081153>] ? __might_sleep+0x53/0xb0 [<ffffffff812822be>] ext4_mb_generate_buddy+0x1de/0x320 Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0 (and it is the only caller which does so), avoid locking and unlocking the buffer in this case. This can result in races with ext4_commit_super() if there are other problems (which is what commit `4743f83990` was trying to address), but a Warning is better than BUG. Fixes: `4743f83990` Cc: stable@vger.kernel.org # 4.9 Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2016-11-13 22:02:29 -05:00
Theodore Ts'o	d0abb36db4	ext4: allow ext4_ext_truncate() to return an error Return errors to the caller instead of declaring the file system corrupted. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2016-11-13 22:02:28 -05:00
Theodore Ts'o	2c98eb5ea2	ext4: allow ext4_truncate() to return an error This allows us to properly propagate errors back up to ext4_truncate()'s callers. This also means we no longer have to silently ignore some errors (e.g., when trying to add the inode to the orphan inode list). Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2016-11-13 22:02:26 -05:00
Theodore Ts'o	6da22013bb	Merge branch 'fscrypt' into origin	2016-11-13 22:02:22 -05:00
Theodore Ts'o	a2f6d9c4c0	Merge branch 'dax-4.10-iomap-pmd' into origin	2016-11-13 22:02:15 -05:00
Eric Biggers	a6e0891286	fscrypto: don't use on-stack buffer for key derivation With the new (in 4.9) option to use a virtually-mapped stack (CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for the scatterlist crypto API because they may not be directly mappable to struct page. get_crypt_info() was using a stack buffer to hold the output from the encryption operation used to derive the per-file key. Fix it by using a heap buffer. This bug could most easily be observed in a CONFIG_DEBUG_SG kernel because this allowed the BUG in sg_set_buf() to be triggered. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-11-13 21:56:25 -05:00
Eric Biggers	08ae877f4e	fscrypto: don't use on-stack buffer for filename encryption With the new (in 4.9) option to use a virtually-mapped stack (CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for the scatterlist crypto API because they may not be directly mappable to struct page. For short filenames, fname_encrypt() was encrypting a stack buffer holding the padded filename. Fix it by encrypting the filename in-place in the output buffer, thereby making the temporary buffer unnecessary. This bug could most easily be observed in a CONFIG_DEBUG_SG kernel because this allowed the BUG in sg_set_buf() to be triggered. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-11-13 21:56:19 -05:00
David Gstir	9c4bb8a3a9	fscrypt: Let fs select encryption index/tweak Avoid re-use of page index as tweak for AES-XTS when multiple parts of same page are encrypted. This will happen on multiple (partial) calls of fscrypt_encrypt_page on same page. page->index is only valid for writeback pages. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-11-13 20:18:16 -05:00
David Gstir	0b93e1b94b	fscrypt: Constify struct inode pointer Some filesystems, such as UBIFS, maintain a const pointer for struct inode. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-11-13 20:18:01 -05:00
David Gstir	7821d4dd45	fscrypt: Enable partial page encryption Not all filesystems work on full pages, thus we should allow them to hand partial pages to fscrypt for en/decryption. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-11-13 18:55:21 -05:00
David Gstir	b50f7b268b	fscrypt: Allow fscrypt_decrypt_page() to function with non-writeback pages Some filesystem might pass pages which do not have page->mapping->host set to the encrypted inode. We want the caller to explicitly pass the corresponding inode. Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-11-13 18:53:10 -05:00
David Gstir	1c7dcf69ee	fscrypt: Add in-place encryption mode ext4 and f2fs require a bounce page when encrypting pages. However, not all filesystems will need that (eg. UBIFS). This is handled via a flag on fscrypt_operations where a fs implementation can select in-place encryption over using a bounce page (which is the default). Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-11-13 18:47:04 -05:00
Deepa Dinamani	362ad5d58e	fs: jfs: Replace CURRENT_TIME_SEC by current_time() jfs uses nanosecond granularity for filesystem timestamps. Only this assignment is not using nanosecond granularity. Use current_time() to get the right granularity. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2016-11-11 15:51:39 -06:00
Jens Axboe	bbd7bb7017	block: move poll code to blk-mq The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-11-11 13:40:25 -07:00
Joel Fernandes	d8991f51e5	pstore: Warn on PSTORE_TYPE_PMSG using deprecated function PMSG now uses ramoops_pstore_write_buf_user() instead of ...write_buf(). Print a ratelimited warning if it gets accidentally called. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: adjusted commit log and added -EINVAL return] Signed-off-by: Kees Cook <keescook@chromium.org>	2016-11-11 10:36:46 -08:00
Joel Fernandes	109704492e	pstore: Make spinlock per zone instead of global Currently pstore has a global spinlock for all zones. Since the zones are independent and modify different areas of memory, there's no need to have a global lock, so we should use a per-zone lock as introduced here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag introduced later, which splits the ftrace memory area into a single zone per CPU, it will eliminate the need for locking. In preparation for this, make the locking optional. Signed-off-by: Joel Fernandes <joelaf@google.com> [kees: updated commit message] Signed-off-by: Kees Cook <keescook@chromium.org>	2016-11-11 10:35:37 -08:00
Linus Torvalds	968ef8de55	Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "15 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: lib/stackdepot: export save/fetch stack for drivers mm: kmemleak: scan .data.ro_after_init memcg: prevent memcg caches to be both OFF_SLAB & OBJFREELIST_SLAB coredump: fix unfreezable coredumping task mm/filemap: don't allow partially uptodate page for pipes mm/hugetlb: fix huge page reservation leak in private mapping error paths ocfs2: fix not enough credit panic Revert "console: don't prefer first registered if DT specifies stdout-path" mm: hwpoison: fix thp split handling in memory_failure() swapfile: fix memory corruption via malformed swapfile mm/cma.c: check the max limit for cma allocation scripts/bloat-o-meter: fix SIGPIPE shmem: fix pageflags after swapping DMA32 object mm, frontswap: make sure allocated frontswap map is assigned mm: remove extra newline from allocation stall warning	2016-11-11 09:44:23 -08:00
Linus Torvalds	c5e4ca6da9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS fixes from Al Viro: "Christoph's and Jan's aio fixes, fixup for generic_file_splice_read (removal of pointless detritus that actually breaks it when used for gfs2 ->splice_read()) and fixup for generic_file_read_iter() interaction with ITER_PIPE destinations." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: splice: remove detritus from generic_file_splice_read() mm/filemap: don't allow partially uptodate page for pipes aio: fix freeze protection of aio writes fs: remove aio_run_iocb fs: remove the never implemented aio_fsync file operation aio: hold an extra file reference over AIO read/write operations	2016-11-11 09:19:01 -08:00
Linus Torvalds	ef091b3cef	Ceph's ->read_iter() implementation is incompatible with the new generic_file_splice_read() code that went into -rc1. Switch to the less efficient default_file_splice_read() for now; the proper fix is being held for 4.10. We also have a fix for a 4.8 regression and a trivial libceph fixup. Merge tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client Pull Ceph fixes from Ilya Dryomov: "Ceph's ->read_iter() implementation is incompatible with the new generic_file_splice_read() code that went into -rc1. Switch to the less efficient default_file_splice_read() for now; the proper fix is being held for 4.10. We also have a fix for a 4.8 regression and a trivial libceph fixup" * tag 'ceph-for-4.9-rc5' of git://github.com/ceph/ceph-client: libceph: initialize last_linger_id with a large integer libceph: fix legacy layout decode with pool 0 ceph: use default file splice read callback	2016-11-11 09:17:10 -08:00
Linus Torvalds	ef5beed998	NFS client bugfixes for Linux 4.9 Bugfixes: - Trim extra slashes in v4 nfs_paths to fix tools that use this - Fix a -Wmaybe-uninitialized warnings - Fix suspicious RCU usages - Fix Oops when mounting multiple servers at once - Suppress a false-positive pNFS error - Fix a DMAR failure in NFS over RDMA Merge tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client bugfixes from Anna Schumaker: "Most of these fix regressions in 4.9, and none are going to stable this time around. 
Bugfixes: - Trim extra slashes in v4 nfs_paths to fix tools that use this - Fix a -Wmaybe-uninitialized warnings - Fix suspicious RCU usages - Fix Oops when mounting multiple servers at once - Suppress a false-positive pNFS error - Fix a DMAR failure in NFS over RDMA" * tag 'nfs-for-4.9-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: xprtrdma: Fix DMAR failure in frwr_op_map() after reconnect fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use() NFS: Don't print a pNFS error if we aren't using pNFS NFS: Ignore connections that have cl_rpcclient uninitialized SUNRPC: Fix suspicious RCU usage NFSv4.1: work around -Wmaybe-uninitialized warning NFS: Trim extra slash in v4 nfs_path	2016-11-11 09:15:30 -08:00
Linus Torvalds	a4fac3b5d1	xfs: update for 4.9-rc5 In this update: o fix for aborting deferred transactions on filesystem shutdown. Merge tag 'xfs-fixes-for-linus-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fix from Dave Chinner: "This is a fix for an unmount hang (regression) when the filesystem is shutdown. It was supposed to go to you for -rc3, but I accidentally tagged the commit prior to it in that pullreq. Summary: - fix for aborting deferred transactions on filesystem shutdown" * tag 'xfs-fixes-for-linus-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: xfs: defer should abort intent items if the trans roll fails	2016-11-11 09:13:48 -08:00
Andrey Ryabinin	70d78fe7c8	coredump: fix unfreezable coredumping task It could be not possible to freeze coredumping task when it waits for 'core_state->startup' completion, because threads are frozen in get_signal() before they got a chance to complete 'core_state->startup'. Inability to freeze a task during suspend will cause suspend to fail. Also CRIU uses cgroup freezer during dump operation. So with an unfreezable task the CRIU dump will fail because it waits for a transition from 'FREEZING' to 'FROZEN' state which will never happen. Use freezer_do_not_count() to tell freezer to ignore coredumping task while it waits for core_state->startup completion. Link: http://lkml.kernel.org/r/1475225434-3753-1-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-11-11 08:12:37 -08:00
Junxiao Bi	d006c71f8a	ocfs2: fix not enough credit panic The following panic was caught when run ocfs2 disconfig single test (block size 512 and cluster size 8192). ocfs2_journal_dirty() return -ENOSPC, that means credits were used up. The total credit should include 3 times of "num_dx_leaves" from ocfs2_dx_dir_rebalance(), because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() -> ocfs2_dx_dir_format_cluster(). But only two times is included in ocfs2_dx_dir_rebalance_credits(), fix it. This can cause read-only fs(v4.1+) or panic for mainline linux depending on mount option. ------------[ cut here ]------------ kernel BUG at fs/ocfs2/journal.c:775! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000 RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] RSP: 0018:ffff8800a7d4b6d8 EFLAGS: 00010286 RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9 RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460 RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460 R10: 
0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070 FS: 00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0 Call Trace: ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2] ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2] ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2] ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2] ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2] ocfs2_mknod+0x5a2/0x1400 [ocfs2] ocfs2_create+0x73/0x180 [ocfs2] vfs_create+0xd8/0x100 lookup_open+0x185/0x1c0 do_last+0x36d/0x780 path_openat+0x92/0x470 do_filp_open+0x4a/0xa0 do_sys_open+0x11a/0x230 SyS_open+0x1e/0x20 system_call_fastpath+0x12/0x71 Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 RIP ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] ---[ end trace 91ac5312a6ee1288 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: http://lkml.kernel.org/r/1478248135-31963-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-11-11 08:12:37 -08:00
Al Viro	e519e77747	splice: remove detritus from generic_file_splice_read() i_size check is a leftover from the horrors that used to play with the page cache in that function. With the switch to ->read_iter(), it's neither needed nor correct - for gfs2 it ends up being buggy, since i_size is not guaranteed to be correct until later (inside ->read_iter()). Spotted-by: Abhi Das <adas@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-11-10 18:32:13 -05:00
Yan, Zheng	8a8d561766	ceph: use default file splice read callback Splice read/write implementation changed recently. When using generic_file_splice_read(), iov_iter with type == ITER_PIPE is passed to filesystem's read_iter callback. But ceph_sync_read() can't serve ITER_PIPE iov_iter correctly (ITER_PIPE iov_iter expects pages from page cache). Fixing ceph_sync_read() requires a big patch. So use default splice read callback for now. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-11-10 20:13:04 +01:00
Dave Chinner	0fc204e2eb	Merge branch 'xfs-4.10-misc-fixes-1' into for-next	2016-11-10 10:29:43 +11:00
Dave Chinner	8f23d318aa	Merge branch 'xfs-4.10-libxfs-cleanups' into for-next	2016-11-10 10:29:29 +11:00
Dave Chinner	b649c42e25	Merge branch 'dax-4.10-iomap-pmd' into for-next	2016-11-10 10:29:06 +11:00
Jan Kara	9484ab1bf4	dax: Introduce IOMAP_FAULT flag Introduce a flag telling iomap operations whether they are handling a fault or other IO. That may influence behavior wrt inode size and similar things. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-10 10:26:50 +11:00
Sebastian Andrzej Siewior	fc4d24c9b4	fs/buffer: Convert to hotplug state machine Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-fsdevel@vger.kernel.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161103145021.28528-2-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-11-09 23:45:25 +01:00
Brian Foster	98efe8af1c	xfs: fix unbalanced inode reclaim flush locking Filesystem shutdown testing on an older distro kernel has uncovered an imbalanced locking pattern for the inode flush lock in xfs_reclaim_inode(). Specifically, there is a double unlock sequence between the call to xfs_iflush_abort() and xfs_reclaim_inode() at the "reclaim:" label. This actually does not cause obvious problems on current kernels due to the current flush lock implementation. Older kernels use a counting based flush lock mechanism, however, which effectively breaks the lock indefinitely when an already unlocked flush lock is repeatedly unlocked. Though this only currently occurs on filesystem shutdown, it has reproduced the effect of elevating an fs shutdown to a system-wide crash or hang. As it turns out, the flush lock is not actually required for the reclaim logic in xfs_reclaim_inode() because by that time we have already cycled the flush lock once while holding ILOCK_EXCL. Therefore, remove the additional flush lock/unlock cycle around the 'reclaim:' label and update branches into this label to release the flush lock where appropriate. Add an assert to xfs_ifunlock() to help prevent future occurrences of the same problem. Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-10 08:23:22 +11:00
Linus Torvalds	3c6106da74	We recently refactored the Orangefs debugfs code. The refactor seemed to trigger dan.carpenter@oracle.com's static tester to find a possible double-free in the code. While designing the fix we saw a condition under which the buffer being freed could also be overflowed. We also realized how to rebuild the related debugfs file's "contents" (a string) without deleting and re-creating the file. This fix should eliminate the possible double-free, the potential overflow and improve code readability. Merge tag 'for-linus-4.9-rc4-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs fix from Mike Marshall: "We recently refactored the Orangefs debugfs code. The refactor seemed to trigger dan.carpenter@oracle.com's static tester to find a possible double-free in the code. While designing the fix we saw a condition under which the buffer being freed could also be overflowed. We also realized how to rebuild the related debugfs file's "contents" (a string) without deleting and re-creating the file. 
This fix should eliminate the possible double-free, the potential overflow and improve code readability" * tag 'for-linus-4.9-rc4-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: clean up debugfs	2016-11-09 11:36:43 -08:00
Darrick J. Wong	bec9d48d7a	xfs: check minimum block size for CRC filesystems Check the minimum block size on v5 filesystems. [dchinner: cleaned up XFS_MIN_CRC_BLOCKSIZE check] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-09 12:11:12 +11:00
Li Pengcheng	959217c84c	pstore: Actually give up during locking failure Without a return after the pr_err(), dumps will collide when two threads call pstore_dump() at the same time. Signed-off-by: Liu Hailong <liuhailong5@huawei.com> Signed-off-by: Li Pengcheng <lipengcheng8@huawei.com> Signed-off-by: Li Zhong <lizhong11@hisilicon.com> [kees: improved commit message] Signed-off-by: Kees Cook <keescook@chromium.org>	2016-11-08 16:44:33 -08:00
Eric Sandeen	5d829300be	xfs: provide helper for counting extents from if_bytes The open-coded pattern: ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t) is all over the xfs code; provide a new helper xfs_iext_count(ifp) to count the number of inline extents in an inode fork. [dchinner: pick up several missed conversions] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 12:59:42 +11:00
Eric Sandeen	4dfce57db6	xfs: fix up xfs_swap_extent_forks inline extent handling There have been several reports over the years of NULL pointer dereferences in xfs_trans_log_inode during xfs_fsr processes, when the process is doing an fput and tearing down extents on the temporary inode, something like: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 PID: 29439 TASK: ffff880550584fa0 CPU: 6 COMMAND: "xfs_fsr" [exception RIP: xfs_trans_log_inode+0x10] #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs] #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs] #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs] #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs] #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs] #14 [ffff8800a57bbe00] evict at ffffffff811e1b67 #15 [ffff8800a57bbe28] iput at ffffffff811e23a5 #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8 #17 [ffff8800a57bbe88] dput at ffffffff811dd06c #18 [ffff8800a57bbea8] __fput at ffffffff811c823b #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27 #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d As it turns out, this is because the i_itemp pointer, along with the d_ops pointer, has been overwritten with zeros when we tear down the extents during truncate. When the in-core inode fork on the temporary inode used by xfs_fsr was originally set up during the extent swap, we mistakenly looked at di_nextents to determine whether all extents fit inline, but this misses extents generated by speculative preallocation; we should be using if_bytes instead. 
This mistake corrupts the in-memory inode, and code in xfs_iext_remove_inline eventually gets bad inputs, causing it to memmove and memset incorrect ranges; this became apparent because the two values in ifp->if_u2.if_inline_ext[1] contained what should have been in d_ops and i_itemp; they were memmoved due to incorrect array indexing and then the original locations were zeroed with memset, again due to an array overrun. Fix this by properly using i_df.if_bytes to determine the number of extents, not di_nextents. Thanks to dchinner for looking at this with me and spotting the root cause. Cc: stable@vger.kernel.org Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 12:55:18 +11:00
Brian Foster	04197b341f	xfs: don't BUG() on mixed direct and mapped I/O We've had reports of generic/095 causing XFS to BUG() in __xfs_get_blocks() due to the existence of delalloc blocks on a direct I/O read. generic/095 issues a mix of various types of I/O, including direct and memory mapped I/O to a single file. This is clearly not supported behavior and is known to lead to such problems. E.g., the lack of exclusion between the direct I/O and write fault paths means that a write fault can allocate delalloc blocks in a region of a file that was previously a hole after the direct read has attempted to flush/inval the file range, but before it actually reads the block mapping. In turn, the direct read discovers a delalloc extent and cannot proceed. While the appropriate solution here is to not mix direct and memory mapped I/O to the same regions of the same file, the current BUG_ON() behavior is probably overkill as it can crash the entire system. Instead, localize the failure to the I/O in question by returning an error for a direct I/O that cannot be handled safely due to delalloc blocks. Be careful to allow the case of a direct write to post-eof delalloc blocks. This can occur due to speculative preallocation and is safe as post-eof blocks are not accompanied by dirty pages in pagecache (conversely, preallocation within eof must have been zeroed, and thus dirtied, before the inode size could have been increased beyond said blocks). Finally, provide an additional warning if a direct I/O write occurs while the file is memory mapped. This may not catch all problematic scenarios, but provides a hint that some known-to-be-problematic I/O methods are in use. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 12:54:14 +11:00
Brian Foster	399372349a	xfs: don't skip cow forks w/ delalloc blocks in cowblocks scan The cowblocks background scanner currently clears the cowblocks tag for inodes without any real allocations in the cow fork. This excludes inodes with only delalloc blocks in the cow fork. While we might never expect to clear delalloc blocks from the cow fork in the background scanner, it is not necessarily correct to clear the cowblocks tag from such inodes. For example, if the background scanner happens to process an inode between a buffered write and writeback, the scanner catches the inode in a state after delalloc blocks have been allocated to the cow fork but before the delalloc blocks have been converted to real blocks by writeback. The background scanner then incorrectly clears the cowblocks tag, even if part of the aforementioned delalloc reservation will not be remapped to the data fork (i.e., extra blocks due to the cowextsize hint). This means that any such additional blocks in the cow fork might never be reclaimed by the background scanner and could persist until the inode itself is reclaimed. To address this problem, only skip and clear inodes without any cow fork allocations whatsoever from the background scanner. While we generally do not want to cancel delalloc reservations from the background scanner, the pagecache dirty check following the cowblocks check should prevent that situation. If we do end up with delalloc cow fork blocks without a dirty address space mapping, this is probably an indication that something has gone wrong and the blocks should be reclaimed, as they may never be converted to a real allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 12:53:33 +11:00
Darrick J. Wong	4fd29ec472	xfs: check return value of _trans_reserve_quota_nblks Check the return value of xfs_trans_reserve_quota_nblks for errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:59:26 +11:00
Darrick J. Wong	5e52365ac8	xfs: move dir_ino_validate declaration per xfsprogs Move the declaration of _dir_ino_validate out of the private dir2 header file into the public one, since xfsprogs did that for the benefit of xfs_repair. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:59:12 +11:00
Eric Sandeen	e6fc6fcf44	xfs: don't call xfs_sb_quota_from_disk twice Source xfsprogs commit: ee3754254e8c186c99b6cdd4d59f741759d04acb Kernel commit `5ef828c4` ("xfs: avoid false quotacheck after unclean shutdown") made xfs_sb_from_disk() also call xfs_sb_quota_from_disk by default. However, when this was merged to libxfs, existing separate calls to libxfs_sb_quota_from_disk remained, and calling it twice in a row on a V4 superblock leads to issues, because: if (sbp->sb_qflags & XFS_PQUOTA_ACCT) { ... sbp->sb_pquotino = sbp->sb_gquotino; sbp->sb_gquotino = NULLFSINO; and after the second call, we have set both pquotino and gquotino to NULLFSINO. Fix this by making it safe to call twice, and also remove the extra calls to libxfs_sb_quota_from_disk. This is only spotted when running xfstests with "-m crc=0" because the sb_from_disk change came about after V5 became default, and the above behavior only exists on a V4 superblock. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:58:55 +11:00
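The double-call problem described in e6fc6fcf44 can be reproduced in a few lines of user-space C. This is only a sketch under stated assumptions: `struct demo_sb`, `DEMO_PQUOTA_ACCT` and `DEMO_NULLFSINO` are hypothetical stand-ins for the real superblock definitions, and the guard in `demo_quota_from_disk_safe()` is one plausible way to make the conversion idempotent, not necessarily the exact change in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; values are illustrative, not the on-disk ones. */
#define DEMO_NULLFSINO   ((uint64_t)-1)
#define DEMO_PQUOTA_ACCT 0x1u

struct demo_sb {
	uint32_t sb_qflags;
	uint64_t sb_gquotino;
	uint64_t sb_pquotino;
};

/* Mirrors the snippet quoted in the commit message: not safe to call twice. */
void demo_quota_from_disk_unsafe(struct demo_sb *sbp)
{
	if (sbp->sb_qflags & DEMO_PQUOTA_ACCT) {
		sbp->sb_pquotino = sbp->sb_gquotino;
		sbp->sb_gquotino = DEMO_NULLFSINO;
	}
}

/* Assumed fix: only move the inode number if it has not been moved yet. */
void demo_quota_from_disk_safe(struct demo_sb *sbp)
{
	if ((sbp->sb_qflags & DEMO_PQUOTA_ACCT) &&
	    sbp->sb_gquotino != DEMO_NULLFSINO) {
		sbp->sb_pquotino = sbp->sb_gquotino;
		sbp->sb_gquotino = DEMO_NULLFSINO;
	}
}

/* Calls each variant twice and checks what survives in sb_pquotino. */
void demo_quota_selfcheck(void)
{
	struct demo_sb a = { DEMO_PQUOTA_ACCT, 42, DEMO_NULLFSINO };
	struct demo_sb b = { DEMO_PQUOTA_ACCT, 42, DEMO_NULLFSINO };

	demo_quota_from_disk_unsafe(&a);
	demo_quota_from_disk_unsafe(&a);
	assert(a.sb_pquotino == DEMO_NULLFSINO); /* inode number lost */

	demo_quota_from_disk_safe(&b);
	demo_quota_from_disk_safe(&b);
	assert(b.sb_pquotino == 42);             /* inode number preserved */
}
```

The second unsafe call copies the already-cleared `sb_gquotino` over `sb_pquotino`, which is how both fields end up as the null inode value.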
Darrick J. Wong	523b2e76e3	libxfs: clean up _dir2_data_freescan Refactor the implementations of xfs_dir2_data_freescan into a routine that takes the raw directory block parameters and a second function that figures out the raw parameters from the directory inode. This enables us to use the exact same code for both userspace and the kernel, since repair knows exactly which directory block geometry parameters it needs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:56:51 +11:00
Darrick J. Wong	ae90b994b4	libxfs: fix xfs_attr_shortform_bytesfit declaration Change the xfs_attr_shortform_bytesfit declaration to have struct xfs_inode to avoid tripping up the libxfs-diff scanner. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:56:20 +11:00
Darrick J. Wong	68c098582b	libxfs: fix whitespace problems Fix some whitespace problems that trip up my libxfs-diff script. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:56:13 +11:00
Darrick J. Wong	420fbeb4bf	libxfs: synchronize dinode_verify with userspace The userspace version of _dinode_verify takes a raw inode number instead of an inode itself. Since neither version actually needs the inode, port the changes to the kernel. This will also reduce the libxfs diff noise. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:56:06 +11:00
Darrick J. Wong	755c7bf5dd	libxfs: convert ushort to unsigned short Since xfsprogs dropped ushort in favor of unsigned short, do that here too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:55:48 +11:00
Ross Zwisler	190b5caad7	dax: remove "depends on BROKEN" from FS_DAX_PMD Now that DAX PMD faults are once again working and are now participating in DAX's radix tree locking scheme, allow their config option to be enabled. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:35:16 +11:00
Ross Zwisler	862f1b9d67	xfs: use struct iomap based DAX PMD fault path Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and improved dax_iomap_pmd_fault(). Also, now that it has no more users, remove xfs_get_blocks_dax_fault(). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:35:02 +11:00
Ross Zwisler	642261ac99	dax: add struct iomap based DAX PMD support DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based locking. This patch allows DAX PMDs to participate in the DAX radix tree based locking scheme so that they can be re-enabled using the new struct iomap based fault handlers. There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX mappings that have an associated block allocation, and 4k DAX empty entries. The empty entries exist to provide locking for the duration of a given page fault. This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP) entries, PMD DAX entries that have associated block allocations, and 2 MiB DAX empty entries. Unlike the 4k case where we insert a struct page* into the radix tree for 4k zero pages, for HZP we insert a DAX exceptional entry with the new RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in every 2MiB hole mapping, and it doesn't make sense to have that same struct page* with multiple entries in multiple trees. This would cause contention on the single page lock for the one Huge Zero Page, and it would break the page->index and page->mapping associations that are assumed to be valid in many other places in the kernel. One difficult use case is when one thread is trying to use 4k entries in radix tree for a given offset, and another thread is using 2 MiB entries for that same offset. The current code handles this by making the 2 MiB user fall back to 4k entries for most cases. This was done because it is the simplest solution, and because the use of 2MiB pages is already opportunistic. If we were to try to upgrade from 4k pages to 2MiB pages for a given range, we run into the problem of how we lock out 4k page faults for the entire 2MiB range while we clean out the radix tree so we can insert the 2MiB entry. 
We can solve this problem if we need to, but I think that the cases where both 2MiB entries and 4K entries are being used for the same range will be rare enough and the gain small enough that it probably won't be worth the complexity. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:34:45 +11:00
Ross Zwisler	422476c464	dax: move put_(un)locked_mapping_entry() in dax.c No functional change. The static functions put_locked_mapping_entry() and put_unlocked_mapping_entry() will soon be used in error cases in grab_mapping_entry(), so move their definitions above this function. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:33:44 +11:00
Ross Zwisler	fa28f7296a	dax: move RADIX_DAX_* defines to dax.h The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in mm/filemap.c. When we add PMD support, though, mm/filemap.c will also need access to the RADIX_DAX_PTE type so it can properly construct a 4k sized empty entry. Instead of shifting the defines between dax.c and dax.h as they are individually used in other code, just move them wholesale to dax.h so they'll be available when we need them. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:33:35 +11:00
Ross Zwisler	1550290b08	dax: dax_iomap_fault() needs to call iomap_end() Currently iomap_end() doesn't do anything for DAX page faults for both ext2 and XFS. ext2_iomap_end() just checks for a write underrun, and xfs_file_iomap_end() checks to see if it needs to finish a delayed allocation. However, in the future iomap_end() calls might be needed to make sure we have balanced allocations, locks, etc. So, add calls to iomap_end() with appropriate error handling to dax_iomap_fault(). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:33:26 +11:00
Ross Zwisler	333ccc978e	dax: add dax_iomap_sector() helper function To be able to correctly calculate the sector from a file position and a struct iomap there is a complex little bit of logic that currently happens in both dax_iomap_actor() and dax_iomap_fault(). This will need to be repeated yet again in the DAX PMD fault handler when it is added, so break it out into a helper function. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:33:09 +11:00
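The sector calculation that 333ccc978e factors out can be sketched as follows. The struct and macro names (`demo_iomap`, `DEMO_PAGE_MASK`) are hypothetical, and the sketch assumes 4 KiB pages, 512-byte sectors, and an extent whose starting file offset is page aligned and not beyond the page containing `pos`; the in-kernel helper may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))

/* Hypothetical, reduced view of an extent mapping. */
struct demo_iomap {
	uint64_t blkno;  /* first 512-byte sector of the extent on the device */
	uint64_t offset; /* file offset in bytes at which the extent starts */
};

/* Sector backing the page that contains file position pos. */
uint64_t demo_iomap_sector(const struct demo_iomap *iomap, uint64_t pos)
{
	/* Round pos down to its page, take the byte distance into the
	 * extent, convert bytes to 512-byte sectors, and add the base. */
	return iomap->blkno + (((pos & DEMO_PAGE_MASK) - iomap->offset) >> 9);
}

void demo_sector_selfcheck(void)
{
	struct demo_iomap m = { 1000, 8192 };

	assert(demo_iomap_sector(&m, 8192) == 1000);  /* first page of extent */
	assert(demo_iomap_sector(&m, 12388) == 1008); /* one page (8 sectors) in */
}
```

Keeping this arithmetic in one function means the read/write path and both fault handlers cannot drift apart in how they round the position.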
Ross Zwisler	11c59c92f4	dax: correct dax iomap code namespace The recently added DAX functions that use the new struct iomap data structure were named iomap_dax_rw(), iomap_dax_fault() and iomap_dax_actor(). These are actually defined in fs/dax.c, though, so should be part of the "dax" namespace and not the "iomap" namespace. Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor() respectively. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:32:46 +11:00
Ross Zwisler	b9fde0462e	dax: remove dax_pmd_fault() dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX fault handler. This fault handler has been disabled for several kernel releases, and support for PMDs will be reintroduced using the struct iomap interface instead. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:32:35 +11:00
Ross Zwisler	63e95b5c4f	dax: coordinate locking for offsets in PMD range DAX radix tree locking currently locks entries based on the unique combination of the 'mapping' pointer and the pgoff_t 'index' for the entry. This works for PTEs, but as we move to PMDs we will need to have all the offsets within the range covered by the PMD to map to the same bit lock. To accomplish this, for ranges covered by a PMD entry we will instead lock based on the page offset of the beginning of the PMD entry. The 'mapping' pointer is still used in the same way. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:32:20 +11:00
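The locking rule in 63e95b5c4f amounts to rounding the page index down to the start of the PMD before it is used as a lock key. A minimal sketch, assuming 4 KiB pages and 2 MiB PMDs (512 pages per PMD); `demo_lock_index()` and the `DEMO_*` macros are illustrative names, not the kernel's:

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT    12 /* 4 KiB pages (assumption) */
#define DEMO_PMD_SHIFT     21 /* 2 MiB PMD entries (assumption) */
#define DEMO_PAGES_PER_PMD (1UL << (DEMO_PMD_SHIFT - DEMO_PAGE_SHIFT))

/* Page index to use as the lock key for a given entry. For offsets
 * covered by a PMD entry, every index in the 2 MiB range collapses to
 * the index of the first page, so all of them share one lock. */
unsigned long demo_lock_index(unsigned long index, int covered_by_pmd)
{
	if (covered_by_pmd)
		index &= ~(DEMO_PAGES_PER_PMD - 1);
	return index;
}

void demo_lock_index_selfcheck(void)
{
	assert(demo_lock_index(513, 1) == 512);   /* inside second PMD range */
	assert(demo_lock_index(1023, 1) == 512);  /* last page of that range */
	assert(demo_lock_index(1024, 1) == 1024); /* start of the next range */
	assert(demo_lock_index(513, 0) == 513);   /* PTE entries are unchanged */
}
```

As the commit notes, the 'mapping' pointer still forms the other half of the key; only the index part changes for PMD-covered ranges.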
Ross Zwisler	e3ad61c64a	dax: consistent variable naming for DAX entries No functional change. Consistently use the variable name 'entry' instead of 'ret' for DAX radix tree entries. This was already happening in most of the code, so update get_unlocked_mapping_entry(), grab_mapping_entry() and dax_unlock_mapping_entry(). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:32:12 +11:00
Ross Zwisler	aada54f980	dax: remove the last BUG_ON() from fs/dax.c Don't take down the kernel if we get an invalid 'from' and 'length' argument pair. Just warn once and return an error. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:32:00 +11:00
Ross Zwisler	ce95ab0fa6	dax: make 'wait_table' global variable static The global 'wait_table' variable is only used within fs/dax.c, and generates the following sparse warning: fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static? Make it static so it has scope local to fs/dax.c, and to make sparse happy. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:31:44 +11:00
Ross Zwisler	03e0990fc8	ext2: remove support for DAX PMD faults DAX PMD support was added via the following commit: commit `e7b1ea2ad6` ("ext2: huge page fault support") I believe this path to be untested as ext2 doesn't reliably provide block allocations that are aligned to 2MiB. In my testing I've been unable to get ext2 to actually fault in a PMD. It always fails with a "pfn unaligned" message because the sector returned by ext2_get_block() isn't aligned. I've tried various settings for the "stride" and "stripe_width" extended options to mkfs.ext2, without any luck. Since we can't reliably get PMDs, remove support so that we don't have an untested code path that we may someday traverse when we happen to get an aligned block allocation. This should also make 4k DAX faults in ext2 a bit faster since they will no longer have to call the PMD fault handler only to get a response of VM_FAULT_FALLBACK. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:31:33 +11:00
Ross Zwisler	fa0d3fce7c	dax: remove buffer_size_valid() Now that ext4 properly sets bh.b_size when we call get_block() for a hole, rely on that value and remove the buffer_size_valid() sanity check. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:31:14 +11:00
Ross Zwisler	547edce3ba	ext4: tell DAX the size of allocation holes When DAX calls _ext4_get_block() and the file offset points to a hole we currently don't set bh->b_size. This is currently worked around via buffer_size_valid() in fs/dax.c. _ext4_get_block() has the hole size information from ext4_map_blocks(), so populate bh->b_size so we can remove buffer_size_valid() in a later patch. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-11-08 11:30:58 +11:00
Shuah Khan	0ac84b72c0	fs/nfs: Fix used uninitialized warn in nfs4_slot_seqid_in_use() Fix the following warn: fs/nfs/nfs4session.c: In function ‘nfs4_slot_seqid_in_use’: fs/nfs/nfs4session.c:203:54: warning: ‘cur_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized] if (nfs4_slot_get_seqid(tbl, slotid, &cur_seq) == 0 && ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~ cur_seq == seq_nr && test_bit(slotid, tbl->used_slots)) ~~~~~~~~~~~~~~~~~ Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-11-07 16:11:30 -05:00
Anna Schumaker	192747166a	NFS: Don't print a pNFS error if we aren't using pNFS We used to check for a valid layout type id before verifying pNFS flags as an indicator for if we are using pNFS. This changed in `3132e49ece` with the introduction of multiple layout types, since now we are passing an array of ids instead of just one. Since then, users have been seeing a KERN_ERR printk show up whenever mounting NFS v4 without pNFS. This patch restores the original behavior of exiting set_pnfs_layoutdriver() early if we aren't using pNFS. Fixes `3132e49ece` ("pnfs: track multiple layout types in fsinfo structure") Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-11-07 16:11:30 -05:00
Petr Vandrovec	8ef3295530	NFS: Ignore connections that have cl_rpcclient uninitialized cl_rpcclient starts as ERR_PTR(-EINVAL), and connections like that are floating freely through the system. Most places check whether the pointer is valid before dereferencing it, but newly added code in nfs_match_client does not, which causes crashes when more than one NFS mount point is present. Signed-off-by: Petr Vandrovec <petr@vandrovec.name> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-11-07 16:11:29 -05:00
Mike Marshall	dc0336214e	orangefs: clean up debugfs We recently refactored the Orangefs debugfs code. The refactor seemed to trigger dan.carpenter@oracle.com's static tester to find a possible double-free in the code. While designing the fix we saw a condition under which the buffer being freed could also be overflowed. We also realized how to rebuild the related debugfs file's "contents" (a string) without deleting and re-creating the file. This fix should eliminate the possible double-free, the potential overflow and improve code readability. Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Martin Brandenburg <martin@omnibond.com>	2016-11-07 10:41:55 -05:00
Linus Torvalds	fb415f222c	Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux Pull nfsd bugfixes from Bruce Fields: "Fixes for some recent regressions including fallout from the vmalloc'd stack change (after which we can no longer encrypt stuff on the stack)" * tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux: nfsd: Fix general protection fault in release_lock_stateid() svcrdma: backchannel cannot share a page for send and rcv buffers sunrpc: fix some missing rq_rbuffer assignments sunrpc: don't pass on-stack memory to sg_set_buf nfsd: move blocked lock handling under a dedicated spinlock	2016-11-04 20:12:10 -07:00
Linus Torvalds	46d7cbb2c4	Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from Chris Mason: "Some fixes that Dave Sterba collected. We held off on these last week because I was focused on the memory corruption testing" * 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix WARNING in btrfs_select_ref_head() Btrfs: remove some no-op casts btrfs: pass correct args to btrfs_async_run_delayed_refs() btrfs: make file clone aware of fatal signals btrfs: qgroup: Prevent qgroup->reserved from going subzero Btrfs: kill BUG_ON in do_relocation	2016-11-04 20:08:16 -07:00
Linus Torvalds	bd30fac18f	Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Fix two more POSIX ACL bugs introduced in 4.8 and add a missing fsync during copy up to prevent possible data loss" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fsync after copy-up ovl: fix get_acl() on tmpfs ovl: update S_ISGID when setting posix ACLs	2016-11-04 20:03:14 -07:00
Jan Kara	ce98321bf7	fs: Remove unmap_underlying_metadata Nobody is using this function anymore. Remove it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-04 14:34:47 -06:00
Jan Kara	e64855c6cf	fs: Add helper to clean bdev aliases under a bh and use it Add a helper function that clears buffer heads from a block device aliasing passed bh. Use this helper function from filesystems instead of the original unmap_underlying_metadata() to save some boilerplate code and also have a better name for the functionality, since it has not been unmapping anything for a long time. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-04 14:34:47 -06:00
Jan Kara	69a9bea146	ext2: Use clean_bdev_aliases() instead of iteration Use clean_bdev_aliases() instead of iterating through blocks one by one. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-04 14:34:47 -06:00
Jan Kara	64e1c57fa4	ext4: Use clean_bdev_aliases() instead of iteration Use clean_bdev_aliases() instead of iterating through blocks one by one. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-04 14:34:47 -06:00
Jan Kara	f734c89cc9	direct-io: Use clean_bdev_aliases() instead of handmade iteration Use new provided function instead of an iteration through all allocated blocks. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-04 14:34:47 -06:00
Jan Kara	29f3ad7d83	fs: Provide function to unmap metadata for a range of blocks Provide function equivalent to unmap_underlying_metadata() for a range of blocks. We somewhat optimize the function to use pagevec lookups instead of looking up buffer heads one by one and use page lock to pin buffer heads instead of mapping's private_lock to improve scalability. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-04 14:34:47 -06:00
Alexey Dobriyan	b4eb4f7f1a	audit: less stack usage for /proc/*/loginuid %u requires 10 characters at most not 20. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2016-11-03 17:20:00 -04:00
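The "10 characters at most" figure in b4eb4f7f1a follows from the largest 32-bit unsigned value, 4294967295, which has ten decimal digits. A small user-space check, assuming the value being printed is 32 bits wide (the helper name `demo_max_u32_digits()` is made up for illustration):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Number of characters needed to print the largest 32-bit unsigned
 * value in decimal, not counting the terminating NUL. */
int demo_max_u32_digits(void)
{
	char buf[11]; /* 10 digits + NUL is enough for any uint32_t */

	return snprintf(buf, sizeof(buf), "%" PRIu32, (uint32_t)UINT32_MAX);
}

void demo_digits_selfcheck(void)
{
	assert(demo_max_u32_digits() == 10);
}
```

A 20-character buffer would only be needed for a 64-bit value, whose maximum has twenty decimal digits.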
Jens Axboe	7637241e65	writeback: add wbc_to_write_flags() Add wbc_to_write_flags(), which returns the write modifier flags to use, based on a struct writeback_control. No functional changes in this patch, but it prepares us for factoring other wbc fields for write type. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-11-02 10:24:03 -06:00
Chris Mason	e3597e6090	Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9	2016-11-01 12:54:45 -07:00
J. Bruce Fields	e864c189e1	nfsd: catch errors in decode_fattr earlier `3c8e03166a` "NFSv4: do exact check about attribute specified" fixed some handling of unsupported-attribute errors, but it also delayed checking for unwriteable attributes till after we decode them. This could lead to odd behavior in the case a client attempts to set an attribute we don't know about followed by one we try to parse. In that case the parser for the known attribute will attempt to parse the unknown attribute. It should fail in some safe way, but the error might at least be incorrect (probably bad_xdr instead of inval). So, it's better to do that check at the start. As far as I know this doesn't cause any problems with current clients but it might be a minor issue e.g. if we encounter a future client that supports a new attribute that we currently don't. Cc: Yu Zhiguo <yuzg@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-01 15:47:52 -04:00
J. Bruce Fields	916d2d844a	nfsd: clean up supported attribute handling Minor cleanup, no change in behavior. Provide helpers for some common attribute bitmap operations. Drop some comments that just echo the code. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-01 15:47:52 -04:00
Jeff Layton	851238a22f	nfsd: fix error handling for clients that fail to return the layout Currently, when the client continually returns NFS4ERR_DELAY on a CB_LAYOUTRECALL, we'll give up trying to retransmit after two lease periods, but leave the layout in place. What we really need to do here is fence the client in this case. Have it fall through to that code in that case instead of into the NFS4ERR_NOMATCHING_LAYOUT case. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-01 15:47:43 -04:00
Jeff Layton	8f97514b42	nfsd: more robust allocation failure handling in nfsd_reply_cache_init Currently, we try to allocate the cache as a single, large chunk, which can fail if no big chunks of memory are available. We _do_ try to size it according to the amount of memory in the box, but if the server is started well after boot time, then the allocation can fail due to memory fragmentation. Fall back to doing a vzalloc if the kcalloc fails, and switch the shutdown code to do a kvfree to handle freeing correctly. Reported-by: Olaf Hering <olaf@aepfle.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-01 15:47:43 -04:00
Chuck Lever	f46c445b79	nfsd: Fix general protection fault in release_lock_stateid() When I push NFSv4.1 / RDMA hard, (xfstests generic/089, for example), I get this crash on the server: Oct 28 22:04:30 klimt kernel: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC Oct 28 22:04:30 klimt kernel: Modules linked in: cts rpcsec_gss_krb5 iTCO_wdt iTCO_vendor_support sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm btrfs irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd xor pcspkr raid6_pq i2c_i801 i2c_smbus lpc_ich mfd_core sg mei_me mei ioatdma shpchp wmi ipmi_si ipmi_msghandler rpcrdma ib_ipoib rdma_ucm acpi_power_meter acpi_pad ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb ahci libahci ptp mlx4_core pps_core dca libata i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod Oct 28 22:04:30 klimt kernel: CPU: 7 PID: 1558 Comm: nfsd Not tainted 4.9.0-rc2-00005-g82cd754 #8 Oct 28 22:04:30 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 Oct 28 22:04:30 klimt kernel: task: ffff880835c3a100 task.stack: ffff8808420d8000 Oct 28 22:04:30 klimt kernel: RIP: 0010:[<ffffffffa05a759f>] [<ffffffffa05a759f>] release_lock_stateid+0x1f/0x60 [nfsd] Oct 28 22:04:30 klimt kernel: RSP: 0018:ffff8808420dbce0 EFLAGS: 00010246 Oct 28 22:04:30 klimt kernel: RAX: ffff88084e6660f0 RBX: ffff88084e667020 RCX: 0000000000000000 Oct 28 22:04:30 klimt kernel: RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff88084e667020 Oct 28 22:04:30 klimt kernel: RBP: ffff8808420dbcf8 R08: 0000000000000001 R09: 0000000000000000 Oct 28 22:04:30 klimt kernel: R10: ffff880835c3a100 R11: ffff880835c3aca8 R12: 6b6b6b6b6b6b6b6b Oct 28 22:04:30 klimt kernel: R13: 
ffff88084e6670d8 R14: ffff880835f546f0 R15: ffff880835f1c548 Oct 28 22:04:30 klimt kernel: FS: 0000000000000000(0000) GS:ffff88087bdc0000(0000) knlGS:0000000000000000 Oct 28 22:04:30 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Oct 28 22:04:30 klimt kernel: CR2: 00007ff020389000 CR3: 0000000001c06000 CR4: 00000000001406e0 Oct 28 22:04:30 klimt kernel: Stack: Oct 28 22:04:30 klimt kernel: ffff88084e667020 0000000000000000 ffff88084e6670d8 ffff8808420dbd20 Oct 28 22:04:30 klimt kernel: ffffffffa05ac80d ffff880835f54548 ffff88084e640008 ffff880835f545b0 Oct 28 22:04:30 klimt kernel: ffff8808420dbd70 ffffffffa059803d ffff880835f1c768 0000000000000870 Oct 28 22:04:30 klimt kernel: Call Trace: Oct 28 22:04:30 klimt kernel: [<ffffffffa05ac80d>] nfsd4_free_stateid+0xfd/0x1b0 [nfsd] Oct 28 22:04:30 klimt kernel: [<ffffffffa059803d>] nfsd4_proc_compound+0x40d/0x690 [nfsd] Oct 28 22:04:30 klimt kernel: [<ffffffffa0583114>] nfsd_dispatch+0xd4/0x1d0 [nfsd] Oct 28 22:04:30 klimt kernel: [<ffffffffa047bbf9>] svc_process_common+0x3d9/0x700 [sunrpc] Oct 28 22:04:30 klimt kernel: [<ffffffffa047ca64>] svc_process+0xf4/0x330 [sunrpc] Oct 28 22:04:30 klimt kernel: [<ffffffffa05827ca>] nfsd+0xfa/0x160 [nfsd] Oct 28 22:04:30 klimt kernel: [<ffffffffa05826d0>] ? nfsd_destroy+0x170/0x170 [nfsd] Oct 28 22:04:30 klimt kernel: [<ffffffff810b367b>] kthread+0x10b/0x120 Oct 28 22:04:30 klimt kernel: [<ffffffff810b3570>] ? 
kthread_stop+0x280/0x280 Oct 28 22:04:30 klimt kernel: [<ffffffff8174e8ba>] ret_from_fork+0x2a/0x40 Oct 28 22:04:30 klimt kernel: Code: c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 87 b0 00 00 00 48 89 fb 4c 8b a0 98 00 00 00 <49> 8b 44 24 20 48 8d b8 80 03 00 00 e8 10 66 1a e1 48 89 df e8 Oct 28 22:04:30 klimt kernel: RIP [<ffffffffa05a759f>] release_lock_stateid+0x1f/0x60 [nfsd] Oct 28 22:04:30 klimt kernel: RSP <ffff8808420dbce0> Oct 28 22:04:30 klimt kernel: ---[ end trace cf5d0b371973e167 ]--- Jeff Layton says: > Hm...now that I look though, this is a little suspicious: > > struct nfs4_openowner *oo = openowner(stp->st_openstp->st_stateowner); > > I wonder if it's possible for the openstateid to have already been > destroyed at this point. > > We might be better off doing something like this to get the client pointer: > > stp->st_stid.sc_client; > > ...which should be more direct and less dependent on other stateids > staying valid. With the suggested change, I am no longer able to reproduce the above oops. v2: Fix unhash_lock_stateid() as well Fix-suggested-by: Jeff Layton <jlayton@redhat.com> Fixes: `42691398be` ('nfsd: Fix race between FREE_STATEID and LOCK') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-11-01 15:24:43 -04:00
Christoph Hellwig	be297968da	mm: only include blk_types in swap.h if CONFIG_SWAP is enabled It's only needed for the CONFIG_SWAP-only use of bio_end_io_t. Because CONFIG_SWAP implies CONFIG_BLOCK this will allow to drop some ifdefs in blk_types.h. Instead we'll need to add a few explicit includes that were implicit before, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
Christoph Hellwig	2f8b544477	block,fs: untangle fs.h and blk_types.h Nothing in fs.h should require blk_types.h to be included. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
Christoph Hellwig	70fd76140a	block,fs: use REQ_* flags directly Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
Christoph Hellwig	67f055c798	btrfs: use op_is_sync to check for synchronous requests Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-01 09:43:26 -06:00
Andrey Vagin	c62cce2cae	net: add an ioctl to get a socket network namespace Each socket operates in a network namespace where it has been created, so if we want to dump and restore a socket, we have to know its network namespace. We have a socket_diag to get information about sockets, it doesn't report sockets which are not bound or connected. This patch introduces a new socket ioctl, which is called SIOCGSKNS and used to get a file descriptor for a socket network namespace. A task must have CAP_NET_ADMIN in a target network namespace to use this ioctl. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-31 10:56:36 -04:00
Miklos Szeredi	641089c154	ovl: fsync after copy-up Make sure the copied up file hits the disk before renaming to the final destination. If this is not done then the copy-up may corrupt the data in the file in case of a crash. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org>	2016-10-31 14:42:14 +01:00
Miklos Szeredi	b93d4a0eb3	ovl: fix get_acl() on tmpfs tmpfs doesn't have ->get_acl() because it only uses cached acls. This fixes the acl tests in pjdfstest when tmpfs is used as the upper layer of the overlay. Reported-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `39a25b2b37` ("ovl: define ->get_acl() for overlay inodes") Cc: <stable@vger.kernel.org> # v4.8	2016-10-31 14:42:14 +01:00
Miklos Szeredi	fd3220d37b	ovl: update S_ISGID when setting posix ACLs This change fixes xfstest generic/375, which failed to clear the setgid bit in the following test case on overlayfs: touch $testfile chown 100:100 $testfile chmod 2755 $testfile _runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile Reported-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Amir Goldstein <amir73il@gmail.com> Fixes: `d837a49bd5` ("ovl: fix POSIX ACL setting") Cc: <stable@vger.kernel.org> # v4.8	2016-10-31 14:42:14 +01:00
Jan Kara	70fe2f4815	aio: fix freeze protection of aio writes Currently we dropped freeze protection of aio writes just after IO was submitted. Thus aio write could be in flight while the filesystem was frozen and that could result in unexpected situation like aio completion wanting to convert extent type on frozen filesystem. Testcase from Dmitry triggering this is like: for ((i=0;i<60;i++));do fsfreeze -f /mnt ;sleep 1;fsfreeze -u /mnt;done & fio --bs=4k --ioengine=libaio --iodepth=128 --size=1g --direct=1 \ --runtime=60 --filename=/mnt/file --name=rand-write --rw=randwrite Fix the problem by dropping freeze protection only once IO is completed in aio_complete(). Reported-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz> [hch: forward ported on top of various VFS and aio changes] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-30 13:09:42 -04:00
Christoph Hellwig	89319d31d2	fs: remove aio_run_iocb Pass the ABI iocb structure to aio_setup_rw and let it handle the non-vectored I/O case as well. With that and a new helper for the AIO return value handling we can now define new aio_read and aio_write helpers that implement reads and writes in a self-contained way without duplicating too much code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-30 13:09:42 -04:00
Christoph Hellwig	723c038475	fs: remove the never implemented aio_fsync file operation Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-30 13:09:42 -04:00
Christoph Hellwig	0b944d3a4b	aio: hold an extra file reference over AIO read/write operations Otherwise we might dereference an already freed file and/or inode when aio_complete is called before we return from the read_iter or write_iter method. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-30 13:09:42 -04:00
David S. Miller	27058af401	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-30 12:42:58 -04:00
Linus Torvalds	2a26d99b25	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "Lots of fixes, mostly drivers as is usually the case. 1) Don't treat zero DMA address as invalid in vmxnet3, from Alexey Khoroshilov. 2) Fix element timeouts in netfilter's nft_dynset, from Anders K. Pedersen. 3) Don't put aead_req crypto struct on the stack in mac80211, from Ard Biesheuvel. 4) Several uninitialized variable warning fixes from Arnd Bergmann. 5) Fix memory leak in cxgb4, from Colin Ian King. 6) Fix bpf handling of VLAN header push/pop, from Daniel Borkmann. 7) Several VRF semantic fixes from David Ahern. 8) Set skb->protocol properly in ip6_tnl_xmit(), from Eli Cooper. 9) Socket needs to be locked in udp_disconnect(), from Eric Dumazet. 10) Div-by-zero on 32-bit fix in mlx4 driver, from Eugenia Emantayev. 11) Fix stale link state during failover in NCSCI driver, from Gavin Shan. 12) Fix netdev lower adjacency list traversal, from Ido Schimmel. 13) Provide proper handle when emitting notifications of filter deletes, from Jamal Hadi Salim. 14) Memory leaks and big-endian issues in rtl8xxxu, from Jes Sorensen. 15) Fix DESYNC_FACTOR handling in ipv6, from Jiri Bohac. 16) Several routing offload fixes in mlxsw driver, from Jiri Pirko. 17) Fix broadcast sync problem in TIPC, from Jon Paul Maloy. 18) Validate chunk len before using it in SCTP, from Marcelo Ricardo Leitner. 19) Revert a netns locking change that causes regressions, from Paul Moore. 20) Add recursion limit to GRO handling, from Sabrina Dubroca. 21) GFP_KERNEL in irq context fix in ibmvnic, from Thomas Falcon. 22) Avoid accessing stale vxlan/geneve socket in data path, from Pravin Shelar" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (189 commits) geneve: avoid using stale geneve socket. vxlan: avoid using stale vxlan socket. 
qede: Fix out-of-bound fastpath memory access net: phy: dp83848: add dp83822 PHY support enic: fix rq disable tipc: fix broadcast link synchronization problem ibmvnic: Fix missing brackets in init_sub_crq_irqs ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context Revert "ibmvnic: Fix releasing of sub-CRQ IRQs in interrupt context" arch/powerpc: Update parameters for csum_tcpudp_magic & csum_tcpudp_nofold net/mlx4_en: Save slave ethtool stats command net/mlx4_en: Fix potential deadlock in port statistics flow net/mlx4: Fix firmware command timeout during interrupt test net/mlx4_core: Do not access comm channel if it has not yet been initialized net/mlx4_en: Fix panic during reboot net/mlx4_en: Process all completions in RX rings after port goes up net/mlx4_en: Resolve dividing by zero in 32-bit system net/mlx4_core: Change the default value of enable_qos net/mlx4_core: Avoid setting ports to auto when only one port type is supported net/mlx4_core: Fix the resource-type enum in res tracker to conform to FW spec ...	2016-10-29 20:33:20 -07:00
Linus Torvalds	efa563752c	This pull request contains fixes for issues in both UBI and UBIFS: - A regression wrt. overlayfs, introduced in -rc2. - An UBI issue, found by Dan Carpenter's static checker. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABAgAGBQJYFPHWAAoJEEtJtSqsAOnWcK4P/AwBcqPa0em/HXrdCExanQXY 8U3uCPbDua4sW1Eaw5dVFoZuVoPzhibLLaVoVIWs8LOXiD8v23VYQ8ezu0D0O9fc cAsrxg0MtQLF/hyyVbdihxaqCB2H/j9PDJdIdCiRindPEwm0k6KBkVMk3N8O3m2U xDSA+Oq8Ns5cgjx+yfOhMJbGOFUzky26SV/M+PTAIU9Sj2w7RJS9R18BtWv4EFoK q1sT8aEte3kryb+v/a4s9RNzWOOHqRvZ4XizOMvma9I6uX6hOU4oeLknmJx1gPnb U5z75uAVn+IeNRnrco3pD91N3X9hEtv4IgZhFafNseVTY9MirDX5ss4th+XrSM6y wKgWEC8UmcV9Y7zDV/towZjhCipIh1yJPu3493IVHB/1UDPoNDfOGpK6NuhIEZHy 1sNY8F2j3BBnLw6Fc2uC1FxM3a9MQ9CgJWQ0y9src73VNgQ8miz1WH2rsFp5DwNu HdZGBXGElmhbJbNFSsRqC1j+K0Y2LzL5BVOrBblkJNpUmxufRx0LIdXE7p4tPazq 8dVOH/Ktx+mDQFbtyA8vXK+Cyyp0c/snR3BZo3AWLfrlip6iwZPG6arN4Wu6P4Nl ZFWUlHKaMJS/lvsdAuCdZ/lawRvENTOEQMORJR8U7CX/7gDLV1KiaFRpB3fFDUW5 xm5r2qsbVzElu6skk4xk =eOKJ -----END PGP SIGNATURE----- Merge tag 'upstream-4.9-rc3' of git://git.infradead.org/linux-ubifs Pull ubi/ubifs fixes from Richard Weinberger: "This contains fixes for issues in both UBI and UBIFS: - A regression wrt overlayfs, introduced in -rc2. - An UBI issue, found by Dan Carpenter's static checker" * tag 'upstream-4.9-rc3' of git://git.infradead.org/linux-ubifs: ubifs: Fix regression in ubifs_readdir() ubi: fastmap: Fix add_vol() return value test in ubi_attach_fastmap()	2016-10-29 13:15:24 -07:00
Linus Torvalds	c636e176d8	driver core fixes for 4.9-rc3 Here are two small driver core / kernfs fixes for 4.9-rc3. One makes the Kconfig entry for DEBUG_TEST_DRIVER_REMOVE a bit more explicit that this is a crazy thing to enable for a distro kernel (thanks for trying Fedora!), the other resolves an issue with vim opening kernfs files (sysfs, configfs, etc.). Both have been in linux-next with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iFYEABECABYFAlgU0CMPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspgv4AoJhR YJeG57ReBKjlzAj497Z1X7QcAJ9GXcbbbxmwj2IcUln5I3uEyuPCkQ== =pS6k -----END PGP SIGNATURE----- Merge tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are two small driver core / kernfs fixes for 4.9-rc3. One makes the Kconfig entry for DEBUG_TEST_DRIVER_REMOVE a bit more explicit that this is a crazy thing to enable for a distro kernel (thanks for trying Fedora!), the other resolves an issue with vim opening kernfs files (sysfs, configfs, etc.) Both have been in linux-next with no reported issues" * tag 'driver-core-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: driver core: Make Kconfig text for DEBUG_TEST_DRIVER_REMOVE stronger kernfs: Add noop_fsync to supported kernfs_file_fops	2016-10-29 10:57:40 -07:00
Al Viro	ad5cb123fd	ceph: switch to use of ->d_init() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-28 22:05:13 -04:00
Al Viro	18fc8abdb7	ceph: unify dentry_operations instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-28 21:52:50 -04:00
Linus Torvalds	f6167514c8	Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "My patch fixes the btrfs list_head abuse that we tracked down during Dave Jones' memory corruption investigation. With both Jens and my patches in place, I'm no longer able to trigger problems. Filipe is fixing a difficult old bug between snapshots, balance and send. Dave is cooking a few more for the next rc, but these are tested and ready" * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix races on root_log_ctx lists btrfs: fix incremental send failure caused by balance	2016-10-28 10:07:35 -07:00
Christoph Hellwig	ef295ecf09	block: better op and flags encoding Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventually also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32 bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-10-28 08:48:16 -06:00
Richard Weinberger	a00052a296	ubifs: Fix regression in ubifs_readdir() Commit `c83ed4c9db` ("ubifs: Abort readdir upon error") broke overlayfs support because the fix exposed an internal error code to VFS. Reported-by: Peter Rosin <peda@axentia.se> Tested-by: Peter Rosin <peda@axentia.se> Reported-by: Ralph Sennhauser <ralph.sennhauser@gmail.com> Tested-by: Ralph Sennhauser <ralph.sennhauser@gmail.com> Fixes: `c83ed4c9db` ("ubifs: Abort readdir upon error") Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-28 14:48:31 +02:00
Linus Torvalds	14970f204b	Merge branch 'akpm' (patches from Andrew) Merge misc fixes from Andrew Morton: "20 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk cris/arch-v32: cryptocop: print a hex number after a 0x prefix ipack: print a hex number after a 0x prefix block: DAC960: print a hex number after a 0x prefix fs: exofs: print a hex number after a 0x prefix lib/genalloc.c: start search from start of chunk mm: memcontrol: do not recurse in direct reclaim CREDITS: update credit information for Martin Kepplinger proc: fix NULL dereference when reading /proc/<pid>/auxv mm: kmemleak: ensure that the task stack is not freed during scanning lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB latent_entropy: raise CONFIG_FRAME_WARN by default kconfig.h: remove config_enabled() macro ipc: account for kmem usage on mqueue and msg mm/slab: improve performance of gathering slabinfo stats mm: page_alloc: use KERN_CONT where appropriate mm/list_lru.c: avoid error-path NULL pointer deref h8300: fix syscall restarting kcov: properly check if we are in an interrupt mm/slab: fix kmemcg cache creation delayed issue	2016-10-27 19:58:39 -07:00
Uwe Kleine-König	14f947c87a	fs: exofs: print a hex number after a 0x prefix It makes the message hard to interpret correctly if a base 10 number is prefixed by 0x. So change to a hex number. Link: http://lkml.kernel.org/r/20161026125658.25728-2-u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Boaz Harrosh <ooo@electrozaur.com> Cc: Benny Halevy <bhalevy@primarydata.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-27 18:43:43 -07:00
Leon Yu	06b2849d10	proc: fix NULL dereference when reading /proc/<pid>/auxv Reading auxv of any kernel thread results in NULL pointer dereferencing in auxv_read() where mm can be NULL. Fix that by checking for NULL mm and bailing out early. This is also the original behavior changed by recent commit `c531716785` ("proc: switch auxv to use of __mem_open()"). # cat /proc/2/auxv Unable to handle kernel NULL pointer dereference at virtual address 000000a8 Internal error: Oops: 17 [#1] PREEMPT SMP ARM CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1 Hardware name: BCM2709 task: ea3b0b00 task.stack: e99b2000 PC is at auxv_read+0x24/0x4c LR is at do_readv_writev+0x2fc/0x37c Process cat (pid: 113, stack limit = 0xe99b2210) Call chain: auxv_read do_readv_writev vfs_readv default_file_splice_read splice_direct_to_actor do_splice_direct do_sendfile SyS_sendfile64 ret_fast_syscall Fixes: `c531716785` ("proc: switch auxv to use of __mem_open()") Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com Signed-off-by: Leon Yu <chianglungyu@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Janis Danisevskis <jdanis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-27 18:43:43 -07:00
Johannes Berg	56989f6d85	genetlink: mark families as __ro_after_init Now genl_register_family() is the only thing (other than the users themselves, perhaps, but I didn't find any doing that) writing to the family struct. In all families that I found, genl_register_family() is only called from __init functions (some indirectly, in which case I've added __init annotations to clarify things), so all can actually be marked __ro_after_init. This protects the data structure from accidental corruption. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Johannes Berg	489111e5c2	genetlink: statically initialize families Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Johannes Berg	a07ea4d994	genetlink: no longer support using static family IDs Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-27 16:16:09 -04:00
Linus Torvalds	e3300ffef0	orangefs: a couple of cleanups sent in by other developers use d_fsdata instead of d_time Miklos Szeredi <mszeredi@redhat.com> use file_inode(file) instead of file->f_path.dentry->d_inode Amir Goldstein <amir73il@gmail.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABAgAGBQJYD65MAAoJEM9EDqnrzg2+kaUP/0HPDYJyWSgbGVSKuNqOiyml VbAGRbDAcpyYCFww2cRO9Xvvh6bJmGEqZUUbNxgi3q5L2KnvvoQ0jkHFfHaVii53 uWP0WGrxBcRNxv72jfo1cBxYTcTqEfzXZBQb6HhzfbjMCvejbhSbYDowElTE7Oar AwcgEdv0Utm7zD/0K+OW56Q4fUYzOSFI4c/tNGUyjQCLE+N3R2roXdivz3maEfee uDg262lfQgkzbEYGJOdt8MpUak6YEp2bFa+Xf8bRoKMze8KbVDLwuTlYXuSdc/i8 e8QO/Zr+irX/jJ/Sc998FwGquUljPuxz4wHSNEVO3HqYFIe30zkUD0mqQcxqx6YD F4DhSn8Ok5PuKv5aw1Q7AMA0Zd+bKaJzb/E0JdlHn1n9PFMiod82rdTfmGxP1rZb BwuOW/dsp/RLBZhCYpkNTBiNAH+TSIp8M7eOavO68AZ2zJXN69e/Qv2iJsaAZJZ0 of+i9I4kmXUS4F6OPjgT6xJbH4aD/X4/jei4dPKDATXM0MW+GsZ7VodAYmAqGGCO l66UoL4o11BCMJNGfsdPxWJkUgpn4OBb+RSkS0f6qQ7Nlp1OaYeRYKNbX5ICHcgj A0PHXZ8Pub3iVgX5xUrQmYk3txbLt0ISDYBXzfPZ0rreztN0o5FRB4TNVLC82VwJ XHBdehhgLsNc1PMKSzZo =b9Ly -----END PGP SIGNATURE----- Merge tag 'for-linus-4.9-rc2-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull oreangefs updates from Mike Marshall: "A couple of orangefs cleanups sent in by other developers: - use d_fsdata instead of d_time (Miklos Szeredi) - use file_inode(file) instead of file->f_path.dentry->d_inode (Amir Goldstein)" * tag 'for-linus-4.9-rc2-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: don't use d_time orangefs: user file_inode() where it is due	2016-10-27 12:52:46 -07:00
Linus Torvalds	e890038e6a	xfs: updates for 4.9-rc3 Changes in this update: o iomap page offset masking fix for page faults o add IOMAP_REPORT to distinguish between read and fiemap map requests o cleanups to new shared data extent code o fix mount active status on failed log recovery o fix broken dquots in a buffer calculation o fix locking order issues and merge xfs_reflink_remap_range and xfs_file_share_range o rework unmapping of CoW extents and remove now unused functions o clean state when CoW is done. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJYEfdWAAoJEK3oKUf0dfoddg4P/0Tl/i58sBL/Um90kSGOjxjI yaOKuFImS3MFSYDwiYADnXdhq6BgVLUJWS07t9/P6Nn3OZr1wBCZDZdyRS1+JwAA qOui4sp/v21HprydscN+BAdxyYmuo4yFgu9lkFFSM55yiaAb5C8hsYKF42Gja1+m gS40/Lsa5nauSz58UOZ5oEljAvBldAdyMlk8rVSGXVm7+pqs7Lxmhjif/Y8y/Y+i 097auIrGk+oRDukXqhtZyCQ7VP99WzM+ksajtrNwVOOzSMhrcDCHKuLe0i4LsyjN UTx1ioY/AD8PUYhSmLqALD9vtFHnJbx50/MQFHNLc+hDQb2jb/jQmqx9LyEYDt38 sw/Wy55hh9PylILdE//bWH0vSgqmnNCWviBUzjDtAJ9FKfv19slFlwtu2K4lOHoq C6Q2uh2mB7BC6efksk9DeA6/N9tFQuiXa48sN5+D2zMfZAmdkgzDCKfGrpRnS1Yl 4h+sfiK/DTf11Q2nTaPAHylt02SmHsikQWvb5Fxu76UI8k4RsjCZc3ep/NUNJBlU E8f+cdNlAF5k/AWBY7107N1iUqL/vS2wXLdburJkckmQqRcI5WuRaLhi9g4tFjFI o+m9EM1WuOP6jeOuVImwgCRJoLVnTVKwee/d4J8y9Ad//Rs6B9pB0SIDfxJa9LY6 B1XjT8z/NVyK6GsfP1Qs =LDDu -----END PGP SIGNATURE----- Merge tag 'xfs-fixes-for-linus-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs fixes from Dave Chinner: "This update contains fixes for most of the outstanding regressions introduced with the 4.9-rc1 XFS merge. There is also a fix for an iomap bug, too. This is a quite a bit larger than I'd prefer for a -rc3, but most of the change comes from cleaning up the new reflink copy on write code; it's much simpler and easier to understand now. These changes fixed several bugs in the new code, and it wasn't clear that there was an easier/simpler way to fix them. The rest of the fixes are the usual size you'd expect at this stage. 
I've left the commits to soak in linux-next for some extra time because of the size before asking you to pull; no new problems with them have been reported so I think it's all OK. Summary: - iomap page offset masking fix for page faults - add IOMAP_REPORT to distinguish between read and fiemap map requests - cleanups to new shared data extent code - fix mount active status on failed log recovery - fix broken dquots in a buffer calculation - fix locking order issues and merge xfs_reflink_remap_range and xfs_file_share_range - rework unmapping of CoW extents and remove now unused functions - clean state when CoW is done" * tag 'xfs-fixes-for-linus-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (25 commits) xfs: clear cowblocks tag when cow fork is emptied xfs: fix up inode cowblocks tracking tracepoints fs: Do to trim high file position bits in iomap_page_mkwrite_actor xfs: remove xfs_bunmapi_cow xfs: optimize xfs_reflink_end_cow xfs: optimize xfs_reflink_cancel_cow_blocks xfs: refactor xfs_bunmapi_cow xfs: optimize writes to reflink files xfs: don't bother looking at the refcount tree for reads xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared xfs: add xfs_trim_extent iomap: add IOMAP_REPORT xfs: merge xfs_reflink_remap_range and xfs_file_share_range xfs: remove xfs_file_wait_for_io xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range xfs: fix the same_inode check in xfs_file_share_range xfs: remove the same fs check from xfs_file_share_range libxfs: v3 inodes are only valid on crc-enabled filesystems libxfs: clean up _calc_dquots_per_chunk xfs: unset MS_ACTIVE if mount fails ...	2016-10-27 12:34:50 -07:00
Chris Mason	570dd45042	btrfs: fix races on root_log_ctx lists btrfs_remove_all_log_ctxs takes a shortcut where it avoids walking the list because it knows all of the waiters are patiently waiting for the commit to finish. But, there's a small race where btrfs_sync_log can remove itself from the list if it finds a log commit is already done. Also, it uses list_del_init() to remove itself from the list, but there's no way to know if btrfs_remove_all_log_ctxs has already run, so we don't know for sure if it is safe to call list_del_init(). This gets rid of all the shortcuts for btrfs_remove_all_log_ctxs(), and just calls it with the proper locking. This is part two of the corruption fixed by `cbd60aa7cd`. I should have done this in the first place, but convinced myself the optimizations were safe. A 12 hour run of dbench 2048 will eventually trigger a list debug WARN_ON for the list_del_init() in btrfs_sync_log(). Fixes: `d1433debe7` Reported-by: Dave Jones <davej@codemonkey.org.uk> cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Chris Mason <clm@fb.com>	2016-10-27 10:42:20 -07:00
Tony Luck	2a9becdd4d	kernfs: Add noop_fsync to supported kernfs_file_fops If you edit a kernfs backed file with vi(1), you see an ugly error message when you write the file because vi tries to fsync(2) the file after writing, which fails. We have noop_fsync() for this, use it. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-10-27 17:47:11 +02:00
Linus Torvalds	272ddc8b37	proc: don't use FOLL_FORCE for reading cmdline and environment Now that Lorenzo cleaned things up and made the FOLL_FORCE users explicit, it becomes obvious how some of them don't really need FOLL_FORCE at all. So remove FOLL_FORCE from the proc code that reads the command line and arguments from user space. The mem_rw() function actually does want FOLL_FORCE, because gdb (and possibly many other debuggers) use it as a much more convenient version of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part conditional on actually being a ptracer. This does not actually do that, just adds a comment to that effect and moves the gup_flags settings next to each other. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-24 19:00:44 -07:00
Jeff Layton	0cc11a61b8	nfsd: move blocked lock handling under a dedicated spinlock Bruce was hitting some lockdep warnings in testing, showing that we could hit a deadlock with the new CB_NOTIFY_LOCK handling, involving a rather complex situation involving four different spinlocks. The crux of the matter is that we end up taking the nn->client_lock in the lm_notify handler. The simplest fix is to just declare a new per-nfsd_net spinlock to protect the new CB_NOTIFY_LOCK structures. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-10-24 16:51:21 -04:00
Miklos Szeredi	804b1737d7	orangefs: don't use d_time Instead use d_fsdata which is the same size. Hoping to get rid of d_time, which is used by very few filesystems by this time. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2016-10-24 14:50:07 -04:00
Amir Goldstein	d62a9025ae	orangefs: user file_inode() where it is due Replace wrong use of file->f_path.dentry->d_inode with file_inode(file). In case orangefs ever finds itself as an overlayfs layer, it would want to get its own inode and not overlayfs's inode. DISCLAIMER: I did not test this patch because I do not know how to set up an orangefs mount Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2016-10-24 14:29:39 -04:00
Arnd Bergmann	68a564006a	NFSv4.1: work around -Wmaybe-uninitialized warning A bugfix introduced a harmless gcc warning in nfs4_slot_seqid_in_use if we enable -Wmaybe-uninitialized again: fs/nfs/nfs4session.c:203:54: error: 'cur_seq' may be used uninitialized in this function [-Werror=maybe-uninitialized] gcc is not smart enough to conclude that the IS_ERR/PTR_ERR pair results in a nonzero return value here. Using PTR_ERR_OR_ZERO() instead makes this clear to the compiler. The warning originally did not appear in v4.8 as it was globally disabled, but the bugfix that introduced the warning got backported to stable kernels which again enable it, and this is now the only warning in the v4.7 builds. Fixes: `e09c978aae` ("NFSv4.1: Fix Oopsable condition in server callback races") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-10-24 13:54:43 -04:00
Wang Xiaoguang	9d1032cc49	btrfs: fix WARNING in btrfs_select_ref_head() This issue was found when testing in-band dedupe ENOSPC behaviour. Sometimes run_one_delayed_ref() may fail with ENOSPC; __btrfs_run_delayed_refs() will then return but forget to add num_heads_ready back, which will trigger "WARN_ON(delayed_refs->num_heads_ready == 0)" in btrfs_select_ref_head(). Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-24 18:20:29 +02:00
Dan Carpenter	9c894696f5	Btrfs: remove some no-op casts We cast 0 to a u8 but then because of type promotion, it's immediately cast back to int before we do a bitwise negate. The cast doesn't matter in this case, the code works as intended. It causes a static checker warning though so let's remove it. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-24 18:20:29 +02:00
Wang Xiaoguang	dd4b857aab	btrfs: pass correct args to btrfs_async_run_delayed_refs() In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), arg2 and arg3 are swapped by mistake; fix this. This bug only affects asynchronous delayed refs handling when we truncate inodes. In delayed_ref_async_start(), there is the following code: trans = btrfs_join_transaction(async->root); if (trans->transid > async->transid) goto end; ret = btrfs_run_delayed_refs(trans, async->root, async->count); From this code, we can see that the bug only influences whether we handle delayed refs at all, or the number of delayed refs we handle. This may impact performance, but will not result in missing delayed refs: all delayed refs will be handled in btrfs_commit_transaction(). Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-24 18:20:29 +02:00
Wang Xiaoguang	69ae5e4459	btrfs: make file clone aware of fatal signals This just makes the behavior similar to xfs when the process has fatal signals pending, and it'll make fstests/generic/298 happy. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-24 18:20:29 +02:00
Goldwyn Rodrigues	0b34c261e2	btrfs: qgroup: Prevent qgroup->reserved from going subzero While free'ing qgroup->reserved resources, we must check if the page has not been invalidated by a truncate operation by checking if the page is still dirty before reducing the qgroup resources. Resources in such a case are free'd when the entire extent is released by delayed_ref. This fixes a double accounting while releasing resources in case of truncating a file, reproduced by the following testcase. SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/mnt mkfs.btrfs -f $SCRATCH_DEV mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT cd $SCRATCH_MNT btrfs quota enable $SCRATCH_MNT btrfs subvolume create a btrfs qgroup limit 500m a $SCRATCH_MNT sync for c in {1..15}; do dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; done sleep 10 sync sleep 5 touch $SCRATCH_MNT/a/newfile echo "Removing file" rm $SCRATCH_MNT/a/file Fixes: `b9d0b38928` ("btrfs: Add handler for invalidate page") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-24 18:20:21 +02:00
Benjamin Coddington	86a6c211d6	NFS: Trim extra slash in v4 nfs_path An NFSv4 mount of a subdirectory will show an extra slash (as in 'server://path') in proc's mountinfo which will not match the device name and path. This can cause problems for programs searching for the mount. Fix this by checking for a leading slash in the dentry path and, if there is one, trimming away any trailing slashes in the device name. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-10-24 12:06:01 -04:00
Wei Yongjun	26c1ec2fe4	dlm: fix error return code in sctp_accept_from_sock() Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David Teigland <teigland@redhat.com>	2016-10-24 10:01:51 -05:00
Mauro Carvalho Chehab	8c27ceff36	docs: fix locations of several documents that got moved The previous patch renamed several files that are cross-referenced along the Kernel documentation. Adjust the links to point to the right places. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>	2016-10-24 08:12:35 -02:00
Darrick J. Wong	b77428b12b	xfs: defer should abort intent items if the trans roll fails If the deferred ops transaction roll fails, we need to abort the intent items if we haven't already logged a done item for it, regardless of whether or not the deferred ops has had a transaction committed. Dave found this while running generic/388. Move the tracepoint to make it easier to track object lifetimes. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-24 14:21:18 +11:00
Brian Foster	c17a8ef43d	xfs: clear cowblocks tag when cow fork is emptied The background cowblocks scan job takes care of scanning for inodes with potentially lingering blocks in the cow fork and clearing them out. If the background scanner reclaims the cow fork blocks, however, it doesn't immediately clear the cowblocks tag from the inode. Instead, the inode remains tagged until the background scanner comes around again, discovers the inode cow fork has no blocks, clears the tag and fires the trace_xfs_inode_free_cowblocks_invalid() tracepoint to indicate that the inode may have been incorrectly tagged. This is not a major functional problem as the tag is ultimately cleared. Nonetheless, clear the tag when an inode cow fork is explicitly emptied to avoid the extra round trip through the background scanner and spurious "invalid" tracepoint. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-24 14:21:08 +11:00
Brian Foster	7b7381f043	xfs: fix up inode cowblocks tracking tracepoints These calls are still using the eofblocks tracepoints. The cowblocks equivalents are already defined, we just aren't actually calling them. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-24 14:21:00 +11:00
Jan Kara	c663e29f88	fs: Do not trim high file position bits in iomap_page_mkwrite_actor iomap_page_mkwrite_actor() calls __block_write_begin_int() with position masked as pos & ~PAGE_MASK which is equivalent to pos & (PAGE_SIZE-1). Thus it masks off high bits of file position. However __block_write_begin_int() expects full file position on input. This does not cause any visible issues because all __block_write_begin_int() really cares about are low file position bits but still it is a bug waiting to happen. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-24 14:20:25 +11:00
Linus Torvalds	5ff93abc7a	This pull requests contains fixes for issues in both UBI and UBIFS: - Fallout from the merge window, refactoring UBI code introduced some issues. - Fixes for an UBIFS readdir bug which can cause getdents() to busy loop for ever and a bug in the UBIFS xattr code. Merge tag 'upstream-4.9-rc2' of git://git.infradead.org/linux-ubifs Pull UBI[FS] fixes from Richard Weinberger: "This contains fixes for issues in both UBI and UBIFS: - Fallout from the merge window, refactoring UBI code introduced some issues. - Fixes for an UBIFS readdir bug which can cause getdents() to busy loop for ever and a bug in the UBIFS xattr code" * tag 'upstream-4.9-rc2' of git://git.infradead.org/linux-ubifs: ubifs: Abort readdir upon error UBI: Fix crash in try_recover_peb() ubi: fix swapped arguments to call to ubi_alloc_aeb ubifs: Fix xattr_names length in exit paths ubifs: Rename ubifs_rename2	2016-10-23 16:58:55 -07:00
Linus Torvalds	c761923cb8	A few bug fixes and add some missing KERN_CONT annotations Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A few bug fixes and add some missing KERN_CONT annotations" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: add missing KERN_CONT to a few more debugging uses fscrypto: lock inode while setting encryption policy ext4: correct endianness conversion in __xattr_check_inode() fscrypto: make XTS tweak initialization endian-independent ext4: do not advertise encryption support when disabled jbd2: fix incorrect unlock on j_list_lock ext4: super.c: Update logging style using KERN_CONT	2016-10-23 16:52:19 -07:00
Linus Torvalds	86c5bf7101	Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull vmap stack fixes from Ingo Molnar: "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack accesses that used to be just somewhat questionable are now totally buggy. These changes try to do it without breaking the ABI: the fields are left there, they are just reporting zero, or reporting narrower information (the maps file change)" * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm: Change vm_is_stack_for_task() to vm_is_stack_for_current() fs/proc: Stop trying to report thread stacks fs/proc: Stop reporting eip and esp in /proc/PID/stat mm/numa: Remove duplicated include from mprotect.c	2016-10-22 09:39:10 -07:00
Linus Torvalds	02593ac680	NFS client bugfixes for Linux 4.9 Stable bugfix: - Fix last_write_offset incorrectly set to page boundary Other bugfix: - Fix missing-braces warning Merge tag 'nfs-for-4.9-2' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client bugfixes from Anna Schumaker: "Just two bugfixes this time: Stable bugfix: - Fix last_write_offset incorrectly set to page boundary Other bugfix: - Fix missing-braces warning" * tag 'nfs-for-4.9-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: nfs4: fix missing-braces warning pnfs/blocklayout: fix last_write_offset incorrectly set to page boundary	2016-10-21 19:06:59 -07:00
Linus Torvalds	bdcff41597	rbd exclusive-lock edge case fix and several filesystem fixups. Nikolay's error path patch is tagged for stable, everything else but readdir vs frags race was introduced in 4.9-rc1. Merge tag 'ceph-for-4.9-rc2' of git://github.com/ceph/ceph-client Pull Ceph fixes from Ilya Dryomov: "An rbd exclusive-lock edge case fix and several filesystem fixups. Nikolay's error path patch is tagged for stable, everything else but readdir vs frags race was introduced in this merge window" * tag 'ceph-for-4.9-rc2' of git://github.com/ceph/ceph-client: ceph: fix non static symbol warning ceph: fix uninitialized dentry pointer in ceph_real_mount() ceph: fix readdir vs fragmentation race ceph: fix error handling in ceph_read_iter rbd: don't retry watch reregistration if header object is gone rbd: don't wait for the lock forever if blacklisted	2016-10-20 09:57:51 -07:00
Linus Torvalds	a28ad14e05	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc filesystem fixes from Jan Kara: "A fix for an isofs change apparently breaking mount(8) in some cases and one ext2 warning fix" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: avoid bogus -Wmaybe-uninitialized warning isofs: Do not return EACCES for unknown filesystems	2016-10-20 08:49:03 -07:00
Andy Lutomirski	b18cb64ead	fs/proc: Stop trying to report thread stacks This reverts more of: `b76437579d` ("procfs: mark thread stack correctly in proc/<pid>/maps") ... which was partially reverted by: `65376df582` ("proc: revert /proc/<pid>/maps [stack:TID] annotation") Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps. In current kernels, /proc/PID/maps (or /proc/TID/maps even for threads) shows "[stack]" for VMAs in the mm's stack address range. In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the target thread's stack's VMA. This is racy, probably returns garbage and, on arches with CONFIG_THREAD_INFO_IN_TASK=y, is also crash-prone: KSTK_ESP is not safe to use on tasks that aren't known to be running ordinary process-context kernel code. This patch removes the difference and just shows "[stack]" for VMAs in the mm's stack range. This is IMO much more sensible -- the actual "stack" address really is treated specially by the VM code, and the current thread stack isn't even well-defined for programs that frequently switch stacks on their own. Reported-by: Jann Horn <jann@thejh.net> Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux API <linux-api@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tycho Andersen <tycho.andersen@canonical.com> Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-10-20 09:21:41 +02:00
Andy Lutomirski	0a1eb2d474	fs/proc: Stop reporting eip and esp in /proc/PID/stat Reporting these fields on a non-current task is dangerous. If the task is in any state other than normal kernel code, they may contain garbage or even kernel addresses on some architectures. (x86_64 used to do this. I bet lots of architectures still do.) With CONFIG_THREAD_INFO_IN_TASK=y, it can OOPS, too. As far as I know, there are no user programs that make any material use of these fields, so just get rid of them. Reported-by: Jann Horn <jann@thejh.net> Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux API <linux-api@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Tycho Andersen <tycho.andersen@canonical.com> Link: http://lkml.kernel.org/r/a5fed4c3f4e33ed25d4bb03567e329bc5a712bcc.1475257877.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-10-20 09:21:41 +02:00
Christoph Hellwig	64e6428ddd	xfs: remove xfs_bunmapi_cow Since no one uses it anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:54:59 +11:00
Christoph Hellwig	c1112b6e62	xfs: optimize xfs_reflink_end_cow Instead of doing a full extent list search for each extent that is to be deleted using xfs_bmapi_read and then doing another one inside of xfs_bunmapi_cow use the same scheme that xfs_bunmapi uses: look up the last extent to be deleted and then use the extent index to walk downward until we are outside the range to be deleted. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:54:45 +11:00
Christoph Hellwig	3e0ee78f7a	xfs: optimize xfs_reflink_cancel_cow_blocks Rewrite xfs_reflink_cancel_cow_blocks so that we only do a search for the first extent in the extent list and then iterate over the remaining extents using the extent index, passing the extent we operate on directly to xfs_bmap_del_extent_delay or xfs_bmap_del_extent_cow instead of going through xfs_bunmapi and doing yet another extent list lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:54:31 +11:00
Christoph Hellwig	fa5c836ca8	xfs: refactor xfs_bunmapi_cow Split out two helpers for deleting delayed or real extents from the COW fork. This allows to call them directly from xfs_reflink_cow_end_io once that function is refactored to iterate the extent tree. It will also allow to reuse the delalloc deletion from xfs_bunmapi in the future. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:54:14 +11:00
Christoph Hellwig	3ba020befe	xfs: optimize writes to reflink files Instead of reserving space as the first thing in write_begin move it past reading the extent in the data fork. That way we only have to read from the data fork once and can reuse that information for trimming the extent to the shared/unshared boundary. Additionally this allows to easily limit the actual write size to said boundary, and avoid a roundtrip on the ilock. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:53:50 +11:00
Christoph Hellwig	5f9268ca53	xfs: don't bother looking at the refcount tree for reads There is no need to trim an extent into a shared or non-shared one, or report any flags for plain old reads. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:53:32 +11:00
Christoph Hellwig	62c5ac89de	xfs: handle "raw" delayed extents xfs_reflink_trim_around_shared Delalloc extents in the extent list contain the number of reserved indirect blocks in their startblock value and don't use the magic DELAYSTARTBLOCK constant. Ensure that xfs_reflink_trim_around_shared handles them properly by checking for isnullstartblock(). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:52:00 +11:00
Darrick J. Wong	0a0af28cad	xfs: add xfs_trim_extent This helper allows trimming an extent to a subset of its original range while making sure the block numbers in it remain valid. In the future xfs_trim_extent and xfs_bmapi_trim_map should probably be merged in some form. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: split from a previous patch from Darrick, moved around and added support for "raw" delayed extents] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:51:50 +11:00
Christoph Hellwig	d33fd776f9	iomap: add IOMAP_REPORT This allows the file system to tell a FIEMAP from a read operation, and thus avoids the need to report flags that aren't actually used in the read path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:51:28 +11:00
Christoph Hellwig	5faaf4fa0a	xfs: merge xfs_reflink_remap_range and xfs_file_share_range There is no clear division of responsibility between those functions, so just merge them into one to keep the code simple. Also move xfs_file_wait_for_io to xfs_reflink.c together with its only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:50:07 +11:00
Christoph Hellwig	ec40759902	xfs: remove xfs_file_wait_for_io filemap_write_and_wait_range operates on full pages, so there is no need for the rounding operations. Additionally this allows us to micro-optimize by skipping the second inode_dio_wait for a intra-file clone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:49:55 +11:00
Christoph Hellwig	576177818e	xfs: move inode locking from xfs_reflink_remap_range to xfs_file_share_range We need the iolock protection to stabilize the IS_SWAPFILE and IS_IMMUTABLE values, as well as preventing new buffered writers re-dirtying the file data that we just wrote out. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:49:19 +11:00
Christoph Hellwig	a62e82b35b	xfs: fix the same_inode check in xfs_file_share_range The VFS i_ino is an unsigned long, while XFS inode numbers are 64-bit wide, so checking i_ino for equality could lead to rare false positives on 32-bit architectures. Just compare the inode pointers themselves to be safe. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:49:03 +11:00
Christoph Hellwig	4fbc2c6525	xfs: remove the same fs check from xfs_file_share_range The VFS already does the check, and the placement of this duplicate is in the way of the following locking rework. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:48:54 +11:00
Roger Willcocks	8cdcc8102c	libxfs: v3 inodes are only valid on crc-enabled filesystems xfs_repair was not detecting that version 3 inodes are invalid for non-CRC filesystems. The result is specific inode corruptions go undetected and hence aren't repaired if only the version number is out of range. The core of the problem is that the XFS_DINODE_GOOD_VERSION() macro doesn't know that valid inode versions are dependent on a superblock version number. Fix this in libxfs, and propagate the new function out into the rest of xfsprogs to fix the issue. [Darrick: port to kernel from xfsprogs] Reported-by: Leslie Rhorer <lrhorer@mygrande.net> Signed-off-by: Roger Willcocks <roger@filmlight.ltd.uk> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:48:38 +11:00
Darrick J. Wong	58d7896785	libxfs: clean up _calc_dquots_per_chunk The function xfs_calc_dquots_per_chunk takes a parameter in units of basic blocks. The kernel seems to get the units wrong, but userspace got 'fixed' by commenting out the unnecessary conversion. Fix both. cc: <stable@vger.kernel.org> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:46:18 +11:00
Darrick J. Wong	d099245297	xfs: unset MS_ACTIVE if mount fails As part of the inode block map intent log item recovery process, we had to set the IRECOVERY flag to prevent an unlinked inode from being truncated during the first iput call. This required us to set MS_ACTIVE so that iput puts the inode on the lru instead of immediately evicting the inode. Unfortunately, if the mount fails later on, the inodes that have been loaded (root dir and realtime) actually need to be evicted since we're aborting the mount. If we don't clear MS_ACTIVE in the failure step, those inodes are not evicted and therefore leak. The leak was found by running xfs/130 and rmmoding xfs immediately after the test. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:45:40 +11:00
Eric Sandeen	fe23759eaf	xfs: remove pointless error goto in xfs_bmap_remap_alloc The commit: `f65306ea` xfs: map an inode's offset to an exact physical block added a pointless error0: target; remove it. Addresses-Coverity-Id: 1373865 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:44:53 +11:00
Christoph Hellwig	0ee7a3f6b5	xfs: don't take the IOLOCK exclusive for direct I/O page invalidation XFS historically took the iolock exclusive when invalidating pages before direct I/O operations to protect against writeback starvations. But this writeback starvation issue was fixed a long time ago in the core writeback code, and all other file systems manage to do without the exclusive lock. Convert XFS over to avoid the exclusive lock in this case, and also move to range invalidations like done by the other file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:44:14 +11:00
Eric Biggers	f1b8243c55	xfs: add some 'static' annotations sparse reported that several variables and a function were not forward-declared anywhere and therefore should be 'static'. Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/xfs/' Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:42:30 +11:00
Geert Uytterhoeven	1be7f9be0e	xfs: Fix uninitialized variable in xfs_reflink_reserve_cow_range() with gcc 4.1.2: fs/xfs/xfs_reflink.c: In function xfs_reflink_reserve_cow_range: fs/xfs/xfs_reflink.c:327: warning: error may be used uninitialized in this function Indeed, if "count" is zero, the function will return an uninitialized error value. While "count" is unlikely to be zero, this function is called through the public iomap API. Hence fix this by preinitializing error to zero. Fixes: `2a06705cd5` ("xfs: create delalloc extents in CoW fork") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:41:48 +11:00
Colin Ian King	1d55a4bfd0	xfs: remove redundant assignment of ifp Remove redundant ifp = ifp statement, it does nothing. Found with static analysis by CoverityScan. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-20 15:40:55 +11:00
Richard Weinberger	c83ed4c9db	ubifs: Abort readdir upon error If UBIFS is facing an error while walking a directory, it reports this error and ubifs_readdir() returns the error code. But the VFS readdir logic does not make the getdents system call fail in all cases. When the readdir cursor indicates that more entries are present, the system call will just return and the libc wrapper will try again since it also knows that more entries are present. This causes the libc wrapper to busy loop forever when a directory is corrupted on UBIFS. A common approach to deal with corrupted directory entries is skipping them by setting the cursor to the next entry. On UBIFS this approach is not possible since we cannot compute the next directory entry cursor position without reading the current entry. So all we can do is set the cursor to the "no more entries" position and make getdents exit. Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-20 00:06:11 +02:00
Richard Weinberger	843741c577	ubifs: Fix xattr_names length in exit paths When the operation fails we also have to undo the changes we made to ->xattr_names. Otherwise listxattr() will report wrong lengths. Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-20 00:05:54 +02:00
Richard Weinberger	390975ac39	ubifs: Rename ubifs_rename2 Since ->rename2 is gone, rename ubifs_rename2() to ubifs_rename(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-20 00:05:47 +02:00
Arnd Bergmann	83aa3e0f79	nfs4: fix missing-braces warning A bugfix introduced a harmless warning for update_open_stateid: fs/nfs/nfs4proc.c:1548:2: error: missing braces around initializer [-Werror=missing-braces] Removing the zero in the initializer will do the right thing here and initialize the entire structure to zero. Fixes: `1393d9612b` ("NFSv4: Fix a race when updating an open_stateid") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-10-19 14:39:15 -04:00
Bob Peterson	aa9f101285	dlm: don't specify WQ_UNBOUND for the ast callback workqueue This patch removes the WQ_UNBOUND flag (which implies WQ_HIGHPRI) from the DLM's ast work queue, in favor of just WQ_HIGHPRI. This has been shown to cause a 19 percent performance increase for simultaneous inode creates on GFS2 with fs_mark. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2016-10-19 11:13:04 -05:00
Bob Peterson	d2fee58a3b	dlm: remove lock_sock to avoid scheduling while atomic Before this patch, functions save_callbacks and restore_callbacks called functions lock_sock and release_sock to prevent other processes from messing with the struct sock while the callbacks were saved and restored. However, function add_sock calls write_lock_bh prior to calling save_callbacks, which disables preemption. So the call to lock_sock would try to schedule when we can't schedule. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2016-10-19 11:00:03 -05:00
Bob Peterson	3735b4b9f1	dlm: don't save callbacks after accept When DLM calls accept() on a socket, the comm code copies the sk after we've saved its callbacks. Afterward, it calls add_sock which saves the callbacks a second time. Since the error reporting function lowcomms_error_report calls the previous callback too, this results in a recursive call to itself. This patch adds a new parameter to function add_sock to tell whether to save the callbacks. Function tcp_accept_from_sock (and its sctp counterpart) then calls it with false to avoid the recursion. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2016-10-19 11:00:03 -05:00
Paul Gortmaker	7963b8a598	dlm: audit and remove any unnecessary uses of module.h Historically a lot of these existed because we did not have a distinction between what was modular code and what was providing support to modules via EXPORT_SYMBOL and friends. That changed when we forked out support for the latter into the export.h file. This means we should be able to reduce the usage of module.h in code that is obj-y Makefile or bool Kconfig. In the case of some code where it is modular, we can extend that to also include files that are building basic support functionality but not related to loading or registering the final module; such files also have no need whatsoever for module.h The advantage in removing such instances is that module.h itself sources about 15 other headers; adding significantly to what we feed cpp, and it can obscure what headers we are effectively using. Since module.h might have been the implicit source for init.h (for __init) and for export.h (for EXPORT_SYMBOL) we consider each instance for the presence of either and replace as needed. In the dlm case, we remove module.h from a global header and only introduce it in the files where it is explicitly required, since there is nothing modular in dlm_internal.h itself. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David Teigland <teigland@redhat.com>	2016-10-19 11:00:03 -05:00
Stephen Hemminger	dbef1c0534	dlm: make genl_ops const This table contains function pointers and should be const. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David Teigland <teigland@redhat.com>	2016-10-19 11:00:03 -05:00
Linus Torvalds	63ae602cea	Merge branch 'gup_flag-cleanups' Merge the gup_flags cleanups from Lorenzo Stoakes: "This patch series adjusts functions in the get_user_pages* family such that desired FOLL_* flags are passed as an argument rather than implied by flags. The purpose of this change is to make the use of FOLL_FORCE explicit so it is easier to grep for and clearer to callers that this flag is being used. The use of FOLL_FORCE is an issue as it overrides missing VM_READ/VM_WRITE flags for the VMA whose pages we are reading from/writing to, which can result in surprising behaviour. The patch series came out of the discussion around commit `38e0885465` ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"), which addressed a BUG_ON() being triggered when a page was faulted in with PROT_NONE set but having been overridden by FOLL_FORCE. do_numa_page() was run on the assumption the page _must_ be one marked for NUMA node migration as an actual PROT_NONE page would have been dealt with prior to this code path, however FOLL_FORCE introduced a situation where this assumption did not hold. See https://marc.info/?l=linux-mm&m=147585445805166 for the patch proposal" Additionally, there's a fix for an ancient bug related to FOLL_FORCE and FOLL_WRITE by me. 
[ This branch was rebased recently to add a few more acked-by's and reviewed-by's ] * gup_flag-cleanups: mm: replace access_process_vm() write parameter with gup_flags mm: replace access_remote_vm() write parameter with gup_flags mm: replace __access_remote_vm() write parameter with gup_flags mm: replace get_user_pages_remote() write/force parameters with gup_flags mm: replace get_user_pages() write/force parameters with gup_flags mm: replace get_vaddr_frames() write/force parameters with gup_flags mm: replace get_user_pages_locked() write/force parameters with gup_flags mm: replace get_user_pages_unlocked() write/force parameters with gup_flags mm: remove write/force parameters from __get_user_pages_unlocked() mm: remove write/force parameters from __get_user_pages_locked() mm: remove gup_flags FOLL_WRITE games from __get_user_pages()	2016-10-19 08:39:47 -07:00
Lorenzo Stoakes	6347e8d5bc	mm: replace access_remote_vm() write parameter with gup_flags This removes the 'write' argument from access_remote_vm() and replaces it with 'gup_flags' as use of this function previously silently implied FOLL_FORCE, whereas after this patch callers explicitly pass this flag. We make this explicit as use of FOLL_FORCE can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-19 08:12:14 -07:00
Lorenzo Stoakes	9beae1ea89	mm: replace get_user_pages_remote() write/force parameters with gup_flags This removes the 'write' and 'force' from get_user_pages_remote() and replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-19 08:12:02 -07:00
Linus Torvalds	1a1891d762	This includes fixing a bug which references a wrong pointer, sum_page, in f2fs_gc. It was newly introduced in 4.9-rc1. Merge tag 'for-f2fs-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs bugfix from Jaegeuk Kim: "This fixes a bug which referenced the wrong pointer, sum_page, in f2fs_gc. It was newly introduced in 4.9-rc1. * tag 'for-f2fs-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: fix wrong sum_page pointer in f2fs_gc	2016-10-18 14:15:23 -07:00
Linus Torvalds	e0ed1c22d4	Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "Two fixes: - a file locks fix (missing critical section, bug introduced in this merge window) - an x86 down_write() stack frame annotation" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking, fs/locks: Add missing file_sem locks locking/rwsem/x86: Add stack frame dependency for ____down_write()	2016-10-18 09:04:17 -07:00
Chris Mason	112a3edf4b	Merge branch 'for-chris-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.9	2016-10-18 06:51:33 -07:00
Miklos Szeredi	0ce267ff95	fuse: fix root dentry initialization Add missing dentry initialization to root dentry. Fixes: `f75fdf22b0` ("fuse: don't use ->d_time") Reported-by: Andreas Reis <andreas.reis@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-18 15:36:48 +02:00
Wei Yongjun	5130ccea7c	ceph: fix non static symbol warning Fixes the following sparse warning: fs/ceph/xattr.c:19:28: warning: symbol 'ceph_other_xattr_handler' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-10-18 12:30:32 +02:00
Peter Zijlstra	5f43086bb9	locking, fs/locks: Add missing file_sem locks I overlooked a few code-paths that can lead to locks_delete_global_locks(). Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jeff Layton <jlayton@poochiereds.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Bruce Fields <bfields@fieldses.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-fsdevel@vger.kernel.org Cc: syzkaller <syzkaller@googlegroups.com> Link: http://lkml.kernel.org/r/20161008081228.GF3142@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-10-18 12:21:28 +02:00
Geert Uytterhoeven	31ca587810	ceph: fix uninitialized dentry pointer in ceph_real_mount() fs/ceph/super.c: In function ‘ceph_real_mount’: fs/ceph/super.c:818: warning: ‘root’ may be used uninitialized in this function If s_root is already valid, dentry pointer root is never initialized, and returned by ceph_real_mount(). This will cause a crash later when the caller dereferences the pointer. Fixes: `ce2728aaa8` ("ceph: avoid accessing / when mounting a subpath") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Yan, Zheng <zyan@redhat.com>	2016-10-18 12:10:59 +02:00
Yan, Zheng	f72f94555a	ceph: fix readdir vs fragmentation race The following sequence of events triggers the race: - client readdir frag 0* -> got item 'A' - MDS merges frag 0* and frag 1* - client sends readdir request (frag 1, offset 2, readdir_start 'A') - MDS replies with items (that are after item 'A') in frag Link: http://tracker.ceph.com/issues/17286 Signed-off-by: Yan, Zheng <zyan@redhat.com>	2016-10-18 12:09:58 +02:00
Arnd Bergmann	e952813e21	ext2: avoid bogus -Wmaybe-uninitialized warning On ARM, we get this false-positive warning since the rework of the ext2_get_blocks interface: fs/ext2/inode.c: In function 'ext2_get_block': include/linux/buffer_head.h:340:16: error: 'bno' may be used uninitialized in this function [-Werror=maybe-uninitialized] The calling conventions for this function are rather complex, and it's not surprising that the compiler gets this wrong; I spent a long time trying to understand how it all fits together myself. This change to avoid the warning makes sure the compiler sees that we always set 'bno' pointer whenever we have a positive return code. The transformation is correct because we always arrive at the 'got_it' label with a positive count that gets used as the return value, while any branch to the 'cleanup' label has a negative or zero 'err'. Fixes: `6750ad7198` ("ext2: stop passing buffer_head to ext2_get_blocks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Jan Kara <jack@suse.cz>	2016-10-18 11:29:35 +02:00
Jan Kara	a2ed0b391d	isofs: Do not return EACCES for unknown filesystems When isofs_mount() is called to mount a device read-write, it returns EACCES even before it checks that the device actually contains an isofs filesystem. This may confuse mount(8) which then tries to mount all subsequent filesystem types in read-only mode. Fix the problem by returning EACCES only once we verify that the device indeed contains an iso9660 filesystem. CC: stable@vger.kernel.org Fixes: `17b7f7cf58` Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Reported-by: Karel Zak <kzak@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>	2016-10-18 11:28:21 +02:00
Junjie Mao	14155cafea	btrfs: assign error values to the correct bio structs Fixes: `4246a0b63b` ("block: add a bi_error field to struct bio") Signed-off-by: Junjie Mao <junjie.mao@enight.me> Acked-by: David Sterba <dsterba@suse.cz> Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-17 14:16:14 -07:00
Liu Bo	4547f4d8ff	Btrfs: kill BUG_ON in do_relocation While updating btree, we try to push items between sibling nodes/leaves in order to keep height as low as possible. But we don't memset the original places with zero when pushing items so that we could end up leaving stale content in nodes/leaves. One may read the above stale content by increasing btree blocks' @nritems. One case I've come across is that in fs tree, a leaf has two parent nodes, hence running balance ends up with processing this leaf with two parent nodes, but it can only reach the valid parent node through btrfs_search_slot, so it'd be like, do_relocation for P in all parent nodes of block A: if !P->eb: btrfs_search_slot(key); --> get path from P to A. if lowest: BUG_ON(A->bytenr != bytenr of A recorded in P); btrfs_cow_block(P, A); --> change A's bytenr in P. After btrfs_cow_block, P has the new bytenr of A, but with the same @key, we get the same path again, and get panic by BUG_ON. Note that this is only happening in a corrupted fs, for a regular fs in which we have correct @nritems so that we won't read stale content in any case. Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-17 15:48:40 +02:00
Nikolay Borisov	0d7718f666	ceph: fix error handling in ceph_read_iter In case __ceph_do_getattr returns an error and the retry_op in ceph_read_iter is not READ_INLINE, then it's possible to invoke __free_page on a page which is NULL, this naturally leads to a crash. This can happen when, for example, a process waiting on a MDS reply receives sigterm. Fix this by explicitly checking whether the page is set or not. Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Nikolay Borisov <kernel@kyup.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-10-15 23:28:07 +02:00
Linus Torvalds	df34d04a6f	befs fixes for 4.9-rc1 Merge tag 'befs-v4.9-rc1' of git://github.com/luisbg/linux-befs Pull befs fixes from Luis de Bethencourt: "I recently took maintainership of the befs file system [0]. This is the first time I send you a git pull request, so please let me know if all the below is OK. Salah Triki and myself have been cleaning the code and fixing a few small bugs. Sorry I couldn't send this sooner in the merge window, I was waiting to have my GPG key signed by kernel members at ELCE in Berlin a few days ago." [0] https://lkml.org/lkml/2016/7/27/502 * tag 'befs-v4.9-rc1' of git://github.com/luisbg/linux-befs: (39 commits) befs: befs: fix style issues in datastream.c befs: improve documentation in datastream.c befs: fix typos in datastream.c befs: fix typos in btree.c befs: fix style issues in super.c befs: fix comment style befs: add check for ag_shift in superblock befs: dump inode_size superblock information befs: remove unnecessary initialization befs: fix typo in befs_sb_info befs: add flags field to validate superblock state befs: fix typo in befs_find_key befs: remove unused BEFS_BT_PARMATCH fs: befs: remove ret variable fs: befs: remove in vain variable assignment fs: befs: remove unnecessary *befs_sb variable fs: befs: remove useless initialization to zero fs: befs: remove in vain variable assignment fs: befs: Insert NULL inode to dentry fs: befs: Remove useless calls to brelse in befs_find_brun_dblindirect ...	2016-10-15 12:09:13 -07:00
Linus Torvalds	9ffc66941d	This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals. Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc plugins update from Kees Cook: "This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). 
At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals" * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: latent_entropy: Mark functions with __latent_entropy gcc-plugins: Add latent_entropy plugin	2016-10-15 10:03:15 -07:00
Joe Perches	d74f3d2528	ext4: add missing KERN_CONT to a few more debugging uses Recent commits require line continuing printks to always use pr_cont or KERN_CONT. Add these markings to a few more printks. Miscellaneous: o Integrate the ea_idebug and ea_bdebug macros to use a single call to printk(KERN_DEBUG instead of 3 separate printks o Use the more common varargs macro style Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>	2016-10-15 09:57:31 -04:00
Eric Biggers	8906a8223a	fscrypto: lock inode while setting encryption policy i_rwsem needs to be acquired while setting an encryption policy so that concurrent calls to FS_IOC_SET_ENCRYPTION_POLICY are correctly serialized (especially the ->get_context() + ->set_context() pair), and so that new files cannot be created in the directory during or after the ->empty_dir() check. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Richard Weinberger <richard@nod.at> Cc: stable@vger.kernel.org	2016-10-15 09:48:50 -04:00
Eric Biggers	199625098a	ext4: correct endianness conversion in __xattr_check_inode() It should be cpu_to_le32(), not le32_to_cpu(). No change in behavior. Found with sparse, and this was the only endianness warning in fs/ext4/. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2016-10-15 09:39:31 -04:00
Linus Torvalds	b26b5ef5ec	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more misc uaccess and vfs updates from Al Viro: "The rest of the stuff from -next (more uaccess work) + assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: score: traps: Add missing include file to fix build error fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths fs/super.c: fix race between freeze_super() and thaw_super() overlayfs: Fix setting IOP_XATTR flag iov_iter: kernel-doc import_iovec() and rw_copy_check_uvector() blackfin: no access_ok() for __copy_{to,from}_user() arm64: don't zero in __copy_from_user{,_inatomic} arm: don't zero in __copy_from_user_inatomic()/__copy_from_user() arc: don't leak bits of kernel stack into coredump alpha: get rid of tail-zeroing in __copy_user()	2016-10-14 18:19:05 -07:00
Linus Torvalds	87dbe42a16	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Including: - nine bug fixes for stable. Some of these we found at the recent two weeks of SMB3 test events/plugfests. - significant improvements in reconnection (e.g. if server or network crashes) especially when mounted with "persistenthandles" or to server which advertises Continuous Availability on the share. - a new mount option "idsfromsid" which improves POSIX compatibility in some cases (when winbind not configured e.g.) by better (and faster) fetching uid/gid from acl (when "cifsacl" mount option is enabled). NB: we are almost complete work on "cifsacl" (querying mode/uid/gid from ACL) for SMB3, but SMB3 support for cifsacl is not included in this set. - improved handling for SMB3 "credits" (even if server is buggy) Still working on two sets of changes: - cifsacl enablement for SMB3 - cleanup of RFC1001 length calculation (so we can handle encryption and multichannel and RDMA) And a couple of new bugs were reported recently (unrelated to above) so will probably have another merge request next week" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (21 commits) CIFS: Retrieve uid and gid from special sid if enabled CIFS: Add new mount option to set owner uid and gid from special sids in acl CIFS: Reset read oplock to NONE if we have mandatory locks after reopen CIFS: Fix persistent handles re-opening on reconnect SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup SMB2: Separate Kerberos authentication from SMB2_sess_setup Expose cifs module parameters in sysfs Cleanup missing frees on some ioctls Enable previous version support Do not send SMB3 SET_INFO request if nothing is changing SMB3: Add mount parameter to allow user to override max credits fs/cifs: reopen persistent handles on reconnect Clarify locking of cifs file and tcon structures and make more granular Fix regression which breaks DFS mounting fs/cifs: keep 
guid when assigning fid to fileinfo SMB3: GUIDs should be constructed as random but valid uuids Set previous session id correctly on SMB3 reconnect cifs: Limit the overall credit acquired Display number of credits available Add way to query creation time of file via cifs xattr ...	2016-10-14 17:47:31 -07:00
Linus Torvalds	d3304cadb2	Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Some fixes from Omar and Dave Sterba for our new free space tree. This isn't heavily used yet, but as we move toward making it the new default we wanted to nail down an endian bug" * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: tests: uninline member definitions in free_space_extent btrfs: tests: constify free space extent specs Btrfs: expand free space tree sanity tests to catch endianness bug Btrfs: fix extent buffer bitmap tests on big-endian systems Btrfs: catch invalid free space trees Btrfs: fix mount -o clear_cache,space_cache=v2 Btrfs: fix free space tree bitmaps on big-endian systems	2016-10-14 17:44:56 -07:00
Oleg Nesterov	f1a9622037	fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the false-positives. Now that xfs was fixed by Dave's commit `dbad7c9930` ("xfs: stop holding ILOCK over filldir callbacks") we can remove it and change freeze_super() and thaw_super() to run with s_writers.rw_sem locks held; we add two trivial helpers for that, lockdep_sb_freeze_release() and lockdep_sb_freeze_acquire(). xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any warning from lockdep. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-14 20:41:59 -04:00
Linus Torvalds	1a892b485f	Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "This update contains fixes to the "use mounter's permission to access underlying layers" area, and miscellaneous other fixes and cleanups. No new features this time" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: use vfs_get_link() vfs: add vfs_get_link() helper ovl: use generic_readlink ovl: explain error values when removing acl from workdir ovl: Fix info leak in ovl_lookup_temp() ovl: during copy up, switch to mounter's creds early ovl: lookup: do getxattr with mounter's permission ovl: copy_up_xattr(): use strnlen	2016-10-14 17:23:33 -07:00
Oleg Nesterov	89f39af129	fs/super.c: fix race between freeze_super() and thaw_super() Change thaw_super() to check frozen != SB_FREEZE_COMPLETE rather than frozen == SB_UNFROZEN, otherwise it can race with freeze_super() which drops sb->s_umount after SB_FREEZE_WRITE to preserve the lock ordering. In this case thaw_super() will wrongly call s_op->unfreeze_fs() before it was actually frozen, and call sb_freeze_unlock() which leads to the unbalanced percpu_up_write(). Unfortunately lockdep can't detect this, so this triggers misc BUG_ON()'s in kernel/rcu/sync.c. Reported-and-tested-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-14 20:00:34 -04:00
Vivek Goyal	655042cc14	overlayfs: Fix setting IOP_XATTR flag ovl_fill_super calls ovl_new_inode to create a root inode for the new superblock before initializing sb->s_xattr. This wrongly causes IOP_XATTR to be cleared in i_opflags of the new inode, causing SELinux to log the following message: SELinux: (dev overlay, type overlay) has no xattr support Fix this by initializing sb->s_xattr and similar fields before calling ovl_new_inode. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-14 20:00:34 -04:00
Vegard Nossum	ffecee4f24	iov_iter: kernel-doc import_iovec() and rw_copy_check_uvector() Both import_iovec() and rw_copy_check_uvector() take an array (typically small and on-stack) which is used to hold an iovec array copy from userspace. This is to avoid an expensive memory allocation in the fast path (i.e. few iovec elements). The caller may have to check whether these functions actually used the provided buffer or allocated a new one -- but this differs between the two. Let's just add a kernel doc to clarify what the semantics are for each function. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-14 20:00:34 -04:00
Steve French	3514de3fd5	CIFS: Retrieve uid and gid from special sid if enabled New mount option "idsfromsid" indicates to cifs.ko that it should try to retrieve the uid and gid owner fields from special sids. This patch adds the code to parse the owner sids in the ACL to see if they match, and if so populate the uid and/or gid from them. This is faster than upcalling for them and asking winbind, and is a fairly common case, and is also helpful when cifs.upcall and idmapping is not configured. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2016-10-14 14:22:16 -05:00
Steve French	9593265531	CIFS: Add new mount option to set owner uid and gid from special sids in acl Add "idsfromsid" mount option to indicate to cifs.ko that it should try to retrieve the uid and gid owner fields from special sids in the ACL if present. This first patch just adds the parsing for the mount option. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2016-10-14 14:22:01 -05:00
Linus Torvalds	f34d3606f7	Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - tracepoints for basic cgroup management operations added - kernfs and cgroup path formatting functions updated to behave in the style of strlcpy() - non-critical bug fixes * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() cpuset: fix error handling regression in proc_cpuset_show() cgroup: add tracepoints for basic operations cgroup: make cgroup_path() and friends behave in the style of strlcpy() kernfs: remove kernfs_path_len() kernfs: make kernfs_path*() behave in the style of strlcpy() kernfs: add dummy implementation of kernfs_path_from_node()	2016-10-14 12:18:50 -07:00
David S. Miller	f1f081cef0	RxRPC rewrite Merge tag 'rxrpc-rewrite-20161013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fixes This set of patches contains a bunch of fixes: (1) Fix use of kunmap() after change from kunmap_atomic() within AFS. (2) Don't use ERR_PTR() with an always zero value. (3) Check the right error when using ip6_route_output(). (4) Be consistent about whether call->operation_ID is BE or CPU-E within AFS. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-14 10:44:45 -04:00
Miklos Szeredi	7764235bec	ovl: use vfs_get_link() Resulting in a complete removal of a function basically implementing the inverse of vfs_readlink(). As a bonus, now the proper security hook is also called. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-14 11:16:47 +02:00
Miklos Szeredi	d60874cd58	vfs: add vfs_get_link() helper This helper is for filesystems that want to read the symlink and are better off with the get_link() interface (returning a char *) rather than the readlink() interface (copy into a userspace buffer). Also call the LSM hook for readlink (not get_link) since this is for symlink reading not following. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-14 11:16:47 +02:00
Miklos Szeredi	78a3fa4f32	ovl: use generic_readlink All filesystems that are backers for overlayfs would also use generic_readlink(). Move this logic to the overlay itself, which is a nice cleanup. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-14 11:16:46 +02:00
Miklos Szeredi	cb348edb6b	ovl: explain error values when removing acl from workdir Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-14 11:16:46 +02:00
Linus Torvalds	c4a86165d1	NFS client updates for Linux 4.9 Highlights include: Stable bugfixes: - sunrpc: fix write space race causing stalls - NFS: Fix inode corruption in nfs_prime_dcache() - NFSv4: Don't report revoked delegations as valid in nfs_have_delegation() - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid - NFSv4: Open state recovery must account for file permission changes - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic Features: - Add support for tracking multiple layout types with an ordered list - Add support for using multiple backchannel threads on the client - Add support for pNFS file layout session trunking - Delay xprtrdma use of DMA API (for device driver removal) - Add support for xprtrdma remote invalidation - Add support for larger xprtrdma inline thresholds - Use a scatter/gather list for sending xprtrdma RPC calls - Add support for the CB_NOTIFY_LOCK callback - Improve hashing sunrpc auth_creds by using both uid and gid Bugfixes: - Fix xprtrdma use of DMA API - Validate filenames before adding to the dcache - Fix corruption of xdr->nwords in xdr_copy_to_scratch - Fix setting buffer length in xdr_set_next_buffer() - Don't deadlock the state manager on the SEQUENCE status flags - Various delegation and stateid related fixes - Retry operations if an interrupted slot receives EREMOTEIO - Make nfs boot time y2038 safe Merge tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client updates from Anna Schumaker: "Highlights include: Stable bugfixes: - sunrpc: fix write space race causing stalls - NFS: Fix inode corruption in nfs_prime_dcache() - NFSv4: Don't report revoked delegations as valid in nfs_have_delegation() - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid - NFSv4: Open state recovery must account for file permission changes - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic Features: - Add support for tracking multiple layout types with an ordered list - Add support for using multiple backchannel threads on the client - Add support for pNFS file layout session trunking - Delay xprtrdma use of DMA API (for device driver removal) - Add support for xprtrdma remote invalidation - Add support for larger xprtrdma inline thresholds - Use a scatter/gather list for sending xprtrdma RPC calls - Add support for the CB_NOTIFY_LOCK callback - Improve hashing sunrpc auth_creds by using both uid and gid Bugfixes: - Fix xprtrdma use of DMA API - Validate filenames before adding to the dcache - Fix corruption of xdr->nwords in xdr_copy_to_scratch - Fix setting buffer length in xdr_set_next_buffer() - Don't deadlock the state manager on the SEQUENCE status flags - Various delegation and stateid related fixes - Retry operations if an interrupted slot receives EREMOTEIO - Make nfs boot time y2038 safe" * tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (100 commits) NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic fs: nfs: Make nfs boot time y2038 safe sunrpc: replace generic auth_cred hash with auth-specific function sunrpc: add 
RPCSEC_GSS hash_cred() function sunrpc: add auth_unix hash_cred() function sunrpc: add generic_auth hash_cred() function sunrpc: add hash_cred() function to rpc_authops struct Retry operation on EREMOTEIO on an interrupted slot pNFS: Fix atime updates on pNFS clients sunrpc: queue work on system_power_efficient_wq NFSv4.1: Even if the stateid is OK, we may need to recover the open modes NFSv4: If recovery failed for a specific open stateid, then don't retry NFSv4: Fix retry issues with nfs41_test/free_stateid NFSv4: Open state recovery must account for file permission changes NFSv4: Mark the lock and open stateids as invalid after freeing them NFSv4: Don't test open_stateid unless it is set NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation NFSv4: Fix a race when updating an open_stateid NFSv4: Fix a race in nfs_inode_reclaim_delegation() ...	2016-10-13 21:28:20 -07:00
Linus Torvalds	2778556474	Some RDMA work and some good bugfixes, and two new features that could benefit from user testing: Anna Schumacker contributed a simple NFSv4.2 COPY implementation. COPY is already supported on the client side, so a call to copy_file_range() on a recent client should now result in a server-side copy that doesn't require all the data to make a round trip to the client and back. Jeff Layton implemented callbacks to notify clients when contended locks become available, which should reduce latency on workloads with contended locks. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJX/mcsAAoJECebzXlCjuG+MU0P/3SzTLGYXU5yOTAorx255/uf fUVKQQhTzzaA2xj3gWWWztYx3y0ZJUVgwU56a+Ap5Z8/goqDQ78H+ePEc+MG7BT/ /UXS/bITvt0MP/dvPrDzhSltvqx/wpelLPBo29hGLlAQ2dsnD4Y75IbOOQccWqcC iD2v6x7lnpWZ7j9Zhwzg/JNQHwISIb7tiLoYBjfcdNDEMU76KIyhxD0Cx9MSeBzH 9Rq/oEdwGDFS5WqVfNe2jxbngoauq1IupziQ2eQGv2D/POyXCx8fphoYjDz1XaW8 PxaJtJtM2owPGG+z2CxklJqNaS1Z4F+oppjg+nf4i/ibxmIBaTy8NluASX3vMh69 CDO1+ly+TiF0l1VqMOQJWRnqn1qGk6fLpF6P1Ac62B0oWpeLGU7nmik7XN1ORgsi 8ksxRKNAWeprZo3wl5xNrADu/wlZ7XCJTc4QoHEgYT04aHF+j8EMCHv+mtZ8+Bwn WWiA8iItZOgXV4vitCRJlvsixjYvmF3djPIoI2Lt5KDWIg+eL89sKwzTALSfeC4m Vjb0svzPX1MmZCNP1rCStFbl3gZYXZyqPk+uA6M7H8mjAjVeKxRPowWpMBgvYZHr FjCPb878bAuqCeBVbIyOLLcKWBLTw8PsUWZAor3gNg454JGkMjLUyJ/S22Cz5Nbo HdjoiTJtbPrHnCwTMXwa =nozl -----END PGP SIGNATURE----- Merge tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "Some RDMA work and some good bugfixes, and two new features that could benefit from user testing: - Anna Schumacker contributed a simple NFSv4.2 COPY implementation. COPY is already supported on the client side, so a call to copy_file_range() on a recent client should now result in a server-side copy that doesn't require all the data to make a round trip to the client and back. 
- Jeff Layton implemented callbacks to notify clients when contended locks become available, which should reduce latency on workloads with contended locks" * tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux: NFSD: Implement the COPY call nfsd: handle EUCLEAN nfsd: only WARN once on unmapped errors exportfs: be careful to only return expected errors. nfsd4: setclientid_confirm with unmatched verifier should fail nfsd: randomize SETCLIENTID reply to help distinguish servers nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant nfsd: add a LRU list for blocked locks nfsd: have nfsd4_lock use blocking locks for v4.1+ locks nfsd: plumb in a CB_NOTIFY_LOCK operation NFSD: fix corruption in notifier registration svcrdma: support Remote Invalidation svcrdma: Server-side support for rpcrdma_connect_private rpcrdma: RDMA/CM private message data structure svcrdma: Skip put_page() when send_reply() fails svcrdma: Tail iovec leaves an orphaned DMA mapping nfsd: fix dprintk in nfsd4_encode_getdeviceinfo nfsd: eliminate cb_minorversion field nfsd: don't set a FL_LAYOUT lease for flexfiles layouts	2016-10-13 21:04:42 -07:00
Linus Torvalds	35a891be96	xfs: reflink update for 4.9-rc1 < XFS has gained super CoW powers! > ---------------------------------- \ ^__^ \ (oo)\_______ (__)\ )\/\ \|\|----w \| \|\| \|\| Included in this update: - unshare range (FALLOC_FL_UNSHARE) support for fallocate - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr interface - shared extent support for XFS - copy-on-write support for shared extents - copy_file_range support - clone_file_range support (implements reflink) - dedupe_file_range support - defrag support for reverse mapping enabled filesystems -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJX/hrZAAoJEK3oKUf0dfodpwcQAKkTerNPhhDcthqWUJ2+jC7w JIuhKUg2GYojJhIJ4+Ue1knmuBeIusda+PzGls+6gdy7GDGdux/esRIJSe1W7A5G RNeumiSKVX5iYsZNUEX35O2a/SwUM1Sm5mcIFs4CxUwIRwE/cayNby6vrlVExvz7 Ns6YYOI2bldUHLsxedg8MLG0it1JGTADB9gwGgb98bxQ3bD/UBn3TF9xTlj+ZH22 ebnWsogSJOnUigOOSGeaQsmy1pJAhRIhvt+f481KuZak1pdQcK2feL4RcKw0NpNt 15LCYRqX6RexC684VYgJZxXB4EKyfS2Bma71q41A7dz1x36kw7+wG18xasBqU++p GZwwL6si02rIGPMz1oD8xxZ0F97ADCGRmkgUHsCJKbP5UmGiP08K6GEN3osr5hAN xAmn9AxcprXVnV3WmnFxpBeWY/qCEsvSQqJuKSThYqAilqUc8wN2u5g/eEpE6mmg KEEhzaq5P4ovS/HOIQJWdBu1j5E9Mg2o/ncy87Q6uE+9Fa5AAP6GBWOtGcMwdFSU adbN7dqjgoHMyNHFrmePqyJYtOZ2hZovDlVndxnYysl5ZBfiBEEDISmr+x6KcSlo 3kyOltYQLjEVu1sLOT3COCddn0jt5Lr1QhGeVepnrMlU2E1h4461viCNMDinJRIp OYoMOS+J83G2FEFwgXYM =Sa+Y -----END PGP SIGNATURE----- Merge tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs < XFS has gained super CoW powers! > ---------------------------------- \ ^__^ \ (oo)\_______ (__)\ )\/\ \|\|----w \| \|\| \|\| Pull XFS support for shared data extents from Dave Chinner: "This is the second part of the XFS updates for this merge cycle. This pullreq contains the new shared data extents feature for XFS. 
Given the complexity and size of this change I am expecting - like the addition of reverse mapping last cycle - that there will be some follow-up bug fixes and cleanups around the -rc3 stage for issues that I'm sure will show up once the code hits a wider userbase. What it is: At the most basic level we are simply adding shared data extents to XFS - i.e. a single extent on disk can now have multiple owners. To do this we have to add new on-disk features to both track the shared extents and the number of times they've been shared. This is done by the new "refcount" btree that sits in every allocation group. When we share or unshare an extent, this tree gets updated. Along with this new tree, the reverse mapping tree needs to be updated to track each owner of a shared extent. This also needs to be updated on every share/unshare operation. These interactions at extent allocation and freeing time have complex ordering and recovery constraints, so there's a significant amount of new intent-based transaction code to ensure that operations are performed atomically from both the runtime and integrity/crash recovery perspectives. We also need to break sharing when writes hit a shared extent - this is where the new copy-on-write implementation comes in. We allocate new storage and copy the original data along with the overwrite data into the new location. We only do this for data as we don't share metadata at all - each inode has its own metadata that tracks the shared data extents, the extents undergoing CoW and its own private extents. Of course, being XFS, nothing is simple - we use delayed allocation for CoW similar to how we use it for normal writes. ENOSPC is a significant issue here - we build on the reservation code added in 4.8-rc1 with the reverse mapping feature to ensure we don't get spurious ENOSPC issues part way through a CoW operation. These mechanisms also help minimise fragmentation due to repeated CoW operations. 
To further reduce fragmentation overhead, we've also introduced a CoW extent size hint, which indicates how large a region we should allocate when we execute a CoW operation. With all this functionality in place, we can hook up .copy_file_range, .clone_file_range and .dedupe_file_range and we gain all the capabilities of reflink and other vfs provided functionality that enable manipulation of shared extents. We also added a fallocate mode that explicitly unshares a range of a file, which we implemented as an explicit CoW of all the shared extents in a file. As such, it's a huge chunk of new functionality with new on-disk format features and internal infrastructure. It warns at mount time as an experimental feature and that it may eat data (as we do with all new on-disk features until they stabilise). We have not released userspace support for it yet - userspace support currently requires download from Darrick's xfsprogs repo and build from source, so the access to this feature is really developer/tester only at this point. Initial userspace support will be released at the same time the kernel with this code in it is released. The new code causes 5-6 new failures with xfstests - these aren't serious functional failures but things like the output of tests changing slightly due to perturbations in layouts, space usage, etc. OTOH, we've added 150+ new tests to xfstests that specifically exercise this new functionality so it's got far better test coverage than any functionality we've previously added to XFS. Darrick has done a pretty amazing job getting us to this stage, and special mention also needs to go to Christoph (review, testing, improvements and bug fixes) and Brian (caught several intricate bugs during review) for the effort they've also put in. 
Summary: - unshare range (FALLOC_FL_UNSHARE) support for fallocate - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr interface - shared extent support for XFS - copy-on-write support for shared extents - copy_file_range support - clone_file_range support (implements reflink) - dedupe_file_range support - defrag support for reverse mapping enabled filesystems" * tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits) xfs: convert COW blocks to real blocks before unwritten extent conversion xfs: rework refcount cow recovery error handling xfs: clear reflink flag if setting realtime flag xfs: fix error initialization xfs: fix label inaccuracies xfs: remove isize check from unshare operation xfs: reduce stack usage of _reflink_clear_inode_flag xfs: check inode reflink flag before calling reflink functions xfs: implement swapext for rmap filesystems xfs: refactor swapext code xfs: various swapext cleanups xfs: recognize the reflink feature bit xfs: simulate per-AG reservations being critically low xfs: don't mix reflink and DAX mode for now xfs: check for invalid inode reflink flags xfs: set a default CoW extent size of 32 blocks xfs: convert unwritten status of reverse mappings for shared files xfs: use interval query for rmap alloc operations on shared files xfs: add shared rmap map/unmap/convert log item types xfs: increase log reservations for reflink ...	2016-10-13 20:28:22 -07:00
Pavel Shilovsky	de74025052	CIFS: Reset read oplock to NONE if we have mandatory locks after reopen We are already doing the same thing for an ordinary open case: we can't keep read oplock on a file if we have mandatory byte-range locks because pagereading can conflict with these locks on a server. Fix it by setting oplock level to NONE. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com>	2016-10-13 19:48:59 -05:00
Pavel Shilovsky	f2cca6a7c9	CIFS: Fix persistent handles re-opening on reconnect openFileList of tcon can be changed while cifs_reopen_file() is called, which can lead to unexpected behavior when we return to the loop. Fix this by introducing a temp list for keeping all file handles that need to be reopened. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com>	2016-10-13 19:48:55 -05:00
Sachin Prabhu	166cea4dc3	SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup We split the rawntlmssp authentication into negotiate and authenticate parts. We also clean up the code and add helpers. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2016-10-13 19:48:34 -05:00
Sachin Prabhu	3baf1a7b92	SMB2: Separate Kerberos authentication from SMB2_sess_setup Add helper functions and split Kerberos authentication off SMB2_sess_setup. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>	2016-10-13 19:48:30 -05:00
Germano Percossi	cb978ac8b8	Expose cifs module parameters in sysfs /sys/module/cifs/parameters should display the three other module load time configuration settings for cifs.ko Signed-off-by: Germano Percossi <germano.percossi@citrix.com> Signed-off-by: Steve French <steve.french@primarydata.com>	2016-10-13 19:48:25 -05:00
Steve French	24df1483c2	Cleanup missing frees on some ioctls Cleanup some missing mem frees on some cifs ioctls, and clarify others to make more obvious that no data is returned. CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com>	2016-10-13 19:48:20 -05:00
Steve French	834170c859	Enable previous version support Add ioctl to query previous versions of file Allows listing snapshots on files on SMB3 mounts. Signed-off-by: Steve French <smfrench@gmail.com>	2016-10-13 19:48:11 -05:00
Steve French	18dd8e1a65	Do not send SMB3 SET_INFO request if nothing is changing [CIFS] We had cases where we sent a SMB2/SMB3 setinfo request with all timestamp (and DOS attribute) fields marked as 0 (ie do not change) e.g. on chmod or chown. Signed-off-by: Steve French <steve.french@primarydata.com> CC: Stable <stable@vger.kernel.org>	2016-10-13 19:46:51 -05:00
Linus Torvalds	e3799a210d	Merge git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - a new watchdog pretimeout governor framework - support to upload the firmware on the ziirave_wdt - several fixes and cleanups * git://www.linux-watchdog.org/linux-watchdog: (26 commits) watchdog: imx2_wdt: add pretimeout function support watchdog: softdog: implement pretimeout support watchdog: pretimeout: add pretimeout_available_governors attribute watchdog: pretimeout: add option to select a pretimeout governor in runtime watchdog: pretimeout: add panic pretimeout governor watchdog: pretimeout: add noop pretimeout governor watchdog: add watchdog pretimeout governor framework watchdog: hpwdt: add support for iLO5 fs: compat_ioctl: add pretimeout functions for watchdogs watchdog: add pretimeout support to the core watchdog: imx2_wdt: use preferred BIT macro instead of open coded values watchdog: st_wdt: Remove support for obsolete platforms watchdog: bindings: Remove obsolete platforms from dt doc. watchdog: mt7621_wdt: Remove assignment of dev pointer watchdog: rt2880_wdt: Remove assignment of dev pointer watchdog: constify watchdog_ops structures watchdog: tegra: constify watchdog_ops structures watchdog: iTCO_wdt: constify iTCO_wdt_pm structure watchdog: cadence_wdt: Fix the suspend resume watchdog: txx9wdt: Add missing clock (un)prepare calls for CCF ...	2016-10-13 16:44:20 -07:00
Benjamin Coddington	a3f9d1b58a	pnfs/blocklayout: fix last_write_offset incorrectly set to page boundary Commit `41963c10c4` sets the block layout's last written byte to the offset of the end of the extent rather than the end of the write which incorrectly updates the inode's size for partial-page writes. Fixes: `41963c10c4` ("pnfs/blocklayout: update last_write_offset atomically with extents") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-10-13 16:42:53 -04:00
David Howells	50a2c95381	afs: call->operation_ID sometimes used as __be32 sometimes as u32 call->operation_ID is sometimes used as __be32 and sometimes as u32. Be consistent and settle on using it as u32. Signed-off-by: David Howells <dhowells@redhat.com>	2016-10-13 17:03:52 +01:00
Dan Carpenter	233c9edcca	afs: unmapping the wrong buffer We switched from kmap_atomic() to kmap() so the kunmap() calls need to be updated to match. Fixes: `d001648ec7` ('rxrpc: Don't expose skbs to in-kernel users [ver #2]') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>	2016-10-13 08:33:28 +01:00
Eric Biggers	fb4454376d	fscrypto: make XTS tweak initialization endian-independent The XTS tweak (or IV) was initialized differently on little endian and big endian systems. Because the ciphertext depends on the XTS tweak, it was not possible to use an encrypted filesystem created by a little endian system on a big endian system and vice versa, even if they shared the same PAGE_SIZE. Fix this by always using little endian. This will break hypothetical big endian users of ext4 or f2fs encryption. However, all users we are aware of are little endian, and it's believed that "real" big endian users are unlikely to exist yet. So this might as well be fixed now before it's too late. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2016-10-12 23:30:16 -04:00
Eric Biggers	c4704a4fbe	ext4: do not advertise encryption support when disabled The sysfs file /sys/fs/ext4/features/encryption was present on kernels compiled with CONFIG_EXT4_FS_ENCRYPTION=n. This was misleading because such kernels do not actually support ext4 encryption. Therefore, only provide this file on kernels compiled with CONFIG_EXT4_FS_ENCRYPTION=y. Note: since the ext4 feature files are all hardcoded to have a contents of "supported", it really is the presence or absence of the file that is significant, not the contents (and this change reflects that). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2016-10-12 23:24:51 -04:00
Taesoo Kim	559cce698e	jbd2: fix incorrect unlock on j_list_lock When 'jh->b_transaction == transaction' (asserted by below) J_ASSERT_JH(jh, (jh->b_transaction == transaction \|\| ... 'journal->j_list_lock' will be incorrectly unlocked, since the lock is acquired only at the end of if / else-if statements (missing the else case). Signed-off-by: Taesoo Kim <tsgatesv@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Fixes: `6e4862a5bb` Cc: stable@vger.kernel.org # 3.14+	2016-10-12 23:19:18 -04:00
Joe Perches	651e1c3b15	ext4: super.c: Update logging style using KERN_CONT Recent commits require line-continuing printks to use PR_CONT. Update super.c to use KERN_CONT and use vsprintf extension %pV to avoid a printk/vprintk/printk("\n") sequence as well. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>	2016-10-12 23:12:53 -04:00
Jaegeuk Kim	de0dcc40f6	f2fs: fix wrong sum_page pointer in f2fs_gc This patch fixes using a wrong pointer for sum_page in f2fs_gc. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-12 16:23:36 -07:00
Chris Mason	d9ed71e545	Merge branch 'fst-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.9 Signed-off-by: Chris Mason <clm@fb.com>	2016-10-12 13:16:00 -07:00
Steve French	141891f472	SMB3: Add mount parameter to allow user to override max credits Add mount option "max_credits" to allow setting maximum SMB3 credits to any value from 10 to 64000 (default is 32000). This can be useful to work around servers with problems allocating credits, or to throttle the client to use a smaller amount of simultaneous i/o, or to work around server performance issues. Also adds a cap, so that even if the server granted us more than 65000 credits due to a server bug, we would not use that many. Signed-off-by: Steve French <steve.french@primarydata.com>	2016-10-12 12:08:33 -05:00
Steve French	52ace1ef12	fs/cifs: reopen persistent handles on reconnect Continuous Availability features like persistent handles require that clients reconnect their open files, not just the sessions, soon after the network connection comes back up, otherwise the server will throw away the state (byte range locks, leases, deny modes) on those handles after a timeout. Add code to reconnect handles when use_persistent is set (e.g. Continuous Availability shares) after tree reconnect. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Germano Percossi <germano.percossi@citrix.com> Signed-off-by: Steve French <smfrench@gmail.com>	2016-10-12 12:08:33 -05:00
Steve French	3afca265b5	Clarify locking of cifs file and tcon structures and make more granular Remove the global file_list_lock to simplify cifs/smb3 locking and have spinlocks that more closely match the information they are protecting. Add new tcon->open_file_lock and file->file_info_lock spinlocks. Locks continue to follow a hierarchy, cifs_socket --> cifs_ses --> cifs_tcon --> cifs_file where global tcp_ses_lock still protects socket and cifs_ses, while the newer locks protect the lower level structure's information (tcon and cifs_file respectively). CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <steve.french@primarydata.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Germano Percossi <germano.percossi@citrix.com>	2016-10-12 12:08:32 -05:00
Sachin Prabhu	d171356ff1	Fix regression which breaks DFS mounting Patch `a6b5058` results in -EREMOTE returned by is_path_accessible() in cifs_mount() to be ignored which breaks DFS mounting. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>	2016-10-12 12:08:32 -05:00
Aurelien Aptel	94f8737175	fs/cifs: keep guid when assigning fid to fileinfo When we open a durable handle we give a Globally Unique Identifier (GUID) to the server which we must keep for later reference e.g. when reopening persistent handles on reconnection. Without this the GUID generated for a new persistent handle was lost and 16 zero bytes were used instead on re-opening. Signed-off-by: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>	2016-10-12 12:08:32 -05:00
Steve French	fa70b87cc6	SMB3: GUIDs should be constructed as random but valid uuids GUIDs although random, and 16 bytes, need to be generated as proper uuids. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reported-by: David Goebels <davidgoe@microsoft.com> CC: Stable <stable@vger.kernel.org>	2016-10-12 12:08:32 -05:00
Steve French	c2afb8147e	Set previous session id correctly on SMB3 reconnect Signed-off-by: Steve French <steve.french@primarydata.com> CC: Stable <stable@vger.kernel.org> Reported-by: David Goebel <davidgoe@microsoft.com>	2016-10-12 12:08:31 -05:00
Ross Lagerwall	7d414f396c	cifs: Limit the overall credit acquired The kernel client requests 2 credits for many operations even though they only use 1 credit (presumably to build up a buffer of credit). Some servers seem to give the client as much credit as is requested. In this case, the amount of credit the client has continues increasing to the point where (server->credits * MAX_BUFFER_SIZE) overflows in smb2_wait_mtu_credits(). Fix this by throttling the credit requests if a set limit is reached. For async requests where the credit charge may be > 1, request as much credit as what is charged. The limit is chosen somewhat arbitrarily. The Windows client defaults to 128 credits, the Windows server allows clients up to 512 credits (or 8192 for Windows 2016), and the NetApp server (and at least one other) does not limit clients at all. Choose a high enough value such that the client shouldn't limit performance. This behavior was seen with a NetApp filer (NetApp Release 9.0RC2). Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <smfrench@gmail.com>	2016-10-12 12:08:31 -05:00
Steve French	9742805d6b	Display number of credits available In debugging smb3, it is useful to display the number of credits available, so we can see when the server has not granted sufficient operations for the client to make progress, or alternatively the client has requested too many credits (as we saw in a recent bug) so we can compare with the number of credits the server thinks we have. Add a /proc/fs/cifs/DebugData line to display the client view on how many credits are available. Signed-off-by: Steve French <steve.french@primarydata.com> Reported-by: Germano Percossi <germano.percossi@citrix.com> CC: Stable <stable@vger.kernel.org>	2016-10-12 12:08:31 -05:00
Steve French	6609804413	Add way to query creation time of file via cifs xattr Add parsing for new pseudo-xattr user.cifs.creationtime file attribute to allow backup and test applications to view birth time of file on cifs/smb3 mounts. Signed-off-by: Steve French <steve.french@primarydata.com>	2016-10-12 12:08:31 -05:00
Steve French	a958fff242	Add way to query file attributes via cifs xattr Add parsing for new pseudo-xattr user.cifs.dosattrib file attribute so tools can recognize what kind of file it is, and verify if common SMB3 attributes (system, hidden, archive, sparse, indexed etc.) are set. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>	2016-10-12 12:08:30 -05:00
Filipe Manana	d5e84fd8d0	Btrfs: fix incremental send failure caused by balance Commit `951555856b` ("Btrfs: send, don't bug on inconsistent snapshots") removed some BUG_ON() statements (replacing them with returning errors to user space and logging error messages) when a snapshot is in an inconsistent state due to failures to update a delayed inode item (ENOMEM or ENOSPC) after adding/updating/deleting references, xattrs or file extent items. However there is a case, when no errors happen, where a file extent item can be modified without having the corresponding inode item updated. This case happens during balance under very specific timings, when relocation is in the stage where it updates data pointers and a leaf that contains file extent items is COWed. When that happens file extent items get their disk_bytenr field updated to a new value that reflects the post relocation logical address of the extent, without updating their respective inode items (as there is nothing that needs to be updated on them). This is performed at relocation.c:replace_file_extents() through relocation.c:btrfs_reloc_cow_block(). So make an incremental send deal with this case and don't do any processing for a file extent item that got its disk_bytenr field updated by relocation, since the extent's data is the same as the one pointed by the file extent item in the parent snapshot. After the recent commit mentioned above this case resulted in EIO errors returned to user space (and an error message logged to dmesg/syslog) when doing an incremental send, while before it, it resulted in hitting a BUG_ON leading to the following trace: [ 952.206705] ------------[ cut here ]------------ [ 952.206714] kernel BUG at ../fs/btrfs/send.c:5653! 
[ 952.206719] Internal error: Oops - BUG: 0 [#1] SMP [ 952.209854] Modules linked in: st dm_mod nls_utf8 isofs fuse nf_log_ipv6 xt_pkttype xt_physdev br_netfilter nf_log_ipv4 nf_log_common xt_LOG xt_limit ebtable_filter ebtables af_packet bridge stp llc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables xfs libcrc32c nls_iso8859_1 nls_cp437 vfat fat joydev aes_ce_blk ablk_helper cryptd snd_intel8x0 aes_ce_cipher snd_ac97_codec ac97_bus snd_pcm ghash_ce sha2_ce sha1_ce snd_timer snd virtio_net soundcore btrfs xor sr_mod cdrom hid_generic usbhid raid6_pq virtio_blk virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_mmio xhci_pci xhci_hcd usbcore usb_common virtio_pci virtio_ring virtio drm sg efivarfs [ 952.228333] Supported: Yes [ 952.228908] CPU: 0 PID: 12779 Comm: snapperd Not tainted 4.4.14-50-default #1 [ 952.230329] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 952.231683] task: ffff800058e94100 ti: ffff8000d866c000 task.ti: ffff8000d866c000 [ 952.233279] PC is at changed_cb+0x9f4/0xa48 [btrfs] [ 952.234375] LR is at changed_cb+0x58/0xa48 [btrfs] [ 952.236552] pc : [<ffff7ffffc39de7c>] lr : [<ffff7ffffc39d4e0>] pstate: 80000145 [ 952.238049] sp : ffff8000d866fa20 [ 952.238732] x29: ffff8000d866fa20 x28: 0000000000000019 [ 952.239840] x27: 00000000000028d5 x26: 00000000000024a2 [ 952.241008] x25: 0000000000000002 x24: ffff8000e66e92f0 [ 952.242131] x23: ffff8000b8c76800 x22: ffff800092879140 [ 952.243238] x21: 0000000000000002 x20: ffff8000d866fb78 [ 952.244348] x19: ffff8000b8f8c200 x18: 0000000000002710 [ 952.245607] x17: 0000ffff90d42480 x16: ffff800000237dc0 [ 952.246719] x15: 0000ffff90de7510 x14: ab000c000a2faf08 [ 952.247835] x13: 0000000000577c2b x12: ab000c000b696665 
[ 952.248981] x11: 2e65726f632f6966 x10: 652d34366d72612f [ 952.250101] x9 : 32627572672f746f x8 : ab000c00092f1671 [ 952.251352] x7 : 8000000000577c2b x6 : ffff800053eadf45 [ 952.252468] x5 : 0000000000000000 x4 : ffff80005e169494 [ 952.253582] x3 : 0000000000000004 x2 : ffff8000d866fb78 [ 952.254695] x1 : 000000000003e2a3 x0 : 000000000003e2a4 [ 952.255803] [ 952.256150] Process snapperd (pid: 12779, stack limit = 0xffff8000d866c020) [ 952.257516] Stack: (0xffff8000d866fa20 to 0xffff8000d8670000) [ 952.258654] fa20: ffff8000d866fae0 ffff7ffffc308fc0 ffff800092879140 ffff8000e66e92f0 [ 952.260219] fa40: 0000000000000035 ffff800055de6000 ffff8000b8c76800 ffff8000d866fb78 [ 952.261745] fa60: 0000000000000002 00000000000024a2 00000000000028d5 0000000000000019 [ 952.263269] fa80: ffff8000d866fae0 ffff7ffffc3090f0 ffff8000d866fae0 ffff7ffffc309128 [ 952.264797] faa0: ffff800092879140 ffff8000e66e92f0 0000000000000035 ffff800055de6000 [ 952.268261] fac0: ffff8000b8c76800 ffff8000d866fb78 0000000000000002 0000000000001000 [ 952.269822] fae0: ffff8000d866fbc0 ffff7ffffc39ecfc ffff8000b8f8c200 ffff8000b8f8c368 [ 952.271368] fb00: ffff8000b8f8c378 ffff800055de6000 0000000000000001 ffff8000ecb17500 [ 952.272893] fb20: ffff8000b8c76800 ffff800092879140 ffff800062b6d000 ffff80007a9e2470 [ 952.274420] fb40: ffff8000b8f8c208 0000000005784000 ffff8000580a8000 ffff8000b8f8c200 [ 952.276088] fb60: ffff7ffffc39d488 00000002b8f8c368 0000000000000000 000000000003e2a4 [ 952.280275] fb80: 000000000000006c ffff7ffffc39ec00 000000000003e2a4 000000000000006c [ 952.283219] fba0: ffff8000b8f8c300 0000000000000100 0000000000000001 ffff8000ecb17500 [ 952.286166] fbc0: ffff8000d866fcd0 ffff7ffffc3643c0 ffff8000f8842700 0000ffff8ffe9278 [ 952.289136] fbe0: 0000000040489426 ffff800055de6000 0000ffff8ffe9278 0000000040489426 [ 952.292083] fc00: 000000000000011d 000000000000001d ffff80007a9e4598 ffff80007a9e43e8 [ 952.294959] fc20: ffff8000b8c7693f 0000000000003b24 0000000000000019 ffff8000b8f8c218 
[ 952.301161] fc40: 00000001d866fc70 ffff8000b8c76800 0000000000000128 ffffffffffffff84 [ 952.305749] fc60: ffff800058e941ff 0000000000003a58 ffff8000d866fcb0 ffff8000000f7390 [ 952.308875] fc80: 000000000000012a 0000000000010290 ffff8000d866fc00 000000000000007b [ 952.311915] fca0: 0000000000010290 ffff800046c1b100 74732d7366727462 000001006d616572 [ 952.314937] fcc0: ffff8000fffc4100 cb88537fdc8ba60e ffff8000d866fe10 ffff8000002499e8 [ 952.318008] fce0: 0000000040489426 ffff8000f8842700 0000ffff8ffe9278 ffff80007a9e4598 [ 952.321321] fd00: 0000ffff8ffe9278 0000000040489426 000000000000011d 000000000000001d [ 952.324280] fd20: ffff80000072c000 ffff8000d866c000 ffff8000d866fda0 ffff8000000e997c [ 952.327156] fd40: ffff8000fffc4180 00000000000031ed ffff8000fffc4180 ffff800046c1b7d4 [ 952.329895] fd60: 0000000000000140 0000ffff907ea170 000000000000011d 00000000000000dc [ 952.334641] fd80: ffff80000072c000 ffff8000d866c000 0000000000000000 0000000000000002 [ 952.338002] fda0: ffff8000d866fdd0 ffff8000000ebacc ffff800046c1b080 ffff800046c1b7d4 [ 952.340724] fdc0: ffff8000d866fdf0 ffff8000000db67c 0000000000000040 ffff800000e69198 [ 952.343415] fde0: 0000ffff8ffea790 00000000000031ed ffff8000d866fe20 ffff800000254000 [ 952.346101] fe00: 000000000000001d 0000000000000004 ffff8000d866fe90 ffff800000249d3c [ 952.348980] fe20: ffff8000f8842700 0000000000000000 ffff8000f8842701 0000000000000008 [ 952.351696] fe40: ffff8000d866fe70 0000000000000008 ffff8000d866fe90 ffff800000249cf8 [ 952.354387] fe60: ffff8000f8842700 0000ffff8ffe9170 ffff8000f8842701 0000000000000008 [ 952.357083] fe80: 0000ffff8ffe9278 ffff80008ff85500 0000ffff8ffe90c0 ffff800000085c84 [ 952.359800] fea0: 0000000000000000 0000ffff8ffe9170 ffffffffffffffff 0000ffff90d473bc [ 952.365351] fec0: 0000000000000000 0000000000000015 0000000000000008 0000000040489426 [ 952.369550] fee0: 0000ffff8ffe9278 0000ffff907ea790 0000ffff907ea170 0000ffff907ea790 [ 952.372416] ff00: 0000ffff907ea170 0000000000000000 
000000000000001d 0000000000000004 [ 952.375223] ff20: 0000ffff90a32220 00000000003d0f00 0000ffff907ea0a0 0000ffff8ffe8f30 [ 952.378099] ff40: 0000ffff9100f554 0000ffff91147000 0000ffff91117bc0 0000ffff90d473b0 [ 952.381115] ff60: 0000ffff9100f620 0000ffff880069b0 0000ffff8ffe9170 0000ffff8ffe91a0 [ 952.384003] ff80: 0000ffff8ffe9160 0000ffff8ffe9140 0000ffff88006990 0000ffff8ffe9278 [ 952.386860] ffa0: 0000ffff88008a60 0000ffff8ffe9480 0000ffff88014ca0 0000ffff8ffe90c0 [ 952.389654] ffc0: 0000ffff910be8e8 0000ffff8ffe90c0 0000ffff90d473bc 0000000000000000 [ 952.410986] ffe0: 0000000000000008 000000000000001d 6e2079747265706f 72616d223d656d61 [ 952.415497] Call trace: [ 952.417403] [<ffff7ffffc39de7c>] changed_cb+0x9f4/0xa48 [btrfs] [ 952.420023] [<ffff7ffffc308fc0>] btrfs_compare_trees+0x500/0x6b0 [btrfs] [ 952.422759] [<ffff7ffffc39ecfc>] btrfs_ioctl_send+0xb4c/0xe10 [btrfs] [ 952.425601] [<ffff7ffffc3643c0>] btrfs_ioctl+0x374/0x29a4 [btrfs] [ 952.428031] [<ffff8000002499e8>] do_vfs_ioctl+0x33c/0x600 [ 952.430360] [<ffff800000249d3c>] SyS_ioctl+0x90/0xa4 [ 952.432552] [<ffff800000085c84>] el0_svc_naked+0x38/0x3c [ 952.434803] Code: 2a1503e0 17fffdac b9404282 17ffff28 (d4210000) [ 952.437457] ---[ end trace 9afd7090c466cf15 ]--- Signed-off-by: Filipe Manana <fdmanana@suse.com>	2016-10-12 10:41:01 +01:00
Linus Torvalds	a379f71a30	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: - a few block updates that fell in my lap - lib/ updates - checkpatch - autofs - ipc - a ton of misc other things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits) mm: split gfp_mask and mapping flags into separate fields fs: use mapping_set_error instead of opencoded set_bit treewide: remove redundant #include <linux/kconfig.h> hung_task: allow hung_task_panic when hung_task_warnings is 0 kthread: add kerneldoc for kthread_create() kthread: better support freezable kthread workers kthread: allow to modify delayed kthread work kthread: allow to cancel kthread work kthread: initial support for delayed kthread work kthread: detect when a kthread work is used by more workers kthread: add kthread_destroy_worker() kthread: add kthread_create_worker*() kthread: allow to call __kthread_create_on_node() with va_list args kthread/smpboot: do not park in kthread_create_on_cpu() kthread: kthread worker API cleanup kthread: rename probe_kthread_data() to kthread_probe_data() scripts/tags.sh: enable code completion in VIM mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping kdump, vmcoreinfo: report memory sections virtual addresses ipc/sem.c: add cond_resched in exit_sme ...	2016-10-11 17:34:10 -07:00
Michal Hocko	5114a97a8b	fs: use mapping_set_error instead of opencoded set_bit The mapping_set_error() helper sets the correct AS_ flag for the mapping so there is no reason to open code it. Use the helper directly. [akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO] Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:33 -07:00
Masahiro Yamada	97139d4a6f	treewide: remove redundant #include <linux/kconfig.h> Kernel source files need not include <linux/kconfig.h> explicitly because the top Makefile forces to include it with: -include $(srctree)/include/linux/kconfig.h This commit removes explicit includes except the following: * arch/s390/include/asm/facilities_src.h * tools/testing/radix-tree/linux/kernel.h These two are used for host programs. Link: http://lkml.kernel.org/r/1473656164-11929-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:33 -07:00
Michael Kerrisk (man-pages)	086e774a57	pipe: cap initial pipe capacity according to pipe-max-size limit This is a patch that provides behavior that is more consistent, and probably less surprising to users. I consider the change optional, and welcome opinions about whether it should be applied. By default, pipes are created with a capacity of 64 kiB. However, /proc/sys/fs/pipe-max-size may be set smaller than this value. In this scenario, an unprivileged user could thus create a pipe whose initial capacity exceeds the limit. Therefore, it seems logical to cap the initial pipe capacity according to the value of pipe-max-size. The test program shown earlier in this patch series can be used to demonstrate the effect of the change brought about with this patch: # cat /proc/sys/fs/pipe-max-size 1048576 # sudo -u mtk ./test_F_SETPIPE_SZ 1 Initial pipe capacity: 65536 # echo 10000 > /proc/sys/fs/pipe-max-size # cat /proc/sys/fs/pipe-max-size 16384 # sudo -u mtk ./test_F_SETPIPE_SZ 1 Initial pipe capacity: 16384 # ./test_F_SETPIPE_SZ 1 Initial pipe capacity: 65536 The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size caps the initial allocation for a new pipe for unprivileged users, but not for privileged users. Link: http://lkml.kernel.org/r/31dc7064-2a17-9c5b-1df1-4e3012ee992c@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages)	9c87bcf0a3	pipe: make account_pipe_buffers() return a value, and use it This is an optional patch, to provide a small performance improvement. Alter account_pipe_buffers() so that it returns the new value in user->pipe_bufs. This means that we can refactor too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to avoid the costs of repeated use of atomic_long_read() to get the value user->pipe_bufs. Link: http://lkml.kernel.org/r/93e5f193-1e5e-3e1f-3a20-eae79b7e1310@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages)	a005ca0e68	pipe: fix limit checking in alloc_pipe_info() The limit checking in alloc_pipe_info() (used by pipe(2) and when opening a FIFO) has the following problems: (1) When checking capacity required for the new pipe, the checks against the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing consumption, and exclude the memory required for the new pipe capacity. As a consequence: (1) the memory allocation throttling provided by the soft limit does not kick in quite as early as it should, and (2) the user can overrun the hard limit. (2) As currently implemented, accounting and checking against the limits is done as follows: (a) Test whether the user has exceeded the limit. (b) Make new pipe buffer allocation. (c) Account new allocation against the limits. This is racey. Multiple processes may pass point (a) simultaneously, and then allocate pipe buffers that are accounted for only in step (c). The race means that the user's pipe buffer allocation could be pushed over the limit (by an arbitrary amount, depending on how unlucky we were in the race). [Thanks to Vegard Nossum for spotting this point, which I had missed.] This patch addresses the above problems as follows: * Alter the checks against limits to include the memory required for the new pipe. * Re-order the accounting step so that it precedes the buffer allocation. If the accounting step determines that a limit has been reached, revert the accounting and cause the operation to fail. 
Link: http://lkml.kernel.org/r/8ff3e9f9-23f6-510c-644f-8e70cd1c0bd9@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:32 -07:00
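The reordering described in the commit above (account first, then check, then revert on failure) can be sketched in user space as follows. This is a minimal illustration and not the kernel code: `pipe_bufs`, `account_pages()` and `reserve_pages()` are hypothetical names, C11 atomics stand in for the kernel's atomic_long helpers, and the exact comparison against the limit is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-user page counter. */
static atomic_long pipe_bufs;

/* Add 'pages' (may be negative) to the counter and return the new total. */
static long account_pages(long pages)
{
    return atomic_fetch_add(&pipe_bufs, pages) + pages;
}

/*
 * Reserve 'pages' against 'hard_limit' (0 means "no limit").
 * The reservation is accounted before the limit is tested, so two
 * concurrent callers cannot both pass the check on a stale total.
 * Returns true if the reservation stands, false after reverting it.
 */
static bool reserve_pages(long pages, long hard_limit)
{
    assert(pages >= 0);

    long total = account_pages(pages);
    if (hard_limit && total > hard_limit) {
        account_pages(-pages);   /* revert: the operation fails */
        return false;
    }
    return true;                 /* caller may now allocate buffers */
}
```

Because the counter is bumped before the comparison, a caller that loses the race sees a total that already includes the other caller's pages, which is the property the old check-allocate-account order lacked.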
Michael Kerrisk (man-pages)	09b4d19900	pipe: simplify logic in alloc_pipe_info() Replace an 'if' block that covers most of the code in this function with a 'goto'. This makes the code a little simpler to read, and also simplifies the next patch (fix limit checking in alloc_pipe_info()) Link: http://lkml.kernel.org/r/aef030c1-0257-98a9-4988-186efa48530c@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:32 -07:00
Michael Kerrisk (man-pages)	b0b91d18e2	pipe: fix limit checking in pipe_set_size() The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ)) has the following problems: (1) When increasing the pipe capacity, the checks against the limits in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing consumption, and exclude the memory required for the increased pipe capacity. The new increase in pipe capacity can then push the total memory used by the user for pipes (possibly far) over a limit. This can also trigger the problem described next. (2) The limit checks are performed even when the new pipe capacity is less than the existing pipe capacity. This can lead to problems if a user sets a large pipe capacity, and then the limits are lowered, with the result that the user will no longer be able to decrease the pipe capacity. (3) As currently implemented, accounting and checking against the limits is done as follows: (a) Test whether the user has exceeded the limit. (b) Make new pipe buffer allocation. (c) Account new allocation against the limits. This is racey. Multiple processes may pass point (a) simultaneously, and then allocate pipe buffers that are accounted for only in step (c). The race means that the user's pipe buffer allocation could be pushed over the limit (by an arbitrary amount, depending on how unlucky we were in the race). [Thanks to Vegard Nossum for spotting this point, which I had missed.] This patch addresses the above problems as follows: * Perform checks against the limits only when increasing a pipe's capacity; an unprivileged user can always decrease a pipe's capacity. * Alter the checks against limits to include the memory required for the new pipe capacity. * Re-order the accounting step so that it precedes the buffer allocation. If the accounting step determines that a limit has been reached, revert the accounting and cause the operation to fail. 
The program below can be used to demonstrate problems 1 and 2, and the effect of the fix. The program takes one or more command-line arguments. The first argument specifies the number of pipes that the program should create. The remaining arguments are, alternately, pipe capacities that should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in seconds) between the fcntl() operations. (The sleep intervals allow the possibility to change the limits between fcntl() operations.) Problem 1 ========= Using the test program on an unpatched kernel, we first set some limits: # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB Then show that we can set a pipe with capacity (100MB) that is over the hard limit # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 Initial pipe capacity: 65536 Loop 1: set pipe capacity to 100000000 bytes F_SETPIPE_SZ returned 134217728 Now set the capacity to 100MB twice. The second call fails (which is probably surprising to most users, since it seems like a no-op): # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000 Initial pipe capacity: 65536 Loop 1: set pipe capacity to 100000000 bytes F_SETPIPE_SZ returned 134217728 Loop 2: set pipe capacity to 100000000 bytes Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted With a patched kernel, setting a capacity over the limit fails at the first attempt: # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 Initial pipe capacity: 65536 Loop 1: set pipe capacity to 100000000 bytes Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted There is a small chance that the change to fix this problem could break user-space, since there are cases where fcntl(F_SETPIPE_SZ) calls that previously succeeded might fail. 
However, the chances are small, since (a) the pipe-user-pages-{soft,hard} limits are new (in 4.5), and (b) the default soft/hard limits are high/unlimited. Therefore, it seems warranted to make these limits operate more precisely (and behave more like what users probably expect). Problem 2 ========= Running the test program on an unpatched kernel, we first set some limits: # getconf PAGESIZE 4096 # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe, first setting a pipe capacity (10MB), sleeping for a few seconds, during which time the hard limit is lowered, and then set pipe capacity to a smaller amount (5MB): # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 & [1] 748 # Initial pipe capacity: 65536 Loop 1: set pipe capacity to 10000000 bytes F_SETPIPE_SZ returned 16777216 Sleeping 15 seconds # echo 1000 > /proc/sys/fs/pipe-user-pages-hard # 4.096 MB # Loop 2: set pipe capacity to 5000000 bytes Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted In this case, the user should be able to lower the pipe capacity.
With a kernel that has the patch below, the second fcntl() succeeds: # echo 0 > /proc/sys/fs/pipe-user-pages-soft # echo 1000000000 > /proc/sys/fs/pipe-max-size # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 & [1] 3215 # Initial pipe capacity: 65536 # Loop 1: set pipe capacity to 10000000 bytes F_SETPIPE_SZ returned 16777216 Sleeping 15 seconds # echo 1000 > /proc/sys/fs/pipe-user-pages-hard # Loop 2: set pipe capacity to 5000000 bytes F_SETPIPE_SZ returned 8388608 8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x--- /* test_F_SETPIPE_SZ.c (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity and interactions with limits defined by /proc/sys/fs/pipe-* files. */ #define _GNU_SOURCE #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> int main(int argc, char *argv[]) { int (*pfd)[2]; int npipes; int pcap, rcap; int j, p, s, stime, loop; if (argc < 2) { fprintf(stderr, "Usage: %s num-pipes " "[pipe-capacity sleep-time]...\n", argv[0]); exit(EXIT_FAILURE); } npipes = atoi(argv[1]); pfd = calloc(npipes, sizeof (int [2])); if (pfd == NULL) { perror("calloc"); exit(EXIT_FAILURE); } for (j = 0; j < npipes; j++) { if (pipe(pfd[j]) == -1) { fprintf(stderr, "Loop %d: pipe() failed: ", j); perror("pipe"); exit(EXIT_FAILURE); } } printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ)); for (j = 2; j < argc; j += 2 ) { loop = j / 2; pcap = atoi(argv[j]); printf(" Loop %d: set pipe capacity to %d bytes\n", loop, pcap); for (p = 0; p < npipes; p++) { s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap); if (s == -1) { fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ " "failed: ", loop, p); perror("fcntl"); exit(EXIT_FAILURE); } if (p == 0) { printf(" F_SETPIPE_SZ returned %d\n", s); rcap = s; } else { if (s != rcap) { fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ " "unexpected return: %d\n", loop, p, 
s); exit(EXIT_FAILURE); } } stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0; if (stime > 0) { printf(" Sleeping %d seconds\n", stime); sleep(stime); } } } exit(EXIT_SUCCESS); } 8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x--- Patch history: v2 * Switch order of test in 'if' statement to avoid function call (to capability()) in normal path. [This is a fix to a preexisting wart in the code. Thanks to Willy Tarreau] * Perform (size > pipe_max_size) check before calling account_pipe_buffers(). [Thanks to Vegard Nossum] Quoting Vegard: The potential problem happens if the user passes a very large number which will overflow pipe->user->pipe_bufs. On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX then round_pipe_size() returns INT_MAX. Although it's true that the accounting is done in terms of pages and not bytes, so you'd need on the order of (1 << 13) = 8192 processes hitting the limit at the same time in order to make it overflow, which seems a bit unlikely. (See https://lkml.org/lkml/2016/8/12/215 for another discussion on the limit checking) Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:32 -07:00
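The transcripts in the commit above show F_SETPIPE_SZ returning a larger value than was requested (100000000 gives 134217728, 10000000 gives 16777216, 5000000 gives 8388608). The sketch below reproduces those numbers by rounding a request up to the next power-of-two size of at least one page. `round_capacity()` is a hypothetical helper written for illustration, not the kernel's round_pipe_size(), and the 4096-byte page size is taken from the `getconf PAGESIZE` output quoted in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Round 'size' up to a power-of-two byte count that is at least one page.
 * 'page_size' must itself be a power of two. If 'size' exceeds the largest
 * representable power of two, that largest power of two is returned.
 */
static size_t round_capacity(size_t size, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    size_t cap = page_size;
    while (cap < size && cap <= SIZE_MAX / 2)
        cap <<= 1;
    return cap;
}
```

With 4096-byte pages, the 134217728-byte result corresponds to 32768 pages, which is well above the 10000-page (40.96 MB) hard limit set in the Problem 1 transcript.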
Michael Kerrisk (man-pages)	3734a13b96	pipe: refactor argument for account_pipe_buffers() This is a preparatory patch for following work. account_pipe_buffers() performs accounting in the 'user_struct'. There is no need to pass a pointer to a 'pipe_inode_info' struct (which is then dereferenced to obtain a pointer to the 'user' field). Instead, pass a pointer directly to the 'user_struct'. This change is needed in preparation for a subsequent patch that fixes the limit checking in alloc_pipe_info() (and the resulting code is a little more logical). Link: http://lkml.kernel.org/r/7277bf8c-a6fc-4a7d-659c-f5b145c981ab@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Michael Kerrisk (man-pages)	d37d416664	pipe: move limit checking logic into pipe_set_size() This is a preparatory patch for following work. Move the F_SETPIPE_SZ limit-checking logic from pipe_fcntl() into pipe_set_size(). This simplifies the code a little, and allows for reworking required in a later patch that fixes the limit checking in pipe_set_size() Link: http://lkml.kernel.org/r/3701b2c5-2c52-2c3e-226d-29b9deb29b50@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Michael Kerrisk (man-pages)	f491bd7111	pipe: relocate round_pipe_size() above pipe_set_size() Patch series "pipe: fix limit handling", v2. When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged users are exceeding limits on memory consumption. While documenting and testing the operation of these limits I noticed that, as currently implemented, these checks have a number of problems: (1) When increasing the pipe capacity, the checks against the limits in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing consumption, and exclude the memory required for the increased pipe capacity. The new increase in pipe capacity can then push the total memory used by the user for pipes (possibly far) over a limit. This can also trigger the problem described next. (2) The limit checks are performed even when the new pipe capacity is less than the existing pipe capacity. This can lead to problems if a user sets a large pipe capacity, and then the limits are lowered, with the result that the user will no longer be able to decrease the pipe capacity. (3) As currently implemented, accounting and checking against the limits is done as follows: (a) Test whether the user has exceeded the limit. (b) Make new pipe buffer allocation. (c) Account new allocation against the limits. This is racey. Multiple processes may pass point (a) simultaneously, and then allocate pipe buffers that are accounted for only in step (c). The race means that the user's pipe buffer allocation could be pushed over the limit (by an arbitrary amount, depending on how unlucky we were in the race). [Thanks to Vegard Nossum for spotting this point, which I had missed.] This patch series addresses these three problems. This patch (of 8): This is a minor preparatory patch. After subsequent patches, round_pipe_size() will be called from pipe_set_size(), so place round_pipe_size() above pipe_set_size(). 
Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Willy Tarreau <w@1wt.eu> Cc: <socketpair@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	fcc24534b0	autofs: refactor ioctl fn vector in lookup_dev_ioctl() cmd part of this struct is the same as an index of itself within _ioctls[]. In fact this cmd is unused, so we can drop this part. Link: http://lkml.kernel.org/r/20160831033414.9910.66697.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	962ca7cfbd	autofs: remove possibly misleading /* #define DEBUG */ Having this in autofs_i.h gives the illusion that uncommenting this enables pr_debug(), but it doesn't enable all the pr_debug() in autofs because inclusion order matters. XFS has the same DEBUG macro in its core header fs/xfs/xfs.h, however XFS seems to have a rule to include this prior to other XFS headers as well as kernel headers. This is not the case with autofs, and DEBUG could be enabled via Makefile, so autofs should just get rid of this comment to make the code less confusing. It's a comment, so there is literally no functional difference. Link: http://lkml.kernel.org/r/20160831033409.9910.77067.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	390855547c	autofs: fix print format for ioctl warning message All other warnings use "cmd(0x%08x)" and this is the only one with "cmd(%d)". (below comes from my userspace debug program, but not automount daemon) [ 1139.905676] autofs4:pid:1640:check_dev_ioctl_version: ioctl control interface version mismatch: kernel(1.0), user(0.0), cmd(-1072131215) Link: http://lkml.kernel.org/r/20160812024851.12352.75458.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
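To see why the commit above prefers "cmd(0x%08x)" over "cmd(%d)", the sketch below formats the value from the quoted warning, -1072131215, as an unsigned 32-bit hexadecimal number. `format_cmd()` is a hypothetical user-space helper that only illustrates the format string; it is not autofs code.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "cmd(0x........)" into 'buf' using the format the other autofs
 * warnings use. Returns the number of characters snprintf() would write.
 */
static int format_cmd(char *buf, size_t len, unsigned int cmd)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "cmd(0x%08x)", cmd);
}
```

Passing `(unsigned int)-1072131215` yields `cmd(0xc0189371)`, in which the individual bytes of the ioctl number are visible, whereas the signed decimal form hides them.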
Ian Kent	d9e1923207	autofs: add autofs_dev_ioctl_version() for AUTOFS_DEV_IOCTL_VERSION_CMD No functional changes, based on the following justification. 1. Make the code more consistent using the ioctl vector _ioctls[], rather than assigning NULL only for this ioctl command. 2. Remove goto done; for better maintainability in the long run. 3. The existing code is based on the fact that validate_dev_ioctl() sets ioctl version for any command, but AUTOFS_DEV_IOCTL_VERSION_CMD should explicitly set it regardless of the default behavior. Link: http://lkml.kernel.org/r/20160812024846.12352.9885.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Ian Kent	aa8419367b	autofs: fix dev ioctl number range check The count of miscellaneous device ioctls in fs/autofs4/autofs_i.h is wrong. The number of ioctls is the difference between AUTOFS_DEV_IOCTL_VERSION_CMD and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD (14) not the difference between AUTOFS_IOC_COUNT and 11 (21). [kusumi.tomohiro@gmail.com: fix typo that made the count macro negative] Link: http://lkml.kernel.org/r/20160831033420.9910.16809.stgit@pluto.themaw.net Link: http://lkml.kernel.org/r/20160812024841.12352.11975.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	b6e3795a06	autofs: fix pr_debug() message This isn't a return value, so change the message to indicate the status is the result of may_umount(). (or locate pr_debug() after put_user() with the same message) Link: http://lkml.kernel.org/r/20160812024836.12352.74628.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	41a4497a4f	autofs: don't fail to free_dev_ioctl(param) Returning -ENOTTY here fails to free dynamically allocated param. Link: http://lkml.kernel.org/r/20160812024815.12352.69153.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	eea618e6d5	autofs: remove obsolete sb fields These two were left from commit `aa55ddf340` ("autofs4: remove unused ioctls") which removed unused ioctls. Link: http://lkml.kernel.org/r/20160812024810.12352.96377.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	ca552599bf	autofs: use autofs4_free_ino() to kfree dentry data kfree dentry data allocated by autofs4_new_ino() with autofs4_free_ino() instead of raw kfree. (since we have the interface to free autofs_info*) This patch was modified to remove the need to set the dentry info field to NULL due to a change in the previous patch. Link: http://lkml.kernel.org/r/20160812024805.12352.43650.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Ian Kent	1574fa7beb	autofs: remove ino free in autofs4_dir_symlink() The inode allocation failure case in autofs4_dir_symlink() frees the autofs dentry info of the dentry without setting ->d_fsdata to NULL. That could lead to a double free so just get rid of the free and leave it to ->d_release(). Link: http://lkml.kernel.org/r/20160812024759.12352.10653.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	97537b35b6	autofs: add WARN_ON(1) for non dir/link inode case It's invalid if the given mode is neither dir nor link, so warn on else case. Link: http://lkml.kernel.org/r/20160812024754.12352.8536.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Ian Kent	1973a12269	autofs: fix autofs4_fill_super() error exit handling Somewhere along the line the error handling gotos have become incorrect. Link: http://lkml.kernel.org/r/20160812024749.12352.15100.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	749800ef53	autofs: test autofs versions first on sb initialization This patch does what the below comment says. It could be and it's considered better to do this first before various functions get called during initialization. /* Couldn't this be tested earlier? */ Link: http://lkml.kernel.org/r/20160812024744.12352.43075.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Tomohiro Kusumi	4a44c1859f	autofs: drop unnecessary extern in autofs_i.h autofs4_kill_sb() doesn't need to be declared as extern, and no other functions in .h are explicitly declared as extern. Link: http://lkml.kernel.org/r/20160812024739.12352.99354.stgit@pluto.themaw.net Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:31 -07:00
Vlastimil Babka	2d19309cf8	fs/select: add vmalloc fallback for select(2) The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows with the number of fds passed. We had a customer report page allocation failures of order-4 for this allocation. This is a costly order, so it might easily fail, as the VM expects such allocation to have a lower-order fallback. Such trivial fallback is vmalloc(), as the memory doesn't have to be physically contiguous and the allocation is temporary for the duration of the syscall only. There were some concerns, whether this would have negative impact on the system by exposing vmalloc() to userspace. Although an excessive use of vmalloc can cause some system wide performance issues - TLB flushes etc. - a large order allocation is not for free either and an excessive reclaim/compaction can have a similar effect. Also note that the size is effectively limited by RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the bitmaps will fit well within a single page and thus the vmalloc() fallback could be only exercised for processes where root allows a higher limit. Note that the poll(2) syscall seems to use a linked list of order-0 pages, so it doesn't need this kind of fallback. [eric.dumazet@gmail.com: fix failure path logic] [akpm@linux-foundation.org: use proper type for size] Link: http://lkml.kernel.org/r/20160927084536.5923-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:30 -07:00
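The size argument in the commit above can be checked with a little arithmetic. The sketch below estimates how many bytes the select(2) bitmaps need for a given descriptor count. It assumes six bitmaps per call (in/out/except for both input and result), each rounded up to whole longs; that count and both helper names are assumptions made for illustration rather than something stated in the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bytes for one fd bitmap covering 'nfds' descriptors, in whole longs. */
static size_t fd_bitmap_bytes(size_t nfds)
{
    size_t bits_per_long = sizeof(long) * CHAR_BIT;

    assert(nfds <= INT_MAX);   /* select(2) takes nfds as an int */
    return (nfds + bits_per_long - 1) / bits_per_long * sizeof(long);
}

/* Assumed total for one call: six bitmaps of equal size. */
static size_t select_alloc_bytes(size_t nfds)
{
    return 6 * fd_bitmap_bytes(nfds);
}
```

Under these assumptions, 1024 descriptors need 768 bytes, comfortably inside one 4096-byte page, while 65536 descriptors need 49152 bytes, which a physically contiguous allocator would have to satisfy with a sixteen-page (order-4) block.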
Darrick J. Wong	25f4c41415	block: implement (some of) fallocate for block devices After much discussion, it seems that the fallocate feature flag FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is set. A length that goes past the end of the device will be clamped to the device size if KEEP_SIZE is set; or will return -EINVAL if not. Both start and length must be aligned to the device's logical block size. Since the semantics of fallocate are fairly well established already, wire up the two pieces. The other fallocate variants (collapse range, insert range, and allocate blocks) are not supported. Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> # tweaked header Cc: Brian Foster <bfoster@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:30 -07:00
Guozhonghua	0cc482ee41	ocfs2: fix memory leak in dlm_migrate_request_handler() In the dlm_migrate_request_handler(), when `ret' is -EEXIST, the mle should be freed, otherwise the memory will be leaked. Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D3522A@H3CMLB12-EX.srv.huawei-3com.com Signed-off-by: Guozhonghua <guozhonghua@h3c.com> Reviewed-by: Mark Fasheh <mfasheh@versity.com> Cc: Eric Ren <zren@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-11 15:06:30 -07:00
Linus Torvalds	d09ba13110	Merge tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "Aside from the recently added pmem sub-division support these have been in -next for several releases with no reported issues. The sub-division support was included in next-20161010 with no reported issues. It passes all unit tests including new tests for all the new functionality below. Summary: - PMEM sub-division support: Allow a single PMEM region to be divided into multiple namespaces. Originally, ~2 years ago, it was thought that partitions of a /dev/pmemX block device could handle sub-allocations of persistent memory for different use cases. With the decision to not support DAX mappings of raw block-devices, and the genesis of device-dax, the need for having multiple pmem-namespace per region has grown. - Device-DAX unified inode: In support of dynamic-resizing of a device-dax instance the kernel arranges for all mappings of a device-dax node to share the same inode. This allows unmap / truncate / invalidation events to affect all instances of the device similar to the behavior of mmap on block devices. 
- Hardware error scrubbing reworks: The original address-range-scrub and badblocks tracking solution allowed clearing entries at the individual namespace level, but it failed to clear the internal list of media errors maintained at the bus level. The result was that the next scrub or namespace disable/re-enable event would restore the cleared badblocks, but now that is fixed. The v4.8 kernel introduced an auto-scrub-on-machine-check behavior to repopulate the badblocks list. Now, in v4.9, the auto-scrub behavior can be disabled and simply arrange for the error reported in the machine-check to be added to the list. - DIMM health-event notification support: ACPI 6.1 defines a notification event code that can be sent to ACPI NVDIMM devices. A poll(2) capable file descriptor for these events can be obtained from the nmemX/nfit/flags sysfs-attribute of a libnvdimm memory device. - Miscellaneous fixes: NVDIMM-N probe error, device-dax build error, and a change to dedup the flush hint list to not flush the memory controller more than necessary" * tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits) /dev/dax: fix Kconfig dependency build breakage dax: use correct dev_t value dax: convert devm_create_dax_dev to PTR_ERR libnvdimm, namespace: allow creation of multiple pmem-namespaces per region libnvdimm, namespace: lift single pmem limit in scan_labels() libnvdimm, namespace: filter out of range labels in scan_labels() libnvdimm, namespace: enable allocation of multiple pmem namespaces libnvdimm, namespace: update label implementation for multi-pmem libnvdimm, namespace: expand pmem device naming scheme for multi-pmem libnvdimm, region: update nd_region_available_dpa() for multi-pmem support libnvdimm, namespace: sort namespaces by dpa at init libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time tools/testing/nvdimm: support for sub-dividing a pmem region libnvdimm, namespace: unify blk and pmem label 
scanning libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper libnvdimm, label: convert label tracking to a linked list libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc nvdimm: reduce duplicated wpq flushes libnvdimm: clear the internal poison_list when clearing badblocks pmem: reduce kmap_atomic sections to the memcpys only ...	2016-10-11 12:19:31 -07:00
Linus Torvalds	f29135b54b	Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is a big variety of fixes and cleanups. Liu Bo continues to fixup fuzzer related problems, and some of Josef's cleanups are prep for his bigger extent buffer changes (slated for v4.10)" * 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits) Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf Btrfs: don't BUG() during drop snapshot btrfs: fix btrfs_no_printk stub helper Btrfs: memset to avoid stale content in btree leaf btrfs: parent_start initialization cleanup btrfs: Remove already completed TODO comment btrfs: Do not reassign count in btrfs_run_delayed_refs btrfs: fix a possible umount deadlock Btrfs: fix memory leak in do_walk_down btrfs: btrfs_debug should consume fs_info when DEBUG is not defined btrfs: convert send's verbose_printk to btrfs_debug btrfs: convert pr_* to btrfs_* where possible btrfs: convert printk(KERN_* to use pr_* calls btrfs: unsplit printed strings btrfs: clean the old superblocks before freeing the device Btrfs: kill BUG_ON in run_delayed_tree_ref Btrfs: don't leak reloc root nodes on error btrfs: squash lines for simple wrapper functions Btrfs: improve check_node to avoid reading corrupted nodes ...	2016-10-11 11:23:06 -07:00
Linus Torvalds	4c609922a3	Merge tag 'upstream-4.9-rc1' of git://git.infradead.org/linux-ubifs Pull UBI/UBIFS updates from Richard Weinberger: "This pull request contains: - Fixes for both UBI and UBIFS - overlayfs support (O_TMPFILE, RENAME_WHITEOUT/EXCHANGE) - Code refactoring for the upcoming MLC support" [ Ugh, we just got rid of the "rename2()" naming for the extended rename functionality. And this re-introduces it in ubifs with the cross-renaming and whiteout support. 
But rather than do any re-organizations in the merge itself, the naming can be cleaned up later ] * tag 'upstream-4.9-rc1' of git://git.infradead.org/linux-ubifs: (27 commits) UBIFS: improve function-level documentation ubifs: fix host xattr_len when changing xattr ubifs: Use move variable in ubifs_rename() ubifs: Implement RENAME_EXCHANGE ubifs: Implement RENAME_WHITEOUT ubifs: Implement O_TMPFILE ubi: Fix Fastmap's update_vol() ubi: Fix races around ubi_refill_pools() ubi: Deal with interrupted erasures in WL UBI: introduce the VID buffer concept UBI: hide EBA internals UBI: provide an helper to query LEB information UBI: provide an helper to check whether a LEB is mapped or not UBI: add an helper to check lnum validity UBI: simplify LEB write and atomic LEB change code UBI: simplify recover_peb() code UBI: move the global ech and vidh variables into struct ubi_attach_info UBI: provide helpers to allocate and free aeb elements UBI: fastmap: use ubi_io_{read, write}_data() instead of ubi_io_{read, write}() UBI: fastmap: use ubi_rb_for_each_entry() in unmap_peb() ...	2016-10-11 10:49:44 -07:00
Linus Torvalds	6b5e09a748	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: 1) Netfilter list handling fix, from Linus. 2) RXRPC/AFS bug fixes from David Howells (oops on call to serviceless endpoints, build warnings, missing notifications, etc.) From David Howells. 3) Kernel log message missing newlines, from Colin Ian King. 4) Don't enter direct reclaim in netlink dumps, the idea is to use a high order allocation first and fallback quickly to a 0-order allocation if such a high-order one cannot be done cheaply and without reclaim. From Eric Dumazet. 5) Fix firmware download errors in btusb bluetooth driver, from Ethan Hsieh. 6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven. 7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott. 8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej Żenczykowski. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits) netfilter: Fix slab corruption. be2net: Enable VF link state setting for BE3 be2net: Fix TX stats for TSO packets be2net: Update Copyright string in be_hw.h be2net: NCSI FW section should be properly updated with ethtool for BE3 be2net: Provide an alternate way to read pf_num for BEx chips wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent() net: macb: NULL out phydev after removing mdio bus xen-netback: make sure that hashes are not send to unaware frontends Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion MAINTAINERS: add myself as a maintainer of xen-netback ipv6 addrconf: disallow rtr_solicits < -1 Bluetooth: btusb: Fix atheros firmware download error drivers: net: phy: Correct duplicate MDIO_XGENE entry ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM net: ethernet: mediatek: remove hwlro property in the device tree net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi net: ethernet: mediatek: get the chip id by ETHDMASYS registers 
net: bgmac: Fix errant feature flag check netlink: do not enter direct reclaim from netlink_dump() ...	2016-10-11 08:10:19 -07:00
Linus Torvalds	101105b171	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: ">rename2() work from Miklos + current_time() from Deepa" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Replace current_fs_time() with current_time() fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps fs: Replace CURRENT_TIME with current_time() for inode timestamps fs: proc: Delete inode time initializations in proc_alloc_inode() vfs: Add current_time() api vfs: add note about i_op->rename changes to porting fs: rename "rename2" i_op to "rename" vfs: remove unused i_op->rename fs: make remaining filesystems use .rename2 libfs: support RENAME_NOREPLACE in simple_rename() fs: support RENAME_NOREPLACE for local filesystems ncpfs: fix unused variable warning	2016-10-10 20:16:43 -07:00
Al Viro	3873691e5a	Merge remote-tracking branch 'ovl/rename2' into for-linus	2016-10-10 23:02:51 -04:00
Linus Torvalds	97d2116708	Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs xattr updates from Al Viro: "xattr stuff from Andreas This completes the switch to xattr_handler ->get()/->set() from ->getxattr/->setxattr/->removexattr" * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Remove {get,set,remove}xattr inode operations xattr: Stop calling {get,set,remove}xattr inode operations vfs: Check for the IOP_XATTR flag in listxattr xattr: Add __vfs_{get,set,remove}xattr helpers libfs: Use IOP_XATTR flag for empty directory handling vfs: Use IOP_XATTR flag for bad-inode handling vfs: Add IOP_XATTR inode operations flag vfs: Move xattr_resolve_name to the front of fs/xattr.c ecryptfs: Switch to generic xattr handlers sockfs: Get rid of getxattr iop sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names kernfs: Switch to generic xattr handlers hfs: Switch to generic xattr handlers jffs2: Remove jffs2_{get,set,remove}xattr macros xattr: Remove unnecessary NULL attribute name check	2016-10-10 17:11:50 -07:00
Christoph Hellwig	feac470e36	xfs: convert COW blocks to real blocks before unwritten extent conversion We need to splice COW blocks we've completed in xfs_end_io_direct_write into the data fork before converting unwritten extents. Otherwise xfs_bmapi_write might first allocate blocks for any holes in the data fork, which isn't only not needed but also harmful as it might cause reserved block underruns in the transaction. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-11 09:03:19 +11:00
Emese Revfy	0766f788eb	latent_entropy: Mark functions with __latent_entropy The __latent_entropy gcc attribute can be used only on functions and variables. If it is on a function then the plugin will instrument it for gathering control-flow entropy. If the attribute is on a variable then the plugin will initialize it with random contents. The variable must be an integer, an integer array type or a structure with integer fields. These specific functions have been selected because they are init functions (to help gather boot-time entropy), are called at unpredictable times, or they have variable loops, each of which provide some level of latent entropy. Signed-off-by: Emese Revfy <re.emese@gmail.com> [kees: expanded commit message] Signed-off-by: Kees Cook <keescook@chromium.org>	2016-10-10 14:51:45 -07:00
Linus Torvalds	6763afe4b9	Merge tag 'dlm-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm fix from David Teigland: "This includes a bug fix for a bad memory access during workqueue cleanup, which can happen while shutting down the dlm networking layer" * tag 'dlm-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: free workqueues after the connections	2016-10-10 13:58:06 -07:00
Linus Torvalds	8dfb790b15	Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client Pull Ceph updates from Ilya Dryomov: "The big ticket item here is support for rbd exclusive-lock feature, with maintenance operations offloaded to userspace (Douglas Fuller, Mike Christie and myself). Another block device bullet is a series fixing up layering error paths (myself). On the filesystem side, we've got patches that improve our handling of buffered vs dio write races (Neil Brown) and a few assorted fixes from Zheng. 
Also included a couple of random cleanups and a minor CRUSH update" * tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits) crush: remove redundant local variable crush: don't normalize input of crush_ln iteratively libceph: ceph_build_auth() doesn't need ceph_auth_build_hello() libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello() ceph: fix description for rsize and rasize mount options rbd: use kmalloc_array() in rbd_header_from_disk() ceph: use list_move instead of list_del/list_add ceph: handle CEPH_SESSION_REJECT message ceph: avoid accessing / when mounting a subpath ceph: fix mandatory flock check ceph: remove warning when ceph_releasepage() is called on dirty page ceph: ignore error from invalidate_inode_pages2_range() in direct write ceph: fix error handling of start_read() rbd: add rbd_obj_request_error() helper rbd: img_data requests don't own their page array rbd: don't call rbd_osd_req_format_read() for !img_data requests rbd: rework rbd_img_obj_exists_submit() error paths rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback() rbd: move bumping img_request refcount into rbd_obj_request_submit() rbd: mark the original request as done if stat request fails ...	2016-10-10 13:52:05 -07:00
Chris Mason	19c4d2f994	Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" This reverts commit `5d8eb6fe51`. When we remove devices, we free the device structures. Delaying btrfs_remove_chunk() ends up hitting a use-after-free on them. Signed-off-by: Chris Mason <clm@fb.com>	2016-10-10 13:43:31 -07:00
Linus Torvalds	fed41f7d03	Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull splice fixups from Al Viro: "A couple of fixups for interaction of pipe-backed iov_iter with O_DIRECT reads + constification of a couple of primitives in uio.h missed by previous rounds. Kudos to davej - his fuzzing has caught those bugs" * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: [btrfs] fix check_direct_IO() for non-iovec iterators constify iov_iter_count() and iter_is_iovec() fix ITER_PIPE interaction with direct_IO	2016-10-10 13:38:49 -07:00
Linus Torvalds	abb5a14fa2	Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted misc bits and pieces. There are several single-topic branches left after this (rename2 series from Miklos, current_time series from Deepa Dinamani, xattr series from Andreas, uaccess stuff from me) and I'd prefer to send those separately" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits) proc: switch auxv to use of __mem_open() hpfs: support FIEMAP cifs: get rid of unused arguments of CIFSSMBWrite() posix_acl: uapi header split posix_acl: xattr representation cleanups fs/aio.c: eliminate redundant loads in put_aio_ring_file fs/internal.h: add const to ns_dentry_operations declaration compat: remove compat_printk() fs/buffer.c: make __getblk_slow() static proc: unsigned file descriptors fs/file: more unsigned file descriptors fs: compat: remove redundant check of nr_segs cachefiles: Fix attempt to read i_blocks after deleting file [ver #2] cifs: don't use memcpy() to copy struct iov_iter get rid of separate multipage fault-in primitives fs: Avoid premature clearing of capabilities fs: Give dentry to inode_change_ok() instead of inode fuse: Propagate dentry down to inode_change_ok() ceph: Propagate dentry down to inode_change_ok() xfs: Propagate dentry down to inode_change_ok() ...	2016-10-10 13:04:49 -07:00
Al Viro	cd27e45504	[btrfs] fix check_direct_IO() for non-iovec iterators looking for duplicate ->iov_base makes sense only for iovec-backed iterators; for kvec-backed ones it's pointless, for bvec-backed ones it's pointless and broken on 32bit (we walk through an array of struct bio_vec accessing them as if they were struct iovec; works by accident on 64bit, but on 32bit it'll blow up) and for pipe-backed ones it's pointless and ends up oopsing. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-10 13:58:16 -04:00
Al Viro	c3a6902404	fix ITER_PIPE interaction with direct_IO by making sure we call iov_iter_advance() on original iov_iter even if direct_IO (done on its copy) has returned 0. It's a no-op for old iov_iter flavours and does the right thing (== truncation of the stuff we'd allocated, but not filled) in ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with by cleanup in generic_file_read_iter(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-10 13:36:06 -04:00
Marcelo Ricardo Leitner	3a8db79889	dlm: free workqueues after the connections After backporting commit `ee44b4bc05` ("dlm: use sctp 1-to-1 API") series to a kernel with an older workqueue which didn't use RCU yet, it was noticed that we are freeing the workqueues in dlm_lowcomms_stop() too early as free_conn() will try to access that memory for canceling the queued works if any. This issue was introduced by commit `0d737a8cfd` as before it such attempt to cancel the queued works wasn't performed, so the issue was not present. This patch fixes it by simply inverting the free order. Cc: stable@vger.kernel.org Fixes: `0d737a8cfd` ("dlm: fix race while closing connections") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David Teigland <teigland@redhat.com>	2016-10-10 09:54:00 -05:00
Darrick J. Wong	6f97077ff6	xfs: rework refcount cow recovery error handling The error handling in xfs_refcount_recover_cow_leftovers is confused and can potentially leak memory, so rework it to release resources correctly on error. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-10 17:23:07 +11:00
Darrick J. Wong	1987fd7434	xfs: clear reflink flag if setting realtime flag Since we can only turn on the rt flag if there are no data extents, we can safely turn off the reflink flag if the rt flag is being turned on. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-10 16:49:29 +11:00
Darrick J. Wong	9780643cde	xfs: fix error initialization Eric Sandeen reported a gcc complaint about uninitialized error variables, so fix that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-10 16:49:18 +11:00
Darrick J. Wong	93fed47013	xfs: fix label inaccuracies Since we don't unlock anything on the way out, change the label. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-10 16:49:10 +11:00
Darrick J. Wong	97a1b87ea7	xfs: remove isize check from unshare operation Now that fallocate has an explicit unshare flag again, let's try to remove the inode reflink flag whenever the user unshares any part of a file since checking is cheap compared to the CoW. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-10 16:49:01 +11:00
Darrick J. Wong	024adf4870	xfs: reduce stack usage of _reflink_clear_inode_flag The loop in _reflink_clear_inode_flag isn't necessary since we jump out if any part of any extent is shared. Remove the loop and we no longer need two maps, so we can save some stack use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-10 16:47:40 +11:00
Darrick J. Wong	63646fc58d	xfs: check inode reflink flag before calling reflink functions There are a couple of places where we don't check the inode's reflink flag before calling into the reflink code. Fix those, and add some asserts so we don't make this mistake again. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-10 16:47:32 +11:00
Al Viro	e55f1d1d13	Merge remote-tracking branch 'jk/vfs' into work.misc	2016-10-08 11:06:08 -04:00
Al Viro	f334bcd94b	Merge remote-tracking branch 'ovl/misc' into work.misc	2016-10-08 11:00:01 -04:00
Al Viro	73e8fb2d59	Merge branch 'work.const-qstr' into work.misc	2016-10-08 10:44:55 -04:00
Al Viro	33e09f0ee7	Merge branch 'work.iget' into work.misc	2016-10-08 10:44:37 -04:00
Luis de Bethencourt	a17e7d2010	befs: befs: fix style issues in datastream.c Fixing the following checkpatch.pl errors: ERROR: "foo * bar" should be "foo *bar" + befs_blocknr_t blockno, befs_block_run * run); WARNING: Missing a blank line after declarations + struct buffer_head *bh; + befs_debug(sb, "---> %s length: %llu", __func__, len); WARNING: Block comments use * on subsequent lines + /* + Double indir block, plus all the indirect blocks it maps. (and other instances of these) Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:36 +01:00
Luis de Bethencourt	a20af5f9ea	befs: improve documentation in datastream.c Convert function descriptions to kernel-doc style. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:36 +01:00
Luis de Bethencourt	d327e612bd	befs: fix typos in datastream.c Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:35 +01:00
Luis de Bethencourt	02d91f97fd	befs: fix typos in btree.c Fixing typos in kernel-doc function descriptions in fs/befs/btree.c. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:34 +01:00
Luis de Bethencourt	103c0fb340	befs: fix style issues in super.c Fixing the following checkpatch.pl error: ERROR: "foo * bar" should be "foo *bar" +befs_load_sb(struct super_block *sb, befs_super_block * disk_sb) And the following warnings: WARNING: suspect code indent for conditional statements (8, 12) + if (disk_sb->fs_byte_order == BEFS_BYTEORDER_NATIVE_LE) + befs_sb->byte_order = BEFS_BYTESEX_LE; WARNING: suspect code indent for conditional statements (8, 12) + else if (disk_sb->fs_byte_order == BEFS_BYTEORDER_NATIVE_BE) + befs_sb->byte_order = BEFS_BYTESEX_BE; WARNING: break quoted strings at a space character + befs_error(sb, "blocksize(%u) cannot be larger" + "than system pagesize(%lu)", befs_sb->block_size, WARNING: line over 80 characters + if (befs_sb->log_start != befs_sb->log_end || befs_sb->flags == BEFS_DIRTY) { Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:34 +01:00
Luis de Bethencourt	11674239f9	befs: fix comment style The description of befs_load_sb was confusing the kernel-doc system since, because it starts with /**, it thinks it will document the function with kernel-doc formatting. Which it isn't. Fix other comment style issues in the file while we are at it. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:33 +01:00
Luis de Bethencourt	bbe1bd0b6b	befs: add check for ag_shift in superblock ag_shift and blocks_per_ag contain the same information in different ways, same as block_shift and block_size do. It is worth checking that these two are consistent, but since blocks_per_ag isn't documented as mandatory, some implementations of befs don't enforce this; a mismatch is therefore non-fatal and only produces a warning. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:31 +01:00
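The consistency check described in the ag_shift commit can be illustrated with a small userspace sketch. This is not the befs driver code: the struct and function names (`befs_sb_sketch`, `ag_shift_matches`, `check_ag_shift`) are hypothetical, and the only things taken from the commit message are the two field names and the decision to warn rather than fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the two superblock fields named in the commit. */
struct befs_sb_sketch {
	uint32_t blocks_per_ag;	/* blocks per allocation group */
	uint32_t ag_shift;	/* log2 of the same quantity */
};

/* True when both fields encode the same allocation group size. */
static bool ag_shift_matches(const struct befs_sb_sketch *sb)
{
	/* A shift of 32 or more cannot describe a 32-bit block count. */
	if (sb->ag_shift >= 32)
		return false;
	return (UINT32_C(1) << sb->ag_shift) == sb->blocks_per_ag;
}

/* Non-fatal check: warn on mismatch, but never reject the superblock. */
static int check_ag_shift(const struct befs_sb_sketch *sb)
{
	assert(sb != NULL);
	if (!ag_shift_matches(sb))
		fprintf(stderr, "warning: ag_shift disagrees with blocks_per_ag\n");
	return 0;	/* always success, following the commit's "non-fatal" choice */
}
```

Returning success on a mismatch mirrors the commit's reasoning that some befs implementations do not keep blocks_per_ag in sync, so a mismatch alone should not block a mount.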
Luis de Bethencourt	d1a8c70676	befs: dump inode_size superblock information befs_dump_super_block() wasn't giving the inode_size information when dumping all elements of the superblock. Add this element to have complete information of the superblock. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:29 +01:00
Salah Triki	78f647c27f	befs: remove unnecessary initialization There is no need to init block, since it will be overwritten later by iaddr2blockno(). Signed-off-by: Salah Triki <salah.triki@gmail.com> Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:28 +01:00
Salah Triki	2ac636b4d0	befs: fix typo in befs_sb_info Fixing jornal to Journal. Signed-off-by: Salah Triki <salah.triki@gmail.com> Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:27 +01:00
Salah Triki	6ea4558f9b	befs: add flags field to validate superblock state For validating superblock state, add flags field to befs_sb_info, read the state from the disk and check if it is equal to BEFS_DIRTY. Signed-off-by: Salah Triki <salah.triki@gmail.com> Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:27 +01:00
Luis de Bethencourt	bb75e66627	befs: fix typo in befs_find_key Fixing skeep to skip. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:26 +01:00
Luis de Bethencourt	672a8515ee	befs: remove unused BEFS_BT_PARMATCH befs_btree_find(), the only caller of befs_find_key(), only cares about if the return from that function is BEFS_BT_MATCH or not. It never uses the partial match given with BEFS_BT_PARMATCH. Make the overflow return clearer by having BEFS_BT_OVERFLOW instead of BEFS_BT_PARMATCH. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:26 +01:00
Salah Triki	33c712b4fc	fs: befs: remove ret variable ret is initialized to -EIO and is never modified, so remove ret and use -EIO directly. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:25 +01:00
Salah Triki	abcf911691	fs: befs: remove in vain variable assignment There is no need to init res, since it will be overwritten later by befs_fblock2brun(). Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:24 +01:00
Salah Triki	f30661035b	fs: befs: remove unnecessary befs_sb variable Remove befs_sb and just call BEFS_SB(sb) directly, since the returned value by this function is only used once. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:23 +01:00
Salah Triki	143d2a615f	fs: befs: remove useless initialization to zero node_off is unconditionally set to bt_super.root_node_ptr, so no need to init it to zero. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:23 +01:00
Salah Triki	88ff34446b	fs: befs: remove in vain variable assignment There is no need to set *value, it will be overwritten later. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:22 +01:00
Salah Triki	a26bc1adc7	fs: befs: Insert NULL inode to dentry As VFS expects, lookup inserts NULL inode to dentry when the named inode does not exist. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:21 +01:00
Salah Triki	d70ee4f2de	fs: befs: Remove useless calls to brelse in befs_find_brun_dblindirect The calls to brelse are useless since dbl_indir_block and indir_block are NULL. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:20 +01:00
Salah Triki	4bb594329a	fs: befs: Coding style fix Constant has to be capitalized. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:20 +01:00
Salah Triki	d84e4a5a09	fs: befs: Remove redundant validation from befs_find_brun_direct The only caller of befs_find_brun_direct is befs_fblock2brun, which already validates that the block is within the range of direct blocks. So remove the duplicate validation. Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:19 +01:00
Luis de Bethencourt	2dfa8a6e56	befs: fix typo in befs_bt_read_node documentation Fixing a grammatical error in the documentation. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:18 +01:00
Luis de Bethencourt	cfe0cb20e6	befs: in memory free_node_ptr and max_size never read The only place the values of free_node_ptr and max_size are read is in befs_dump_index_entry(), which both times it is called, it is passed the on disk superblock. Removing assignment of unused values. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:17 +01:00
Luis de Bethencourt	4c3897cce0	befs: make consistent use of befs_error() befs_error() is used in potential errors that could happen in befs to provide informational log messages. befs_debug() is silent when CONFIG_BEFS_DEBUG=no, and very verbose when switched on, which is why it is used for general debugging but not for errors. Fix a few cases where the befs debug utility usage isn't following the expected pattern. To make sure we have consistent information in the logs. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:16 +01:00
Luis de Bethencourt	9ae51a32b1	befs: use simpler while loop Replace goto with simpler while loop to make befs_readdir() more readable. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:16 +01:00
Luis de Bethencourt	50858ef96d	befs: remove constant variable Use macro directly instead of via assigning it to an unchanging variable. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Acked-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:15 +01:00
Luis de Bethencourt	f7769f9cf9	befs: avoid dereferencing dentry twice No need to dereference dentry twice to get the name when we already have it stored in a local variable. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:15 +01:00
Luis de Bethencourt	39dcfd3b34	fs: befs: remove comment that confuses kernel-doc This comment with a mysterious unfinished line confuses the kernel-doc system since, because it starts with /**, it thinks it is documenting a function. Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>	2016-10-08 10:01:14 +01:00
Luis de Bethencourt	a64998504e	fs: befs: check silent flag before logging error Log error only when silent flag is not set. Fixes: dbe6460388bc ("fs/befs/linuxvfs.c: check silent flag before logging errors") Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Acked-by: Salah Triki <salah.triki@gmail.com>	2016-10-08 10:01:13 +01:00
Salah Triki	f7f675406b	fs: befs: replace befs_bread by sb_bread Since befs_bread merely calls sb_bread, replace it by sb_bread. Link: http://lkml.kernel.org/r/1466800258-4542-1-git-send-email-salah.triki@gmail.com Signed-off-by: Salah Triki <salah.triki@gmail.com> Acked-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:12 +01:00
Luis de Bethencourt	f0f2536fe3	befs: remove unused functions befs_iaddr_is_empty() and befs_brun_size() are unused. Remove them. Link: http://lkml.kernel.org/r/1465700235-22881-3-git-send-email-luisbg@osg.samsung.com Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:12 +01:00
Luis de Bethencourt	10145d6116	befs: fix function name in documentation Documentation of function befs_load_sb() lists it as load_befs_sb(). Fix the misnomer. Link: http://lkml.kernel.org/r/1465700235-22881-2-git-send-email-luisbg@osg.samsung.com Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:11 +01:00
Luis de Bethencourt	173b066f58	befs: check return of sb_min_blocksize Confirm sb_min_blocksize() succeeded before continuing. Link: http://lkml.kernel.org/r/1465700235-22881-1-git-send-email-luisbg@osg.samsung.com Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:10 +01:00
Salah Triki	c08f1cb627	fs: befs: remove useless pr_err in befs_init_inodecache() Remove pr_err since kmem_cache_create log error and dump stack. Link: http://lkml.kernel.org/r/e6d03cbc9542495dc6174b59e32fcd41c1393cfc.1464226521.git.salah.triki@acm.org Signed-off-by: Salah Triki <salah.triki@acm.org>	2016-10-08 10:01:10 +01:00
Salah Triki	e808792784	fs/befs/linuxvfs.c: remove useless befs_error Remove befs_error since when kmalloc fails there is a generic out of memory and stack dump. Link: http://lkml.kernel.org/r/3de4d388d98bbb570462a5eb8e64623e17fb5d74.1464226521.git.salah.triki@acm.org Signed-off-by: Salah Triki <salah.triki@acm.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:10 +01:00
Salah Triki	c625426fb6	fs/befs/linuxvfs.c: remove useless pr_err in befs_fill_super() Remove pr_err since when kzalloc fails there is a generic out of memory and stack dump. Link: http://lkml.kernel.org/r/c5a7f2d42ec0fc8465c118248e88cd221c483391.1464226521.git.salah.triki@acm.org Signed-off-by: Salah Triki <salah.triki@acm.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:09 +01:00
Salah Triki	dceee2e230	fs/befs/linuxvfs.c: check silent flag before logging errors Log errors only when silent flag is not set. Link: http://lkml.kernel.org/r/d400aaf5a7430de79bd956e40ec075fb1cb08474.1464226521.git.salah.triki@acm.org Signed-off-by: Salah Triki <salah.triki@acm.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:08 +01:00
Salah Triki	30982583e4	fs/befs/linuxvfs.c: move useless assignment Control is transferred to unacquire_none when sb->s_fs_info is equal to NULL, so the assignment to NULL is useless and it is moved above unacquire_none. Link: http://lkml.kernel.org/r/ed41da113fc693c7daa4e8813ca04cc766ddfc05.1464226521.git.salah.triki@acm.org Signed-off-by: Salah Triki <salah.triki@acm.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-10-08 10:01:08 +01:00
Linus Torvalds	b66484cd74	Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton: - fsnotify updates - ocfs2 updates - all of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits) console: don't prefer first registered if DT specifies stdout-path cred: simpler, 1D supplementary groups CREDITS: update Pavel's information, add GPG key, remove snail mail address mailmap: add Johan Hovold .gitattributes: set git diff driver for C source code files uprobes: remove function declarations from arch/{mips,s390} spelling.txt: "modeled" is spelt correctly nmi_backtrace: generate one-line reports for idle cpus arch/tile: adopt the new nmi_backtrace framework nmi_backtrace: do a local dump_stack() instead of a self-NMI nmi_backtrace: add more trigger_*_cpu_backtrace() methods min/max: remove sparse warnings when they're nested Documentation/filesystems/proc.txt: add more description for maps/smaps mm, proc: fix region lost in /proc/self/smaps proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self proc: add LSM hook checks to /proc/<tid>/timerslack_ns proc: relax /proc/<tid>/timerslack_ns capability requirements meminfo: break apart a very long seq_printf with #ifdefs seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char proc: faster /proc/*/status ...	2016-10-07 21:38:00 -07:00
Andreas Gruenbacher	fd50ecaddf	vfs: Remove {get,set,remove}xattr inode operations These inode operations are no longer used; remove them. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 21:48:36 -04:00
Alexey Dobriyan	81243eacfa	cred: simpler, 1D supplementary groups Current supplementary groups code can massively overallocate memory and is implemented in a way so that access to individual gid is done via 2D array. If number of gids is <= 32, memory allocation is more or less tolerable (140/148 bytes). But if it is not, code allocates full page (!) regardless and, what's even more fun, doesn't reuse small 32-entry array. 2D array means dependent shifts, loads and LEAs without possibility to optimize them (gid is never known at compile time). All of the above is unnecessary. Switch to the usual trailing-zero-len-array scheme. Memory is allocated with kmalloc/vmalloc() and only as much as needed. Accesses become simpler (LEA 8(gi,idx,4) or even without displacement). Maximum number of gids is 65536 which translates to 256KB+8 bytes. I think kernel can handle such allocation. On my usual desktop system with whole 9 (nine) aux groups, struct group_info shrinks from 148 bytes to 44 bytes, yay! Nice side effects: - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing, - fix little mess in net/ipv4/ping.c should have been using GROUP_AT macro but this point becomes moot, - aux group allocation is persistent and should be accounted as such. Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
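The size figures quoted in the supplementary groups commit (44 bytes for 9 groups, 256KB+8 bytes for 65536 groups) follow from a header plus a trailing array. The sketch below is a userspace model under stated assumptions, not the kernel's struct group_info: the type and function names are hypothetical, the two header fields are assumed to be 4 bytes each, and each gid is assumed to be 4 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model: two 4-byte header fields plus a trailing gid array. */
struct group_info_sketch {
	int32_t usage;		/* stand-in for the reference count */
	int32_t ngroups;	/* number of entries in gid[] */
	uint32_t gid[];		/* flexible array member, sized at allocation */
};

/* Bytes needed for a given number of groups: header + one entry per gid. */
static size_t group_info_bytes(size_t ngroups)
{
	return sizeof(struct group_info_sketch) + ngroups * sizeof(uint32_t);
}

/* Allocate exactly as much as needed; the caller releases it with free(). */
static struct group_info_sketch *groups_alloc_sketch(int32_t ngroups)
{
	struct group_info_sketch *gi;

	assert(ngroups >= 0);
	gi = malloc(group_info_bytes((size_t)ngroups));
	if (!gi)
		return NULL;
	gi->usage = 1;
	gi->ngroups = ngroups;
	return gi;
}
```

With this layout an individual gid is reached with a single indexed access, `gi->gid[i]`, which is the simplification the commit message contrasts with the earlier two-dimensional lookup.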
Robert Ho	855af072b6	mm, proc: fix region lost in /proc/self/smaps Recently, Redhat reported that nvml test suite failed on QEMU/KVM, more detailed info please refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1365721 Actually, this bug is not only for NVDIMM/DAX but also for any other file systems. This simple test case abstracted from nvml can easily reproduce this bug in common environment: -------------------------- testcase.c ----------------------------- int is_pmem_proc(const void *addr, size_t len) { const char *caddr = addr; FILE *fp; if ((fp = fopen("/proc/self/smaps", "r")) == NULL) { printf("!/proc/self/smaps"); return 0; } int retval = 0; /* assume false until proven otherwise */ char line[PROCMAXLEN]; /* for fgets() */ char *lo = NULL; /* beginning of current range in smaps file */ char *hi = NULL; /* end of current range in smaps file */ int needmm = 0; /* looking for mm flag for current range */ while (fgets(line, PROCMAXLEN, fp) != NULL) { static const char vmflags[] = "VmFlags:"; static const char mm[] = " wr"; /* check for range line */ if (sscanf(line, "%p-%p", &lo, &hi) == 2) { if (needmm) { /* last range matched, but no mm flag found */ printf("never found mm flag.\n"); break; } else if (caddr < lo) { /* never found the range for caddr */ printf("#######no match for addr %p.\n", caddr); break; } else if (caddr < hi) { /* start address is in this range */ size_t rangelen = (size_t)(hi - caddr); /* remember that matching has started */ needmm = 1; /* calculate remaining range to search for */ if (len > rangelen) { len -= rangelen; caddr += rangelen; printf("matched %zu bytes in range " "%p-%p, %zu left over.\n", rangelen, lo, hi, len); } else { len = 0; printf("matched all bytes in range " "%p-%p.\n", lo, hi); } } } else if (needmm && strncmp(line, vmflags, sizeof(vmflags) - 1) == 0) { if (strstr(&line[sizeof(vmflags) - 1], mm) != NULL) { printf("mm flag found.\n"); if (len == 0) { /* entire range matched */ retval = 1; break; } needmm = 0; /* saw what was needed */ }
else { /* mm flag not set for some or all of range */ printf("range has no mm flag.\n"); break; } } } fclose(fp); printf("returning %d.\n", retval); return retval; } void *Addr; size_t Size; /* * worker -- the work each thread performs */ static void *worker(void *arg) { int *ret = (int *)arg; *ret = is_pmem_proc(Addr, Size); return NULL; } int main(int argc, char *argv[]) { if (argc < 2 || argc > 3) { printf("usage: %s file [env].\n", argv[0]); return -1; } int fd = open(argv[1], O_RDWR); struct stat stbuf; fstat(fd, &stbuf); Size = stbuf.st_size; Addr = mmap(0, stbuf.st_size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); close(fd); pthread_t threads[NTHREAD]; int ret[NTHREAD]; /* kick off NTHREAD threads */ for (int i = 0; i < NTHREAD; i++) pthread_create(&threads[i], NULL, worker, &ret[i]); /* wait for all the threads to complete */ for (int i = 0; i < NTHREAD; i++) pthread_join(threads[i], NULL); /* verify that all the threads return the same value */ for (int i = 1; i < NTHREAD; i++) { if (ret[0] != ret[i]) { printf("Error i %d ret[0] = %d ret[i] = %d.\n", i, ret[0], ret[i]); } } printf("%d", ret[0]); return 0; } It failed as some threads can not find the memory region in "/proc/self/smaps" which is allocated in the main process. It is caused by proc fs which uses 'file->version' to indicate the VMA that is the last one has already been handled by read() system call. When the next read() issues, it uses the 'version' to find the VMA, then the next VMA is what we want to handle, the related code is as follows: if (last_addr) { vma = find_vma(mm, last_addr); if (vma && (vma = m_next_vma(priv, vma))) return vma; } However, VMA will be lost if the last VMA is gone, e.g: The process VMA list is A->B->C->D CPU 0 CPU 1 read() system call handle VMA B version = B return to userspace unmap VMA B issue read() again to continue to get the region info find_vma(version) will get VMA C m_next_vma(C) will get VMA D handle D !!! VMA C is lost !!!
In order to fix this bug, we make 'file->version' indicate the end address of the current VMA. m_start will then look up a vma which with vma_start < last_vm_end and moves on to the next vma if we found the same or an overlapping vma. This will guarantee that we will not miss an exclusive vma but we can still miss one if the previous vma was shrunk. This is acceptable because guaranteeing "never miss a vma" is simply not feasible. User has to cope with some inconsistencies if the file is not read in one go. [mhocko@suse.com: changelog fixes] Link: http://lkml.kernel.org/r/1475296958-27652-1-git-send-email-robert.hu@intel.com Acked-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Robert Hu <robert.hu@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
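The A->B->C->D scenario from the smaps commit can be replayed in a small userspace model. This is a sketch that follows the wording of the commit message (resume from the end address of the last VMA and skip only a VMA whose start lies below it); it is not the code in fs/proc/task_mmu.c, and every name here (`vma_sketch`, `find_vma_sketch`, `resume_old`, `resume_new`) is hypothetical.

```c
#include <assert.h>

/* Hypothetical VMA model: a half-open range [start, end) that may be unmapped. */
struct vma_sketch {
	unsigned long start, end;
	int mapped;
};

/* Like find_vma(): index of the first mapped VMA with end > addr, or -1. */
static int find_vma_sketch(const struct vma_sketch *v, int n, unsigned long addr)
{
	for (int i = 0; i < n; i++)
		if (v[i].mapped && v[i].end > addr)
			return i;
	return -1;
}

/* Index of the next mapped VMA after i, or -1. */
static int next_vma_sketch(const struct vma_sketch *v, int n, int i)
{
	for (int j = i + 1; j < n; j++)
		if (v[j].mapped)
			return j;
	return -1;
}

/* Old scheme: remember the last VMA's start and always step past the hit. */
static int resume_old(const struct vma_sketch *v, int n, unsigned long last_start)
{
	int i = find_vma_sketch(v, n, last_start);

	return i >= 0 ? next_vma_sketch(v, n, i) : -1;
}

/* New scheme: remember the last VMA's end; skip only the same or an overlapping VMA. */
static int resume_new(const struct vma_sketch *v, int n, unsigned long last_end)
{
	int i;

	assert(last_end > 0);
	i = find_vma_sketch(v, n, last_end - 1);
	if (i >= 0 && v[i].start < last_end)
		i = next_vma_sketch(v, n, i);
	return i;
}
```

With B unmapped between two reads, `resume_old` lands on D and skips C, while `resume_new` lands on C. As the commit message notes, a VMA can still be missed if the previous one was shrunk, so the model shows the improvement rather than a complete guarantee.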
John Stultz	4b2bd5fec0	proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self In changing from checking ptrace_may_access(p, PTRACE_MODE_ATTACH_FSCREDS) to capable(CAP_SYS_NICE), I missed that ptrace_may_access succeeds when p == current, but the CAP_SYS_NICE doesn't. Thus while the previous commit was intended to loosen the needed privileges to modify a process's timerslack, it needlessly restricted a task modifying its own timerslack via the proc/<tid>/timerslack_ns (which is permitted also via the PR_SET_TIMERSLACK method). This patch corrects this by checking if p == current before checking the CAP_SYS_NICE value. This patch applies on top of my two previous patches currently in -mm Link: http://lkml.kernel.org/r/1471906870-28624-1-git-send-email-john.stultz@linaro.org Signed-off-by: John Stultz <john.stultz@linaro.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
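The ordering that the timerslack fix describes (test for self-modification first, and only then require CAP_SYS_NICE) can be written as a tiny decision function. This is a simplified model with hypothetical names and boolean inputs; it does not call any kernel or libc capability API and leaves out the LSM hook checks mentioned in the neighbouring commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the permission decision:
 *   target_is_current - the writer is adjusting its own timerslack
 *   has_cap_sys_nice  - the writer holds CAP_SYS_NICE
 * Returns 0 when the write is allowed, -EPERM otherwise.
 */
static int timerslack_write_allowed(bool target_is_current, bool has_cap_sys_nice)
{
	/* A task may always adjust itself, as with PR_SET_TIMERSLACK. */
	if (target_is_current)
		return 0;
	/* Adjusting another task needs the capability. */
	if (!has_cap_sys_nice)
		return -EPERM;
	return 0;
}
```

Before the fix, the capability test ran unconditionally, which corresponds to dropping the first branch: an unprivileged task writing to its own entry would then get -EPERM.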
John Stultz	904763e1fb	proc: add LSM hook checks to /proc/<tid>/timerslack_ns As requested, this patch checks the existing LSM hooks task_getscheduler/task_setscheduler when reading or modifying the task's timerslack value. Previous versions added new get/settimerslack LSM hooks, but since they checked the same PROCESS__SET/GETSCHED values as existing hooks, it was suggested we just use the existing ones. Link: http://lkml.kernel.org/r/1469132667-17377-2-git-send-email-john.stultz@linaro.org Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: James Morris <jmorris@namei.org> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
John Stultz	7abbaf9404	proc: relax /proc/<tid>/timerslack_ns capability requirements When an interface to allow a task to change another task's timerslack was first proposed, it was suggested that something greater than CAP_SYS_NICE would be needed, as a task could be delayed further than what normally could be done with nice adjustments. So CAP_SYS_PTRACE was adopted instead for what became the /proc/<tid>/timerslack_ns interface. However, for Android (where this feature originates), giving the system_server CAP_SYS_PTRACE would allow it to observe and modify all tasks' memory. This is considered too high a privilege level for only needing to change the timerslack. After some discussion, it was realized that a CAP_SYS_NICE process can set a task as SCHED_FIFO, so they could fork some spinning processes and set them all SCHED_FIFO 99, in effect delaying all other tasks for an infinite amount of time. So as a CAP_SYS_NICE task can already cause trouble for other tasks, using it as a required capability for accessing and modifying /proc/<tid>/timerslack_ns seems sufficient. Thus, this patch loosens the capability requirements to CAP_SYS_NICE and removes CAP_SYS_PTRACE, simplifying some of the code flow as well. This is technically an ABI change, but as the feature just landed in 4.6, I suspect no one is yet using it. Link: http://lkml.kernel.org/r/1469132667-17377-1-git-send-email-john.stultz@linaro.org Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Nick Kralevich <nnk@google.com> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Serge E. 
Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
Joe Perches	e16e2d8e14	meminfo: break apart a very long seq_printf with #ifdefs Use a specific routine to emit most lines so that the code is easier to read and maintain. akpm: text data bss dec hex filename 2976 8 0 2984 ba8 fs/proc/meminfo.o before 2669 8 0 2677 a75 fs/proc/meminfo.o after Link: http://lkml.kernel.org/r/8fce7fdef2ba081a4ef531594e97da8a9feebb58.1470810406.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
Joe Perches	75ba1d07fd	seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char Allow some seq_puts removals by taking a string instead of a single char. [akpm@linux-foundation.org: update vmstat_show(), per Joe] Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
Alexey Dobriyan	f7a5f132b4	proc: faster /proc/*/status top(1) opens the following files for every PID: /proc/*/stat /proc/*/statm /proc/*/status This patch switches /proc/*/status away from seq_printf(). The result is 13.5% speedup. Benchmark is open("/proc/self/status")+read+close 1.000.000 million times. BEFORE $ perf stat -r 10 taskset -c 3 ./proc-self-status Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs): 10748.474301 task-clock (msec) # 0.954 CPUs utilized ( +- 0.91% ) 12 context-switches # 0.001 K/sec ( +- 1.09% ) 1 cpu-migrations # 0.000 K/sec 104 page-faults # 0.010 K/sec ( +- 0.45% ) 37,424,127,876 cycles # 3.482 GHz ( +- 0.04% ) 8,453,010,029 stalled-cycles-frontend # 22.59% frontend cycles idle ( +- 0.12% ) 3,747,609,427 stalled-cycles-backend # 10.01% backend cycles idle ( +- 0.68% ) 65,632,764,147 instructions # 1.75 insn per cycle # 0.13 stalled cycles per insn ( +- 0.00% ) 13,981,324,775 branches # 1300.773 M/sec ( +- 0.00% ) 138,967,110 branch-misses # 0.99% of all branches ( +- 0.18% ) 11.263885428 seconds time elapsed ( +- 0.04% ) ^^^^^^^^^^^^ AFTER $ perf stat -r 10 taskset -c 3 ./proc-self-status Performance counter stats for 'taskset -c 3 ./proc-self-status' (10 runs): 9010.521776 task-clock (msec) # 0.925 CPUs utilized ( +- 1.54% ) 11 context-switches # 0.001 K/sec ( +- 1.54% ) 1 cpu-migrations # 0.000 K/sec ( +- 11.11% ) 103 page-faults # 0.011 K/sec ( +- 0.60% ) 32,352,310,603 cycles # 3.591 GHz ( +- 0.07% ) 7,849,199,578 stalled-cycles-frontend # 24.26% frontend cycles idle ( +- 0.27% ) 3,269,738,842 stalled-cycles-backend # 10.11% backend cycles idle ( +- 0.73% ) 56,012,163,567 instructions # 1.73 insn per cycle # 0.14 stalled cycles per insn ( +- 0.00% ) 11,735,778,795 branches # 1302.453 M/sec ( +- 0.00% ) 98,084,459 branch-misses # 0.84% of all branches ( +- 0.28% ) 9.741247736 seconds time elapsed ( +- 0.07% ) ^^^^^^^^^^^ Link: http://lkml.kernel.org/r/20160806125608.GB1187@p183.telecom.by Signed-off-by: Alexey 
Dobriyan <adobriyan@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:30 -07:00
zhong jiang	72e2936c04	mm: remove unnecessary condition in remove_inode_hugepages When the huge page is added to the page cache (huge_add_to_page_cache), the page private flag will be cleared. Since this code (remove_inode_hugepages) will only be called for pages in the page cache, PagePrivate(page) will always be false. The patch removes the code without any functional change. Link: http://lkml.kernel.org/r/1475113323-29368-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:29 -07:00
Yisheng Xie	461a718432	mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic page. No functional change with this patch. Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:29 -07:00
Huang Ying	8cd797887a	mm: remove page_file_index After using the offset of the swap entry as the key of the swap cache, the page_index() becomes exactly same as page_file_index(). So the page_file_index() is removed and the callers are changed to use page_index() instead. Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:28 -07:00
Aaron Lu	6fcb52a56f	thp: reduce usage of huge zero page's atomic counter The global zero page is used to satisfy an anonymous read fault. If THP(Transparent HugePage) is enabled then the global huge zero page is used. The global huge zero page uses an atomic counter for reference counting and is allocated/freed dynamically according to its counter value. CPU time spent on that counter will greatly increase if there are a lot of processes doing anonymous read faults. This patch proposes a way to reduce the access to the global counter so that the CPU load can be reduced accordingly. To do this, a new flag of the mm_struct is introduced: MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch the global counter in two cases: 1 The first time it uses the global huge zero page; 2 The time when mm_user of its mm_struct reaches zero. Note that right now, the huge zero page is eligible to be freed as soon as its last use goes away. With this patch, the page will not be eligible to be freed until the exit of the last process from which it was ever used. And with the use of mm_user, the kthread is not eligible to use huge zero page either. Since no kthread is using huge zero page today, there is no difference after applying this patch. But if that is not desired, I can change it to when mm_count reaches zero. Case used for test on Haswell EP: usemem -n 72 --readonly -j 0x200000 100G Which spawns 72 processes and each will mmap 100G anonymous space and then do read only access to that space sequentially with a step of 2MB. CPU cycles from perf report for base commit: 54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page CPU cycles from perf report for this commit: 0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page Performance(throughput) of the workload for base commit: 1784430792 Performance(throughput) of the workload for this commit: 4726928591 164% increase. 
Runtime of the workload for base commit: 707592 us Runtime of the workload for this commit: 303970 us 50% drop. Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:28 -07:00
James Morse	0f30206bf2	fs/proc/task_mmu.c: make the task_mmu walk_page_range() limit in clear_refs_write() obvious Trying to walk all of virtual memory requires architecture specific knowledge. On x86_64, addresses must be sign extended from bit 48, whereas on arm64 the top VA_BITS of address space have their own set of page tables. clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it provides a test_walk() callback that only expects to be walking over VMAs. Currently walk_pmd_range() will skip memory regions that don't have a VMA, reporting them as a hole. As this call only expects to walk user address space, make it walk 0 to 'highest_vm_end'. Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com Signed-off-by: James Morse <james.morse@arm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:28 -07:00
Toshi Kani	dbe6ec8156	ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings To support DAX pmd mappings with unmodified applications, filesystems need to align an mmap address by the pmd size. Call thp_get_unmapped_area() from f_op->get_unmapped_area. Note, there is no change in behavior for a non-DAX file. Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:28 -07:00
Joseph Qi	48e509ece9	ocfs2: fix undefined struct variable in inode.h The extern struct variable ocfs2_inode_cache is not defined. It was meant to use ocfs2_inode_cachep defined in super.c, I think. Fortunately it is not used anywhere now, so there is no actual impact. Clean it up to fix this mistake. Link: http://lkml.kernel.org/r/57E1E49D.8050503@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Eric Ren <zren@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar	055fdcff35	fs/ocfs2/dlm: remove deprecated create_singlethread_workqueue() The workqueue "dlm_worker" queues a single work item &dlm->dispatched_work and thus it doesn't require execution ordering. Hence, alloc_workqueue has been used to replace the deprecated create_singlethread_workqueue instance. The WQ_MEM_RECLAIM flag has been set to ensure forward progress under memory pressure. Since there is a fixed number of work items, an explicit concurrency limit is unnecessary here. Link: http://lkml.kernel.org/r/2b5ad8d6688effe1a9ddb2bc2082d26fbbe00302.1472590094.git.bhaktipriya96@gmail.com Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar	44be975691	fs/ocfs2/super: remove deprecated create_singlethread_workqueue() The workqueue "ocfs2_wq" queues multiple work items viz &osb->la_enable_wq, &journal->j_recovery_work, &os->os_orphan_scan_work, &osb->osb_truncate_log_wq which require strict execution ordering. Hence, an ordered dedicated workqueue has been used. WQ_MEM_RECLAIM has been set to ensure forward progress under memory pressure because the workqueue is being used on a memory reclaim path. Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar	bf940776c0	fs/ocfs2/cluster: remove deprecated create_singlethread_workqueue() The workqueue "o2net_wq" queues multiple work items viz &old_sc->sc_shutdown_work, &sc->sc_rx_work, &sc->sc_connect_work which require strict execution ordering. Hence, an ordered dedicated workqueue has been used. WQ_MEM_RECLAIM has been set to ensure forward progress under memory pressure. Link: http://lkml.kernel.org/r/ddc12e5766c79ba26f8a00d98049107f8a1d4866.1472590094.git.bhaktipriya96@gmail.com Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Bhaktipriya Shridhar	0b41be0763	fs/ocfs2/dlmfs: remove deprecated create_singlethread_workqueue() The workqueue "user_dlm_worker" queues a single work item &lockres->l_work per user_lock_res instance and so it doesn't require execution ordering. Hence, alloc_workqueue has been used to replace the deprecated create_singlethread_workqueue instance. The WQ_MEM_RECLAIM flag has been set to ensure forward progress under memory pressure. Since there is a fixed number of work items, an explicit concurrency limit is unnecessary here. Link: http://lkml.kernel.org/r/9748136d3a3b18138ad1d6ba708367aa1fe9f98c.1472590094.git.bhaktipriya96@gmail.com Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Jan Kara	ed2726406c	fsnotify: clean up spinlock assertions Use assert_spin_locked() macro instead of hand-made BUG_ON statements. Link: http://lkml.kernel.org/r/1474537439-18919-1-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Suggested-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Jan Kara	0b1b86527d	fanotify: fix possible false warning when freeing events When freeing permission events by fsnotify_destroy_event(), the warning WARN_ON(!list_empty(&event->list)); may falsely hit. This is because although fanotify_get_response() saw event->response set, there is nothing to make sure the current CPU also sees the removal of the event from the list. Add proper locking around the WARN_ON() to avoid the false warning. Link: http://lkml.kernel.org/r/1473797711-14111-7-git-send-email-jack@suse.cz Reported-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Jan Kara	073f65522a	fanotify: use notification_lock instead of access_lock Fanotify code has its own lock (access_lock) to protect a list of events waiting for a response from userspace. However this is somewhat awkward as the same list_head in the event is protected by notification_lock if it is part of the notification queue and by access_lock if it is part of the fanotify private queue which makes it difficult for any reliable checks in the generic code. So make fanotify use the same lock - notification_lock - for protecting its private event list. Link: http://lkml.kernel.org/r/1473797711-14111-6-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Jan Kara	c21dbe20f6	fsnotify: convert notification_mutex to a spinlock notification_mutex is used to protect the list of pending events. As such there's no reason to use a sleeping lock for it. Convert it to a spinlock. [jack@suse.cz: fixed version] Link: http://lkml.kernel.org/r/1474031567-1831-1-git-send-email-jack@suse.cz Link: http://lkml.kernel.org/r/1473797711-14111-5-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Jan Kara	1404ff3cc3	fsnotify: drop notification_mutex before destroying event fsnotify_flush_notify() and fanotify_release() destroy notification events while holding notification_mutex. The destruction of a fanotify event includes a path_put() call which may end up calling into a filesystem to delete an inode if we happen to be the last holders of dentry reference which happens to be the last holder of inode reference. That in turn may violate lock ordering for some filesystems since notification_mutex is also acquired e.g. during write when generating fanotify event. Also this is the only thing that forces notification_mutex to be a sleeping lock. So drop notification_mutex before destroying a notification event. Link: http://lkml.kernel.org/r/1473797711-14111-4-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Al Viro	41fefa36be	Merge remote-tracking branch 'fuse/xattr' into work.xattr	2016-10-07 20:10:55 -04:00
Andreas Gruenbacher	6c6ef9f26e	xattr: Stop calling {get,set,remove}xattr inode operations All filesystems that support xattrs by now do so via xattr handlers. They all define sb->s_xattr, and their getxattr, setxattr, and removexattr inode operations use the generic inode operations. On filesystems that don't support xattrs, the xattr inode operations are all NULL, and sb->s_xattr is also NULL. This means that we can remove the getxattr, setxattr, and removexattr inode operations and directly call the generic handlers, or better, inline expand those handlers into fs/xattr.c. Filesystems that do not support xattrs on some inodes should clear the IOP_XATTR i_opflags flag in those inodes. (Right now, some filesystems have checks to disable xattrs on some inodes in the ->list, ->get, and ->set xattr handler operations instead.) The IOP_XATTR flag is automatically cleared in inodes of filesystems that don't have xattr support. In orangefs, symlinks do have a setxattr iop but no getxattr iop. Add a check for symlinks to orangefs_inode_getxattr to preserve the current, weird behavior; that check may not be necessary though. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 20:10:44 -04:00
Andreas Gruenbacher	bf3ee71363	vfs: Check for the IOP_XATTR flag in listxattr When an inode doesn't support xattrs, turn listxattr off as well. (When xattrs are "turned off", the VFS still passes security xattr operations through to security modules, which can still expose inode security labels that way.) Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 20:10:44 -04:00
Andreas Gruenbacher	5d6c31910b	xattr: Add __vfs_{get,set,remove}xattr helpers Right now, various places in the kernel check for the existence of getxattr, setxattr, and removexattr inode operations and directly call those operations. Switch to helper functions and test for the IOP_XATTR flag instead. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 20:10:44 -04:00
Andreas Gruenbacher	f5c2443837	libfs: Use IOP_XATTR flag for empty directory handling Instead of special xattr inode operations, use the IOP_XATTR inode operations flag for the special libfs empty directories. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 20:10:43 -04:00
Andreas Gruenbacher	5f6e59ae82	vfs: Use IOP_XATTR flag for bad-inode handling With this change, all the xattr handler based operations will produce an -EIO result for bad inodes, and we no longer only depend on inode->i_op to be set to bad_inode_ops. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 20:10:43 -04:00
Andreas Gruenbacher	d0a5b995a3	vfs: Add IOP_XATTR inode operations flag The IOP_XATTR inode operations flag in inode->i_opflags indicates that the inode has xattr support. The flag is automatically set by new_inode() on filesystems with xattr support (where sb->s_xattr is defined), and cleared otherwise. Filesystems can explicitly clear it for inodes that should not have xattr support. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 20:10:42 -04:00
Andreas Gruenbacher	b6ba11773d	vfs: Move xattr_resolve_name to the front of fs/xattr.c Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-07 20:10:42 -04:00
Dan Williams	e476f94482	Merge branch 'for-4.9/dax' into libnvdimm-for-next	2016-10-07 16:46:30 -07:00
Linus Torvalds	d1f5323370	Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS splice updates from Al Viro: "There's a bunch of branches this cycle, both mine and from other folks and I'd rather send pull requests separately. This one is the conversion of ->splice_read() to ITER_PIPE iov_iter (and introduction of such). Gets rid of a lot of code in fs/splice.c and elsewhere; there will be followups, but these are for the next cycle... Some pipe/splice-related cleanups from Miklos in the same branch as well" * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: pipe: fix comment in pipe_buf_operations pipe: add pipe_buf_steal() helper pipe: add pipe_buf_confirm() helper pipe: add pipe_buf_release() helper pipe: add pipe_buf_get() helper relay: simplify relay_file_read() switch default_file_splice_read() to use of pipe-backed iov_iter switch generic_file_splice_read() to use of ->read_iter() new iov_iter flavour: pipe-backed fuse_dev_splice_read(): switch to add_to_pipe() skb_splice_bits(): get rid of callback new helper: add_to_pipe() splice: lift pipe_lock out of splice_to_pipe() splice: switch get_iovec_page_array() to iov_iter splice_to_pipe(): don't open-code wakeup_pipe_readers() consistent treatment of EFAULT on O_DIRECT read/write	2016-10-07 15:36:58 -07:00
Linus Torvalds	2eee010d09	Lots of bug fixes and cleanups. -----BEGIN PGP SIGNATURE----- iQEcBAABCAAGBQJX9pA6AAoJEPL5WVaVDYGj7fwH/0YcdQWBg0O5d7iXFnTcimh9 fiYkqKniBWQhgBAOFPMoNPRIW4tyeQmTtu8Rywx2Hr+v4lzJvuOaT18NDANdq/pp u5eDrnJ4R+uqPJlgxVOzopLVJ6I2glgSSRdvAKYxwTYcv8F88ObzVfsJ4M415gPq cbEKF+JT3l5hTGENR5sqmYvHYaNfOFkOqt4gulPtgk1eshy+BH/05M+qBSeA5a6k srdon0pFRoUV68m+T4G8FqOZxdybeT5Yx6X0GJf0eQJoX7IaiQTPcDrXzlrbDBbN rrzbpwsDeDKtgSOckbarCBroZKdToHFekfnOJ7IPWYq8IwYTSnZKFCWIRKO6z38= =IvhS -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of bug fixes and cleanups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: remove unused variable ext4: use journal inode to determine journal overhead ext4: create function to read journal inode ext4: unmap metadata when zeroing blocks ext4: remove plugging from ext4_file_write_iter() ext4: allow unlocked direct IO when pages are cached ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY fscrypto: use standard macros to compute length of fname ciphertext ext4: do not unnecessarily null-terminate encrypted symlink data ext4: release bh in make_indexed_dir ext4: Allow parallel DIO reads ext4: allow DAX writeback for hole punch jbd2: fix lockdep annotation in add_transaction_credits() blockgroup_lock.h: simplify definition of NR_BG_LOCKS blockgroup_lock.h: remove debris from bgl_lock_ptr() conversion fscrypto: make filename crypto functions return 0 on success fscrypto: rename completion callbacks to reflect usage fscrypto: remove unnecessary includes fscrypto: improved validation when loading inode encryption metadata ext4: fix memory leak when symlink decryption fails ...	2016-10-07 15:15:33 -07:00
Linus Torvalds	513a4befae	Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: "This is the main pull request for block layer changes in 4.9. As mentioned at the last merge window, I've changed things up and now do just one branch for core block layer changes, and driver changes. This avoids dependencies between the two branches. Outside of this main pull request, there are two topical branches coming as well. This pull request contains: - A set of fixes, and a conversion to blk-mq, of nbd. From Josef. - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd. Followup dependency fix from Geert. - General fixes from Bart, Baoyou, Guoqing, and Linus W. - CFQ async write starvation fix from Glauber. - Add support for delayed kick of the requeue list, from Mike. - Pull out the scalable bitmap code from blk-mq-tag.c and make it generally available under the name of sbitmap. Only blk-mq-tag uses it for now, but the blk-mq scheduling bits will use it as well. From Omar. - bdev thaw error propagation from Pierre. - Improve the blk polling statistics, and allow the user to clear them. From Stephen. - Set of minor cleanups from Christoph in block/blk-mq. - Set of cleanups and optimizations from me for block/blk-mq. - Various nvme/nvmet/nvmeof fixes from the various folks" * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits) fs/block_dev.c: return the right error in thaw_bdev() nvme: Pass pointers, not dma addresses, to nvme_get/set_features() nvme/scsi: Remove power management support nvmet: Make dsm number of ranges zero based nvmet: Use direct IO for writes admin-cmd: Added smart-log command support. 
nvme-fabrics: Add host_traddr options field to host infrastructure nvme-fabrics: revise host transport option descriptions nvme-fabrics: rework nvmf_get_address() for variable options nbd: use BLK_MQ_F_BLOCKING blkcg: Annotate blkg_hint correctly cfq: fix starvation of asynchronous writes blk-mq: add flag for drivers wanting blocking ->queue_rq() blk-mq: remove non-blocking pass in blk_mq_map_request blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue() block: export bio_free_pages to other modules lightnvm: propagate device_add() error code lightnvm: expose device geometry through sysfs lightnvm: control life of nvm_dev in driver blk-mq: register device instead of disk ...	2016-10-07 14:42:05 -07:00
Anna Schumaker	29ae7f9dc2	NFSD: Implement the COPY call I only implemented the sync version of this call, since it's the easiest. I can simply call vfs_copy_file_range() and have the vfs do the right thing for the filesystem being exported. Signed-off-by: Anna Schumaker <bjschuma@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-10-07 14:54:25 -04:00
J. Bruce Fields	42e616167a	nfsd: handle EUCLEAN Eric Sandeen reports that xfs can return this if filesystem corruption prevented completing the operation. Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-10-07 14:54:19 -04:00
J. Bruce Fields	ff30f08c32	nfsd: only WARN once on unmapped errors No need to spam the logs here. The only drawback is losing information if we ever encounter two different unmapped errors, but in practice we've rarely seen even one. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-10-07 14:53:33 -04:00
Andreas Gruenbacher	4b899da50d	ecryptfs: Switch to generic xattr handlers Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-06 22:17:38 -04:00
Andreas Gruenbacher	bba0bd31b1	sockfs: Get rid of getxattr iop If we allow pseudo-filesystems created with mount_pseudo to have xattr handlers, we can replace sockfs_getxattr with a sockfs_xattr_get handler to use the xattr handler name parsing. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-06 22:17:38 -04:00
Andreas Gruenbacher	e72a1a8b3a	kernfs: Switch to generic xattr handlers Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-06 22:17:38 -04:00
Andreas Gruenbacher	b8020eff7f	hfs: Switch to generic xattr handlers Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-06 22:17:38 -04:00
Andreas Gruenbacher	6966f842c0	jffs2: Remove jffs2_{get,set,remove}xattr macros When CONFIG_JFFS2_FS_XATTR is off, jffs2_xattr_handlers is defined as NULL. With sb->s_xattr == NULL, the generic_{get,set,remove}xattr functions produce the same result as setting the {get,set,remove}xattr inode operations to NULL, so there is no need for these macros. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-06 22:17:38 -04:00
Andreas Gruenbacher	5d18cbf16c	xattr: Remove unnecessary NULL attribute name check When NULL is passed to one of the xattr system calls as the attribute name, copying that name from user space already fails with -EFAULT; xattr_resolve_name is never called with a NULL attribute name. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-06 22:17:38 -04:00
David S. Miller	0d818c2889	RxRPC rewrite -----BEGIN PGP SIGNATURE----- iQIVAwUAV/YbX/Sw1s6N8H32AQLwDg//W0fGt3OSFrOpEQHtKUSCWO3m4RRJgn/m Xbaz8ZO6Z8qmdkM267yrLCAp5hx0E77WP46l7V3B9p9wX0vA+P2QO7K5Kis6sNaY aceCCAKHqvUSiZa8tQ2aGpbxxa8qICbjHjiCg0lFABiGDWGRnIBNW8qV5LyGKZkI 7b3i9MGBkGLdZxetcJd498j6Gck9cuqOZDnfqgb0Q5pAtsjVM3EZXXsHO1ZD5WHG GUieQgY9Tp0rlVKjlLdR94fW/acMZYs0c5RO1uzGAoUeBALnSUS5+bSRSlGp1KOM C7r5/dK4FvkZY+xuS5pLXoI8WpsA4EDpBINGdO6L03wTJ10zx5y5CdTTl7G6Y53R BpmY8SDFmWYqpJs+gZiWYIlbnBQ+b0Mu7p7rKeSJS/q0+YEVwJlz3UFo2k1O+J3A ovpxP5E6IvOjlKF21Zs1hOR2m/sfR42v/TfwpApImSeY2k2m8vzyfXBJP4ClAk29 PGYOOqMLYwzIjLwdapDxL3ccjKvOwYeClCs1t6bKva2XCrF1ybtBnAQDxFp6KzXi p/y/QkHnseSeYct8mElDopRekbwoqa9YPwXn7lagvQhNxqNGIR4HT82IeohI/Dqe GtQbjSPc3uebk5lRf535kTZixu+l5/yKQeuRTsfoIgsMjVlMdqS9dUAphzI4IXLp FE0q49uLTVI= =+Jr3 -----END PGP SIGNATURE----- Merge tag 'rxrpc-rewrite-20161004' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fixes This set of patches contains a bunch of fixes: (1) Fix an oops on incoming call to a local endpoint without a bound service. (2) Only ping for a lost reply in a client call (this is inapplicable to service calls). (3) Fix maybe uninitialised variable warnings in the ACK/ABORT sending function by splitting it. (4) Fix loss of PING RESPONSE ACKs due to them being subsumed by PING ACK generation. (5) OpenAFS improperly terminates calls it makes as a client under some circumstances by not fully hard-ACK'ing the last DATA packets. This is alleviated by a new call appearing on the same channel implicitly completing the previous call on that channel. Handle this implicit completion. (6) Properly handle expiry of service calls due to the aforementioned improper termination with no follow up call to implicitly complete it: (a) The call's background processor needs to be queued to complete the call, send an abort and notify the socket. 
(b) The call's background processor needs to notify the socket (or the kernel service) when it has completed the call. (c) A negative error code must thence be returned to the kernel service so that it knows the call died. (d) The AFS filesystem must detect the fatal error and end the call. (7) Must produce a DELAY ACK when the actual service operation takes a while to process and must cancel the ACK when the reply is ready. (8) Don't request an ACK on the last DATA packet of the Tx phase as this confuses OpenAFS. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-06 21:04:24 -04:00
Linus Torvalds	4c1fad64ef	In this round, we've investigated how f2fs deals with errors given by our fault injection facility. With this, we could fix several corner cases. And, in order to improve the performance, we set inline_dentry by default and enhance the existing discard issue flow. In addition, we added f2fs_migrate_page for better memory management. = Enhancement = - set inline_dentry by default - improve discard issue flow - add more fault injection cases in f2fs - allow block preallocation for encrypted files - introduce migrate_page callback function - avoid truncating the next direct node block at every checkpoint = Bug fixes = - set page flag correctly between write_begin and write_end - missing error handling cases detected by fault injection - preallocate blocks regarding to 4KB alignment correctly - dentry and filename handling of encryption - lost xattrs of directories -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJX9sMhAAoJEEAUqH6CSFDSFhQQAIQ99GkcaPmSACHg7JNa9zG1 wb6eeKIDee+Jr4vu7yQ++T3Ih4lesl2ZLABVaP+IcXlsYWI2VUvlChczuwVSDQMg ZiBIR2IwXVVY6Zpb0xuw8C/vmQAJjLZTBV33s+wgsYHaTDobYexVUjkCM+pekrzj HBXrk7zx8NHUh41yr/kVQl6FY8KPC6bTtBH23UUp6Vuy1zMZDR/VjL440IyT5Ded JRSBX0XSAC9He6n+kZ4S2kMc11kmqZYW7mE4SmiPDzAhGwUv4SmQ1871lK00EOUp 5EN1Lcy8M7kkl8en2zpZ002R/LDbzRTYjb1fjGJVR+s5Q3piGokxtwAMd0/a7k9v wwZm64Bm4NMHBEK6uc/DPWFUmnUySrboTvOCDRunNogPGTjMJwnzAQmTcB/Hdpr5 oAJQwyAq7ZzkMk3xt0ifeNqy+78uiwfpPEnZDoWqU6zxa+vIyqpFDD+8wEPBO9qo JLRocH0Yl7+ExJvi+2W9wMQq9DsxZWR+CwUc8pg68E+1oOEycJ3weAwg5XSVHoNr 59I2blZQU6P922sH2HVhp0n58xZfYrR7Z3NSsiSfKXeL4gN222dHHT1UfRUmY+A3 7EeuYm8EUecKV0fZimMcqCCrUXQpubT+qGZfI6NZhu3Qhno1Y8ApxqH8Ieypx7ol YD5prZs2qqVKO5LjLV5o =crpN -----END PGP SIGNATURE----- Merge tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've investigated how f2fs deals with errors given by our fault injection facility. With this, we could fix several corner cases. 
And, in order to improve the performance, we set inline_dentry by default and enhance the existing discard issue flow. In addition, we added f2fs_migrate_page for better memory management. Enhancements: - set inline_dentry by default - improve discard issue flow - add more fault injection cases in f2fs - allow block preallocation for encrypted files - introduce migrate_page callback function - avoid truncating the next direct node block at every checkpoint Bug fixes: - set page flag correctly between write_begin and write_end - missing error handling cases detected by fault injection - preallocate blocks regarding to 4KB alignment correctly - dentry and filename handling of encryption - lost xattrs of directories" * tag 'for-f2fs-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (69 commits) f2fs: introduce update_ckpt_flags to clean up f2fs: don't submit irrelevant page f2fs: fix to commit bio cache after flushing node pages f2fs: introduce get_checkpoint_version for cleanup f2fs: remove dead variable f2fs: remove redundant io plug f2fs: support checkpoint error injection f2fs: fix to recover old fault injection config in ->remount_fs f2fs: do fault injection initialization in default_options f2fs: remove redundant value definition f2fs: support configuring fault injection per superblock f2fs: adjust display format of segment bit f2fs: remove dirty inode pages in error path f2fs: do not unnecessarily null-terminate encrypted symlink data f2fs: handle errors during recover_orphan_inodes f2fs: avoid gc in cp_error case f2fs: should put_page for summary page f2fs: assign return value in f2fs_gc f2fs: add customized migrate_page callback f2fs: introduce cp_lock to protect updating of ckpt_flags ...	2016-10-06 15:30:40 -07:00
Linus Torvalds	0fb3ca447d	Fix bug in module unloading. Switch to always using spinlock over cmpxchg. Explicitly define pstore backend's supported modes. Remove bounce buffer from pmsg. Switch to using memcpy_to/fromio(). Error checking improvements. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Kees Cook <kees@outflux.net> iQIcBAABCgAGBQJX9XPtAAoJEIly9N/cbcAmRr8P/0NoEX3bzEYgQWVMmsvzlk4U /mJ7LUk1+TDL0DOdQ84O1Tr3k6MQ2wRyiGXHjxhQ+aC2ompvmuT+SHEARWlqUZZx bEKr3u6nJ5qz1KZ5KwaPOH2EPs2MDq2jh6VvYDFzDGpBYsueDTzRqWJo7VhO/kmq MyVCePtEY3m1q4dZtaVLfDMGUEAU8s8j+D5HM9lmoijmzQuKAz3BFRuakasBIYSf 4ILY0W1E57HAUWsi19jhnYMHOvJt2Gcog0wRUYo4CYmPTyNqud6I5WU6HXeY2F7v LtWbhaS2QcpJRAxDEzzKBBSZ4IS6TINYDBBOf/0NEVo2qj4PHyy3f14MCtSo2LDg 4hoeI0DUgnAmp+NFgp1mQQ25DhR8TZlunBuntGXdeugb5qgT65NYXGtQxnMp5QJd s3DsfGW/diKbKfLWQN7GVcHHM/GNe+XM1yl1Q3TyDgSLJVjgAB21r/kPE7AIQzTO vDTLcv1w+KLdhDIrHlZqz1IAPATidTA21A7h8JeUWrOSetOhpZ0uXUwBR5+IZhyN tG1Wt0ohZAqlhv9ERXYN1g3iRHCCJ26V0LYOKsf80wAAutT8iRO4iH0PKdEYKX+a U0TqeX4TIh+4Q3FgnR7efFACzPXrM1RG9qnc1o5OR/BiyXIzLPdrpYYCVpejzj9K x6AoYCxRl6qYLJgYUR/H =FRpQ -----END PGP SIGNATURE----- Merge tag 'pstore-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: - Fix bug in module unloading - Switch to always using spinlock over cmpxchg - Explicitly define pstore backend's supported modes - Remove bounce buffer from pmsg - Switch to using memcpy_to/fromio() - Error checking improvements * tag 'pstore-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: ramoops: move spin_lock_init after kmalloc error checking pstore/ram: Use memcpy_fromio() to save old buffer pstore/ram: Use memcpy_toio instead of memcpy pstore/pmsg: drop bounce buffer pstore/ram: Set pstore flags dynamically pstore: Split pstore fragile flags pstore/core: drop cmpxchg based updates pstore/ramoops: fixup driver removal	2016-10-06 15:16:16 -07:00
Linus Torvalds	3940ee36a0	orangefs: miscellaneous improvements and feature negotiation miscellaneous improvements - clean up debugfs globals - remove dead code in sysfs - reorganize duplicated sysfs attribute structs - consolidate sysfs show and store functions - remove duplicated sysfs_ops structures - describe organization of sysfs - make devreq_mutex static - g_orangefs_stats -> orangefs_stats for consistency - rename most remaining global variables feature negotiation enable Orangefs userspace and kernel module to negotiate mutually supported features. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABAgAGBQJX9Aa0AAoJEM9EDqnrzg2+/JIP/iBDvWIxWvqs1cywLQoWJhPx 1Lm0p1a7RQEFjYI1AJ3W5U2dr12Drxezgn/a1Yfn/5vX8d868gtcj4uv8hD6PY2Y wY69yidiA6GL1/vHOSyiTBofT7jeniCt44QbxS3fXNpSXEiGD2d1pJ4lSwg0Mkyp E+JcAnmp6rVUvQV0Kx+djBvaFBNQ1tT84UqLqdGBTpx4DqG+zGTw3tOgRPh4jAZt mDmtF8TKR9DhjzxnkeX66tfErxdGNZEHrNNeHSM/3ds1IMn09d1pxFkE+y5lWhd3 d3FJeONt6CJG+k7iPXGWScvvo83DoIfvjsDx3S4vJIvQxxRuKDwp3pR34BQYvbKO nSnaDBZ1okLaQEg0GYt6BlqWZHcEdKEiR870dBTDmzlGwIY33m4G2mx9uZibR6dt pcsel4e2q3Js1tZob0MXwtbrR7pl/4TPVpf/ZEiTTprX0egL2SMhxCNwk4DHDMyv JszdjdxC+SJgpsaBRYcdGEzb8Fd+FbDIVWxAei8uaQUUSK40j7kPoSdx4w17mHQl s7Mmp/12miO0/eGgKmI+cJjXhRzCxu8HG6ovzlBWLdfQKmPYtk+Hm38HXz2fz4P5 pWKHgwsFXHtAZ0pQ9VbOVmctCehbuAS3nef2rZsWfA3x65Z+O4GIwrUDtfMCsiXK OuDgcDysqhPMCKbmdSEw =kxaL -----END PGP SIGNATURE----- Merge tag 'for-linus-4.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Marshall: "Miscellaneous improvements: - clean up debugfs globals - remove dead code in sysfs - reorganize duplicated sysfs attribute structs - consolidate sysfs show and store functions - remove duplicated sysfs_ops structures - describe organization of sysfs - make devreq_mutex static - g_orangefs_stats -> orangefs_stats for consistency - rename most remaining global variables Feature negotiation: - enable Orangefs userspace and kernel module to negotiate mutually 
supported features" * tag 'for-linus-4.9-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: Revert "orangefs: bump minimum userspace version" orangefs: bump minimum userspace version orangefs: rename most remaining global variables orangefs: g_orangefs_stats -> orangefs_stats for consistency orangefs: make devreq_mutex static orangefs: describe organization of sysfs orangefs: remove duplicated sysfs_ops structures orangefs: consolidate sysfs show and store functions orangefs: reorganize duplicated sysfs attribute structs orangefs: remove dead code in sysfs orangefs: clean up debugfs globals orangefs: do not allow client readahead cache without feature bit orangefs: add features op orangefs: record userspace version for feature compatbility orangefs: add readahead count and size to sysfs orangefs: re-add flush_racache from out-of-tree orangefs: turn param response value into union orangefs: add missing param request ops orangefs: rename remaining bits of mmap readahead cache	2016-10-06 13:33:35 -07:00
Linus Torvalds	14986a34e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This set of changes is a number of smaller things that have been overlooked in other development cycles focused on more fundamental change. The devpts changes are small things that were a distraction until we managed to kill off DEVPTS_MULTIPLE_INSTANCES. There is a trivial regression fix to autofs for the unprivileged mount changes that went in last cycle. A pair of ioctls has been added by Andrey Vagin making it possible to discover the relationships between namespaces when referring to them through file descriptors. The big user visible change is starting to add simple resource limits to catch programs that misbehave. With namespaces in general and user namespaces in particular allowing users to use more kinds of resources, it has become important to have something to limit errant programs. Because the purpose of these limits is to catch errant programs the code needs to be inexpensive to use as it is always on, and the default limits need to be high enough that well behaved programs on well behaved systems don't encounter them. To this end, after some review I have implemented per user per user namespace limits, and use them to limit the number of namespaces. The limits being per user mean that one user can not exhaust the limits of another user. The limits being per user namespace allow contexts where the limit is 0 and security conscious folks can remove from their threat analysis the code used to manage namespaces (as they have historically done as it is root only). At the same time the limits being per user namespace allow other parts of the system to use namespaces. Namespaces are increasingly being used in application sand boxing scenarios so an all or nothing disable for the entire system for the security conscious folks makes increasing use of these sandboxes impossible. 
A limit has also been added on the maximum number of mounts present in a single mount namespace. It is nontrivial to guess what a reasonable system wide limit on the number of mount structures in the kernel would be, especially as it varies based on how a system is using containers. A limit on the number of mounts in a mount namespace however is much easier to understand and set. In most cases in practice only about 1000 mounts are used. Given that some autofs scenarios have the potential to be 30,000 to 50,000 mounts I have set the default limit for the number of mounts at 100,000 which is well above every known set of users but low enough that the mount hash tables don't degrade unreasonably. These limits are a start. I expect this establishes a pattern that other limits for resources that namespaces use will follow. There has been interest in making inotify event limits per user per user namespace as well as interest expressed in making details about what is going on in the kernel more visible" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits) autofs: Fix automounts by using current_real_cred()->uid mnt: Add a per mount namespace limit on the number of mounts netns: move {inc,dec}_net_namespaces into #ifdef nsfs: Simplify __ns_get_path tools/testing: add a test to check nsfs ioctl-s nsfs: add ioctl to get a parent namespace nsfs: add ioctl to get an owning user namespace for ns file descriptor kernel: add a helper to get an owning user namespace for a namespace devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts devpts: Remove sync_filesystems devpts: Make devpts_kill_sb safe if fsi is NULL devpts: Simplify devpts_mount by using mount_nodev devpts: Move the creation of /dev/pts/ptmx into fill_super devpts: Move parse_mount_options into fill_super userns: When the per user per user namespace limit is reached return ENOSPC userns; Document per user per user namespace limits. 
mntns: Add a limit on the number of mount namespaces. netns: Add a limit on the number of net namespaces cgroupns: Add a limit on the number of cgroup namespaces ipcns: Add a limit on the number of ipc namespaces ...	2016-10-06 09:52:23 -07:00
Linus Torvalds	8d37059581	xfs: updates for 4.9-rc1 Included in this update: - change of XFS mailing list to linux-xfs@vger.kernel.org - iomap-based DAX infrastructure w/ XFS and ext2 support - small iomap fixes and additions - more efficient XFS delayed allocation infrastructure based on iomap - a rework of log recovery writeback scheduling to ensure we don't fail recovery when trying to replay items that are already on disk - some preparation patches for upcoming reflink support - configurable error handling fixes and documentation - aio access time update race fixes for XFS and generic_file_read_iter -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJX9WvjAAoJEK3oKUf0dfodrl8P/R1cS8tEHnrmNlKeENNWFTlN q8HEfP3tX43QLHXpeHd9F9qXs5/esrOFfWYFjeoAaB1cWiRXDJsUNOEH3PuQf0Go NKHgrL8GiU6XY9keZI6KJYphr2a5//qWJywxOeBuJh3446MDSYwOmI3eEIY8ac3/ k0e8bMnLhfryWOvyZE6v2w75lMi+SL1LH/W6OSJqGFKS3N+GqdqRKkMfYGQToHkM ZgIX1vDSq4xgJzkR1Q+AACCaSTGE2wEG/bnqZ1R3l19/bERB17LaOyEegBDXbrTT vI31EQnrN92O/Q2eYJlap8nFIm4lVaCFTU1R7KEVEXvUBRXXfxllu1sOSBpn1PSQ OrC5bbcCodcG8b1SlwRrcstqc42weojqwyl65eJxOa17valghaYEcLkqEZrrrssv Y+C0okfL3UB2JAxG4O1nFQ3py1cYlkYURf6CuhxNQfktXZxSpAMTLy9wYCRylBiO Eu6Say4zfnfKiVaSg0xlMhIaAyugVH+uVro62hZYxCU2mJ/biZHeQAUC6Krl6NsY NsAk0T7eUgMd7lLW+C9/rL2AQaXYwR72cl/1jAWBE2piBM2Gu1lcGHGwWHvOcYjO K2Yg4RMnR9TDbUX2jl1r4bZoQD3IZ3HpUjgVInmbTPtKY4q89kfC40haSpBQykm7 QzGLPvFz2sMrkmKPLbV2 =R9uL -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs Pull xfs and iomap updates from Dave Chinner: "The main things in this update are the iomap-based DAX infrastructure, an XFS delalloc rework, and a chunk of fixes to how log recovery schedules writeback to prevent spurious corruption detections when recovery of certain items was not required. The other main chunk of code is some preparation for the upcoming reflink functionality. 
Most of it is generic and cleanups that stand alone, but they were ready and reviewed so are in this pull request. Speaking of reflink, I'm currently planning to send you another pull request next week containing all the new reflink functionality. I'm working through a similar process to the last cycle, where I sent the reverse mapping code in a separate request because of how large it was. The reflink code merge is even bigger than reverse mapping, so I'll be doing the same thing again.... Summary for this update: - change of XFS mailing list to linux-xfs@vger.kernel.org - iomap-based DAX infrastructure w/ XFS and ext2 support - small iomap fixes and additions - more efficient XFS delayed allocation infrastructure based on iomap - a rework of log recovery writeback scheduling to ensure we don't fail recovery when trying to replay items that are already on disk - some preparation patches for upcoming reflink support - configurable error handling fixes and documentation - aio access time update race fixes for XFS and generic_file_read_iter" * tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits) fs: update atime before I/O in generic_file_read_iter xfs: update atime before I/O in xfs_file_dio_aio_read ext2: fix possible integer truncation in ext2_iomap_begin xfs: log recovery tracepoints to track current lsn and buffer submission xfs: update metadata LSN in buffers during log recovery xfs: don't warn on buffers not being recovered due to LSN xfs: pass current lsn to log recovery buffer validation xfs: rework log recovery to submit buffers on LSN boundaries xfs: quiesce the filesystem after recovery on readonly mount xfs: remote attribute blocks aren't really userdata ext2: use iomap to implement DAX ext2: stop passing buffer_head to ext2_get_blocks xfs: use iomap to implement DAX xfs: refactor xfs_setfilesize xfs: take the ilock shared if possible in xfs_file_iomap_begin xfs: fix locking for DAX writes dax: provide 
an iomap based fault handler dax: provide an iomap based dax read/write path dax: don't pass buffer_head to copy_user_dax dax: don't pass buffer_head to dax_insert_mapping ...	2016-10-06 08:18:10 -07:00
Linus Torvalds	82fa407da0	Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm Pull ARM updates from Russell King: - Correct ARM's dma-mapping to use the correct printk format strings. - Avoid defining OBJCOPYFLAGS globally which upsets lkdtm rodata testing. - Cleanups to ARM's asm/memory.h include. - L2 cache cleanups. - Allow flat nommu binaries to be executed on ARM MMU systems. - Kernel hardening - add more read-only after init annotations, including making some kernel vdso variables const. - Ensure AMBA primecell clocks are appropriately defaulted. - ARM breakpoint cleanup. - Various StrongARM 11x0 and companion chip (SA1111) updates to bring this legacy platform to use more modern APIs for (eg) GPIOs and interrupts, which will allow us in the future to reduce some of the board-level driver clutter and eliminate function callbacks into board code via platform data. There still appears to be interest in these platforms! - Remove the now redundant secure_flush_area() API. - Module PLT relocation optimisations. Ard says: This series of 4 patches optimizes the ARM PLT generation code that is invoked at module load time, to get rid of the O(n^2) algorithm that results in pathological load times of 10 seconds or more for large modules on certain STB platforms. - ARMv7M cache maintenance support. 
- L2 cache PMU support * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: (35 commits) ARM: sa1111: provide to_sa1111_device() macro ARM: sa1111: add sa1111_get_irq() ARM: sa1111: clean up duplication in IRQ chip implementation ARM: sa1111: implement a gpio_chip for SA1111 GPIOs ARM: sa1111: move irq cleanup to separate function ARM: sa1111: use devm_clk_get() ARM: sa1111: use devm_kzalloc() ARM: sa1111: ensure we only touch RAB bus type devices when removing ARM: 8611/1: l2x0: add PMU support ARM: 8610/1: V7M: Add dsb before jumping in handler mode ARM: 8609/1: V7M: Add support for the Cortex-M7 processor ARM: 8608/1: V7M: Indirect proc_info construction for V7M CPUs ARM: 8607/1: V7M: Wire up caches for V7M processors with cache support. ARM: 8606/1: V7M: introduce cache operations ARM: 8605/1: V7M: fix notrace variant of save_and_disable_irqs ARM: 8604/1: V7M: Add support for reading the CTR with read_cpuid_cachetype() ARM: 8603/1: V7M: Add addresses for mem-mapped V7M cache operations ARM: 8602/1: factor out CSSELR/CCSIDR operations that use cp15 directly ARM: kernel: avoid brute force search on PLT generation ARM: kernel: sort relocation sections before allocating PLTs ...	2016-10-06 07:59:37 -07:00
NeilBrown	09bb8bfffd	exportfs: be careful to only return expected errors. When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors. In particular it can be tempting to return ENOENT, but this is not handled well by nfsd. Rather than requiring strict adherence to error codes by filesystems, treat all unexpected error codes the same as ESTALE. This is safest. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-10-06 09:07:44 -04:00
Russell King	301a36fa70	Merge branches 'misc' and 'sa1111-base' into for-linus	2016-10-06 08:56:43 +01:00
David Howells	9008f998a2	afs: Check for fatal error when in waiting for ack state When it's in the waiting-for-ACK state, the AFS filesystem needs to check the result of rxrpc_kernel_recv_data() any time it is notified to see if it is indicating a fatal error. If this is the case, it needs to mark the call completed otherwise the call just sits there and never goes away. Signed-off-by: David Howells <dhowells@redhat.com>	2016-10-06 08:11:50 +01:00
Darrick J. Wong	1f08af52e7	xfs: implement swapext for rmap filesystems Implement swapext for filesystems that have reverse mapping. Back in the reflink patches, we augmented the bmap code with a 'REMAP' flag that updates only the bmbt and doesn't touch the allocator and implemented log redo items for those two operations. Now we can rewrite extent swapping as a (looong) series of remap operations. This is far less efficient than the fork swapping method implemented in the past, so we only switch this on for rmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:32 -07:00
Darrick J. Wong	39aff5fdb9	xfs: refactor swapext code Refactor the swapext function to pull out the fork swapping piece into a separate function. In the next patch we'll add in the bit we need to make it work with rmap filesystems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:32 -07:00
Darrick J. Wong	e06259aa08	xfs: various swapext cleanups Replace structure typedefs with struct expressions and fix some whitespace issues that result. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:32 -07:00
Darrick J. Wong	e54b5bf9d7	xfs: recognize the reflink feature bit Add the reflink feature flag to the set of recognized feature flags. This enables users to write to reflink filesystems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:31 -07:00
Darrick J. Wong	a35eb41519	xfs: simulate per-AG reservations being critically low Create an error injection point that enables us to simulate being critically low on per-AG block reservations. This should enable us to simulate this specific ENOSPC condition so that we can test falling back to a regular file copy. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:31 -07:00
Darrick J. Wong	4f435ebe7d	xfs: don't mix reflink and DAX mode for now Since we don't have a strategy for handling both DAX and reflink, for now we'll just prohibit both being set at the same time. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:31 -07:00
Darrick J. Wong	c8e156ac33	xfs: check for invalid inode reflink flags We don't support sharing blocks on the realtime device. Flag inodes with the reflink or cowextsize flags set when the reflink feature is disabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:31 -07:00
Darrick J. Wong	e153aa7990	xfs: set a default CoW extent size of 32 blocks If the admin doesn't set a CoW extent size or a regular extent size hint, default to creating CoW reservations 32 blocks long to reduce fragmentation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:31 -07:00
Darrick J. Wong	3f165b334e	xfs: convert unwritten status of reverse mappings for shared files Provide a function to convert an unwritten extent to a real one and vice versa when shared extents are possible. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:29 -07:00
Darrick J. Wong	ceeb9c832e	xfs: use interval query for rmap alloc operations on shared files When it's possible for reverse mappings to overlap (data fork extents of files on reflink filesystems), use the interval query function to find the left neighbor of an extent we're trying to add; and be careful to use the lookup functions to update the neighbors and/or add new extents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:29 -07:00
Darrick J. Wong	0e07c039ba	xfs: add shared rmap map/unmap/convert log item types Wire up some rmap log redo item type codes to map, unmap, or convert shared data block extents. The actual log item recovery comes in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:29 -07:00
Darrick J. Wong	80de462e09	xfs: increase log reservations for reflink Increase the log reservations to handle the increased rolling that happens at the end of copy-on-write operations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:29 -07:00
Darrick J. Wong	83104d449e	xfs: garbage collect old cowextsz reservations Trim CoW reservations made on behalf of a cowextsz hint if they get too old or we run low on quota, so long as we don't have dirty data awaiting writeback or directio operations in progress. Garbage collection of the cowextsize extents is kept separate from prealloc extent reaping because setting the CoW prealloc lifetime to a (much) higher value than the regular prealloc extent lifetime has been useful for combatting CoW fragmentation on VM hosts where the VMs experience bursty write behaviors and we can keep the utilization ratios low enough that we don't start to run out of space. IOWs, it benefits us to keep the CoW fork reservations around for as long as we can unless we run out of blocks or hit inode reclaim. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:28 -07:00
Darrick J. Wong	90e2056d76	xfs: try other AGs to allocate a BMBT block Prior to the introduction of reflink, allocating a block and mapping it into a file was performed in a single transaction with a single block reservation, and the allocator was supposed to find enough blocks to allocate the extent and any BMBT blocks that might be necessary (unless we're low on space). However, due to the way copy on write works, allocation and mapping have been split into two transactions, which means that we must be able to handle the case where we allocate an extent for CoW but that AG runs out of free space before the blocks can be mapped into a file, and the mapping requires a new BMBT block. When this happens, look in one of the other AGs for a BMBT block instead of taking the FS down. The same applies to the functions that convert a data fork to extents and later btree format. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:28 -07:00
Darrick J. Wong	6fa164b865	xfs: don't allow reflink when the AG is low on space If the AG free space is down to the reserves, refuse to reflink our way out of space. Hopefully userspace will make a real copy and/or go elsewhere. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:27 -07:00
Darrick J. Wong	84d6961910	xfs: preallocate blocks for worst-case btree expansion To gracefully handle the situation where a CoW operation turns a single refcount extent into a lot of tiny ones and then runs out of space when a tree split has to happen, use the per-AG reserved block pool to pre-allocate all the space we'll ever need for a maximal btree. For a 4K block size, this only costs an overhead of 0.3% of available disk space. When reflink is enabled, we have an unfortunate problem with rmap -- since we can share a block billions of times, this means that the reverse mapping btree can expand basically infinitely. When an AG is so full that there are no free blocks with which to expand the rmapbt, the filesystem will shut down hard. This is rather annoying to the user, so use the AG reservation code to reserve a "reasonable" amount of space for rmap. We'll prevent reflinks and CoW operations if we think we're getting close to exhausting an AG's free space rather than shutting down, but this permanent reservation should be enough for "most" users. Hopefully. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch@lst.de: ensure that we invalidate the freed btree buffer] Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:27 -07:00
Darrick J. Wong	f7ca352272	xfs: create a separate cow extent size hint for the allocator Create a per-inode extent size allocator hint for copy-on-write. This hint is separate from the existing extent size hint so that CoW can take advantage of the fragmentation-reducing properties of extent size hints without disabling delalloc for regular writes. The extent size hint that's fed to the allocator during a copy on write operation is the greater of the cowextsize and regular extsize hint. During reflink, if we're sharing the entire source file to the entire destination file and the destination file doesn't already have a cowextsize hint, propagate the source file's cowextsize hint to the destination file. Furthermore, zero the bulkstat buffer prior to setting the fields so that we don't copy kernel memory contents into userspace. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:26 -07:00
Darrick J. Wong	98cc2db5b8	xfs: unshare a range of blocks via fallocate Unshare all shared extents if the user calls fallocate with the new unshare mode flag set, so that we can guarantee that a subsequent write will not ENOSPC. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: pass inode instead of file to xfs_reflink_dirty_range, use iomap infrastructure for copy up] Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:26 -07:00
Darrick J. Wong	f0bc4d134b	xfs: swap inode reflink flags when swapping inode extents When we're swapping the extents of two inodes, be sure to swap the reflink inode flag too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:26 -07:00
Darrick J. Wong	f86f403794	xfs: teach get_bmapx about shared extents and the CoW fork Teach xfs_getbmapx how to report shared extents and CoW fork contents accurately in the bmap output by querying the refcount btree appropriately. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:26 -07:00
Darrick J. Wong	cc714660bb	xfs: add dedupe range vfs function Define a VFS function which allows userspace to request that the kernel reflink a range of blocks between two files if the ranges' contents match. The function fits the new VFS ioctl that standardizes the checking for the btrfs EXTENT SAME ioctl. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:26 -07:00
Darrick J. Wong	9fe26045e9	xfs: add clone file and clone range vfs functions Define two VFS functions which allow userspace to reflink a range of blocks between two files or to reflink one file's contents to another. These functions fit the new VFS ioctls that standardize the checking for the btrfs CLONE and CLONE RANGE ioctls. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:25 -07:00
Darrick J. Wong	862bb360ef	xfs: reflink extents from one file to another Reflink extents from one file to another; that is to say, iteratively remove the mappings from the destination file, copy the mappings from the source file to the destination file, and increment the reference count of all the blocks that got remapped. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:05 -07:00
Darrick J. Wong	174edb0e46	xfs: store in-progress CoW allocations in the refcount btree Due to the way the CoW algorithm in XFS works, there's an interval during which blocks allocated to handle a CoW can be lost -- if the FS goes down after the blocks are allocated but before the block remapping takes place. This is exacerbated by the cowextsz hint -- allocated reservations can sit around for a while, waiting to get used. Since the refcount btree doesn't normally store records with refcount of 1, we can use it to record these in-progress extents. In-progress blocks cannot be shared because they're not user-visible, so there shouldn't be any conflicts with other programs. This is a better solution than holding EFIs during writeback because (a) EFIs can't be relogged currently, (b) even if they could, EFIs are bound by available log space, which puts an unnecessary upper bound on how much CoW we can have in flight, and (c) we already have a mechanism to track blocks. At mount time, read the refcount records and free anything we find with a refcount of 1 because those were in-progress when the FS went down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:05 -07:00
Darrick J. Wong	5e7e605c4d	xfs: cancel pending CoW reservations when destroying inodes When destroying the inode, cancel all pending reservations in the CoW fork so that all the reserved blocks go back to the free pile. In theory this sort of cleanup is only needed to clean up after write errors. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:05 -07:00
Darrick J. Wong	aa8968f227	xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks When we're freeing blocks (truncate, punch, etc.), clear all CoW reservations in the range being freed. If the file block count drops to zero, also clear the inode reflink flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:04 -07:00
Darrick J. Wong	0613f16cd2	xfs: implement CoW for directio writes For O_DIRECT writes to shared blocks, we have to CoW them just like we would with buffered writes. For writes that are not block-aligned, just bounce them to the page cache. For block-aligned writes, however, we can do better than that. Use the same mechanisms that we employ for buffered CoW to set up a delalloc reservation, allocate all the blocks at once, issue the writes against the new blocks and use the same ioend functions to remap the blocks after the write. This should be fairly performant. Christoph discovered that xfs_reflink_allocate_cow_range may stumble over invalid entries in the extent array given that it drops the ilock but still expects the index to be stable. Simply fixing it to a new lookup for every iteration still isn't correct given that xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and there is nothing preventing an xfs_bunmapi_cow call from removing extents once we dropped the ilock either. This patch duplicates the inner loop of xfs_bmapi_allocate into a helper for xfs_reflink_allocate_cow_range so that it can be done under the same ilock critical section as our CoW fork delayed allocation. The directio CoW warts will be revisited in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-05 16:26:04 -07:00
Darrick J. Wong	db1327b16c	xfs: report shared extent mappings to userspace correctly Report shared extents through the iomap interface so that FIEMAP flags shared blocks accurately. Have xfs_vm_bmap return zero for reflinked files because the bmap-based swap code requires static block mappings, which is incompatible with copy on write. NOTE: Existing userspace bmap users such as lilo will have the same problem with reflink files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2016-10-05 16:26:04 -07:00
Al Viro	c531716785	proc: switch auxv to use of __mem_open() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:43:43 -04:00
Mikulas Patocka	91fff9b347	hpfs: support FIEMAP Support the FIEMAP ioctl that reports extents allocated by a file. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:31:58 -04:00
Miklos Szeredi	ca76f5b6bd	pipe: add pipe_buf_steal() helper Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:23:59 -04:00
Miklos Szeredi	fba597db42	pipe: add pipe_buf_confirm() helper Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:23:59 -04:00
Miklos Szeredi	a779638cf6	pipe: add pipe_buf_release() helper Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:23:58 -04:00
Miklos Szeredi	7bf2d1df80	pipe: add pipe_buf_get() helper Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:23:57 -04:00
Al Viro	523ac9afc7	switch default_file_splice_read() to use of pipe-backed iov_iter we only use iov_iter_get_pages_alloc() and iov_iter_advance() - pages are filled by kernel_readv() via a kvec array (as we used to do all along), so iov_iter here is used only as a way of arranging for those pages to be in pipe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:23:56 -04:00
Al Viro	82c156f853	switch generic_file_splice_read() to use of ->read_iter() ... and kill the ->splice_read() instances that can be switched to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:23:56 -04:00
Al Viro	241699cd72	new iov_iter flavour: pipe-backed iov_iter variant for passing data into pipe. copy_to_iter() copies data into page(s) it has allocated and stuffs them into the pipe; copy_page_to_iter() stuffs there a reference to the page given to it. Both will try to coalesce if possible. iov_iter_zero() is similar to copy_to_iter(); iov_iter_get_pages() and friends will do as copy_to_iter() would have and return the pages where the data would've been copied. iov_iter_advance() will truncate everything past the spot it has advanced to. New primitive: iov_iter_pipe(), used for initializing those. pipe should be locked all along. Running out of space acts as fault would for iovec-backed ones; in other words, giving it to ->read_iter() may result in short read if the pipe overflows, or -EFAULT if it happens with nothing copied there. In other words, ->read_iter() on those acts pretty much like ->splice_read(). Moreover, all generic_file_splice_read() users, as well as many other ->splice_read() instances can be switched to that scheme - that'll happen in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-05 18:23:36 -04:00
Darrick J. Wong	43caeb187d	xfs: move mappings from cow fork to data fork after copy-write After the write component of a copy-write operation finishes, clean up the bookkeeping left behind. On error, we simply free the new blocks and pass the error up. If we succeed, however, then we must remove the old data fork mapping and move the cow fork mapping to the data fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: Call the CoW failure function during xfs_cancel_ioend] Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-05 13:55:40 -07:00
Darrick J. Wong	4862cfe825	xfs: support removing extents from CoW fork Create a helper method to remove extents from the CoW fork without any of the side effects (rmapbt/bmbt updates) of the regular extent deletion routine. We'll eventually use this to clear out the CoW fork during ioend processing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-05 13:55:40 -07:00
Pierre Morel	997198ba1e	fs/block_dev.c: return the right error in thaw_bdev() When triggering thaw-filesystems via magic sysrq, the system enters a loop in do_thaw_one(), as thaw_bdev() still returns success if bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return error (and simplify the code a bit at the same time). Reviewed-by: Eric Farman <farman@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-10-05 14:35:13 -06:00
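The thaw_bdev() fix above can be illustrated with a small userspace model. This is only a sketch: the `BlockDev` class, the `thaw()` and `thaw_all()` helpers, and the choice of `-EINVAL` as the error value are assumptions made for illustration, not the kernel's code. It shows why a caller that loops "while thaw succeeds" only terminates if thawing a device with a zero freeze count reports an error.

```python
import errno


class BlockDev:
    """Hypothetical stand-in for a block device with a freeze counter."""

    def __init__(self, freeze_count: int = 0) -> None:
        self.freeze_count = freeze_count


def thaw(bdev: BlockDev) -> int:
    """Drop one freeze reference; report an error if nothing is frozen.

    Returning an error (assumed here to be -EINVAL) for a zero count is
    what lets the caller's loop below terminate.
    """
    if bdev.freeze_count == 0:
        return -errno.EINVAL
    bdev.freeze_count -= 1
    return 0


def thaw_all(bdev: BlockDev) -> int:
    """Keep thawing until thaw() fails; return how many thaws succeeded."""
    thawed = 0
    while thaw(bdev) == 0:
        thawed += 1
    return thawed
```

If `thaw()` returned 0 for an unfrozen device, `thaw_all()` would never exit, which mirrors the loop described in the commit message.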
Linus Torvalds	edadd0e5a7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This adds POSIX ACL permission checking to the fuse kernel module. In addition there are minor bug fixes as well as cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: limit xattr returned size fuse: remove duplicate cs->offset assignment fuse: don't use fuse_ioctl_copy_user() helper fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() fuse: get rid of fc->flags fuse: use timespec64 fuse: don't use ->d_time fuse: Add posix ACL support fuse: handle killpriv in userspace fs fuse: fix killing s[ug]id in setattr fuse: invalidate dir dentry after chmod fuse: Use generic xattr ops fuse: listxattr: verify xattr list	2016-10-05 10:58:15 -07:00
Linus Torvalds	3fb75cb80d	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc filesystem and quota fixes from Jan Kara: "Some smaller udf, ext2, quota & reiserfs fixes" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Unmap metadata when zeroing blocks udf: don't bother with full-page write optimisations in adinicb case reiserfs: Unlock superblock before calling reiserfs_quota_on_mount() udf: Remove useless check in udf_adinicb_write_begin() quota: fill in Q_XGETQSTAT inode information for inactive quotas ext2: Check return value from ext2_get_group_desc()	2016-10-05 10:53:03 -07:00
Linus Torvalds	687ee0ad4e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and co. at Google. https://lwn.net/Articles/701165/ 2) Do TCP Small Queues for retransmits, from Eric Dumazet. 3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei Starovoitov. 4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai. 5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn. 6) Support GMAC protocol in iwlwifi mvm, from Ayala Beker. 7) Support ndo_poll_controller in mlx5, from Calvin Owens. 8) Move VRF processing to an output hook and allow l3mdev to be loopback, from David Ahern. 9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern. 10) Congestion control in RXRPC, from David Howells. 11) Support geneve RX offload in ixgbe, from Emil Tantilov. 12) When hitting pressure for new incoming TCP data SKBs, perform a partial rather than a full purge of the OFO queue (which could be huge). From Eric Dumazet. 13) Convert XFRM state and policy lookups to RCU, from Florian Westphal. 14) Support RX network flow classification to igb, from Gangfeng Huang. 15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski. 16) New skbmod packet action, from Jamal Hadi Salim. 17) Remove some inefficiencies in snmp proc output, from Jia He. 18) Add FIB notifications to properly propagate route changes to hardware which is doing forwarding offloading. From Jiri Pirko. 19) New dsa driver for qca8xxx chips, from John Crispin. 20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej Żenczykowski. 21) Add L3 mode to ipvlan, from Mahesh Bandewar. 22) Support 802.1ad in mlx4, from Moshe Shemesh. 23) Support hardware LRO in mediatek driver, from Nelson Chang. 24) Add TC offloading to mlx5, from Or Gerlitz. 25) Convert various drivers to ethtool ksettings interfaces, from Philippe Reynes. 
26) TX max rate limiting for cxgb4, from Rahul Lakkireddy. 27) NAPI support for ath10k, from Rajkumar Manoharan. 28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed. 29) UDP replicast support in TIPC, from Richard Alpe. 30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru. 31) Support BQL in thunderx driver, from Sunil Goutham. 32) TSO support in alx driver, from Tobias Regnery. 33) Add stream parser engine and use it in kcm. 34) Support async DHCP replies in ipconfig module, from Uwe Kleine-König. 35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits) mlxsw: switchx2: Fix misuse of hard_header_len mlxsw: spectrum: Fix misuse of hard_header_len net/faraday: Stop NCSI device on shutdown net/ncsi: Introduce ncsi_stop_dev() net/ncsi: Rework the channel monitoring net/ncsi: Allow to extend NCSI request properties net/ncsi: Rework request index allocation net/ncsi: Don't probe on the reserved channel ID (0x1f) net/ncsi: Introduce NCSI_RESERVED_CHANNEL net/ncsi: Avoid unused-value build warning from ia64-linux-gcc net: Add netdev all_adj_list refcnt propagation to fix panic net: phy: Add Edge-rate driver for Microsemi PHYs. vmxnet3: Wake queue from reset work i40e: avoid NULL pointer dereference and recursive errors on early PCI error qed: Add RoCE ll2 & GSI support qed: Add support for memory registeration verbs qed: Add support for QP verbs qed: PD,PKEY and CQ verb support qed: Add support for RoCE hw init qede: Add qedr framework ...	2016-10-05 10:11:24 -07:00
Darrick J. Wong	ef4736678f	xfs: allocate delayed extents in CoW fork Modify the writepage handler to find and convert pending delalloc extents to real allocations. Furthermore, when we're doing non-cow writes to a part of a file that already has a CoW reservation (the cowextsz hint that we set up in a subsequent patch facilitates this), promote the write to copy-on-write so that the entire extent can get written out as a single extent on disk, thereby reducing post-CoW fragmentation. Christoph moved the CoW support code in _map_blocks to a separate helper function, refactored other functions, and reduced the number of CoW fork lookups, so I merged those changes here to reduce churn. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:41 -07:00
Darrick J. Wong	60b4984fc3	xfs: support allocating delayed extents in CoW fork Modify xfs_bmap_add_extent_delay_real() so that we can convert delayed allocation extents in the CoW fork to real allocations, and wire this up all the way back to xfs_iomap_write_allocate(). In a subsequent patch, we'll modify the writepage handler to call this. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:41 -07:00
Darrick J. Wong	2a06705cd5	xfs: create delalloc extents in CoW fork Wire up iomap_begin to detect shared extents and create delayed allocation extents in the CoW fork: 1) Check if we already have an extent in the COW fork for the area. If so nothing to do, we can move along. 2) Look up block number for the current extent, and if there is none it's not shared move along. 3) Unshare the current extent as far as we are going to write into it. For this we avoid an additional COW fork lookup and use the information we set aside in step 1) above. 4) Goto 1) unless we've covered the whole range. Last but not least, this updates the xfs_reflink_reserve_cow_range calling convention to pass a byte offset and length, as that is what both callers expect anyway. This patch has been refactored considerably as part of the iomap transition. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:40 -07:00
Darrick J. Wong	be51f8119c	xfs: support bmapping delalloc extents in the CoW fork Allow the creation of delayed allocation extents in the CoW fork. In a subsequent patch we'll wire up iomap_begin to actually do this via reflink helper functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:40 -07:00
Darrick J. Wong	3993baeb3c	xfs: introduce the CoW fork Introduce a new in-core fork for storing copy-on-write delalloc reservations and allocated extents that are in the process of being written out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:40 -07:00
Darrick J. Wong	11715a21bc	xfs: don't allow reflinked dir/dev/fifo/socket/pipe files Only non-rt files can be reflinked, so check that when we load an inode. Also, don't leak the attr fork if there's a failure. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:40 -07:00
Darrick J. Wong	f0ec1b8ef1	xfs: add reflink feature flag to geometry Report the reflink feature in the XFS geometry so that xfs_info and friends know the filesystem has this feature. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:40 -07:00
Darrick J. Wong	53aa1c34f4	xfs: define tracepoints for reflink activities Define all the tracepoints we need to inspect the runtime operation of reflink/dedupe/copy-on-write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:39 -07:00
Darrick J. Wong	4453593be6	xfs: return work remaining at the end of a bunmapi operation Return the range of file blocks that bunmapi didn't free. This hint is used by CoW and reflink to figure out what part of an extent actually got freed so that it can set up the appropriate atomic remapping of just the freed range. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 18:06:39 -07:00
Linus Torvalds	a3443cda55	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: SELinux/LSM: - overlayfs support, necessary for container filesystems LSM: - finally remove the kernel_module_from_file hook Smack: - treat signal delivery as an 'append' operation TPM: - lots of bugfixes & updates Audit: - new audit data type: LSM_AUDIT_DATA_FILE * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (47 commits) Revert "tpm/tpm_crb: implement tpm crb idle state" Revert "tmp/tpm_crb: fix Intel PTT hw bug during idle state" Revert "tpm/tpm_crb: open code the crb_init into acpi_add" Revert "tmp/tpm_crb: implement runtime pm for tpm_crb" lsm,audit,selinux: Introduce a new audit data type LSM_AUDIT_DATA_FILE tmp/tpm_crb: implement runtime pm for tpm_crb tpm/tpm_crb: open code the crb_init into acpi_add tmp/tpm_crb: fix Intel PTT hw bug during idle state tpm/tpm_crb: implement tpm crb idle state tpm: add check for minimum buffer size in tpm_transmit() tpm: constify TPM 1.x header structures tpm/tpm_crb: fix the over 80 characters checkpatch warring tpm/tpm_crb: drop useless cpu_to_le32 when writing to registers tpm/tpm_crb: cache cmd_size register value. tmp/tpm_crb: drop include to platform_device tpm/tpm_tis: remove unused itpm variable tpm_crb: fix incorrect values of cmdReady and goIdle bits tpm_crb: refine the naming of constants tpm_crb: remove wmb()'s tpm_crb: fix crb_req_canceled behavior ...	2016-10-04 14:48:27 -07:00
Linus Torvalds	2105b9ff73	Minor jfs updates Merge tag 'jfs-4.9' of git://github.com/kleikamp/linux-shaggy Pull jfs updates from David Kleikamp: "Minor jfs updates" * tag 'jfs-4.9' of git://github.com/kleikamp/linux-shaggy: jfs: Simplify code jfs: jump to error_out when filemap_{fdatawait, write_and_wait} fails	2016-10-04 13:45:09 -07:00
Linus Torvalds	5fdf4939dc	We've only got six GFS2 patches for this merge window. In patch order: 1. Fabian Frederick submitted a nice cleanup that uses the BIT macro rather than bit shifting. 2. Andreas Gruenbacher contributed a patch that fixes a long-standing annoyance whereby GFS2 warned about dirty pages. 3. Andreas also fixed a problem with the recent extended attribute readahead feature. 4. Chao Yu contributed a patch that checks the return code from function register_shrinker and reacts accordingly. Previously, it was not checked. 5. Andreas Gruenbacher also fixed a problem whereby incore file timestamps were forgotten if the file was invalidated. This merely moves the assignment inside the inode glock where it belongs. 6. He also fixed a problem where incore timestamps were not initialized. Merge tag 'gfs2-4.8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Bob Peterson: "We've only got six GFS2 patches for this merge window. In patch order: - Fabian Frederick submitted a nice cleanup that uses the BIT macro rather than bit shifting. - Andreas Gruenbacher contributed a patch that fixes a long-standing annoyance whereby GFS2 warned about dirty pages. - Andreas also fixed a problem with the recent extended attribute readahead feature. - Chao Yu contributed a patch that checks the return code from function register_shrinker and reacts accordingly. Previously, it was not checked. 
- Andreas Gruenbacher also fixed a problem whereby incore file timestamps were forgotten if the file was invalidated. This merely moves the assignment inside the inode glock where it belongs. - Andreas also fixed a problem where incore timestamps were not initialized" * tag 'gfs2-4.8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Initialize atime of I_NEW inodes gfs2: Update file times after grabbing glock gfs2: fix to detect failure of register_shrinker gfs2: Fix extended attribute readahead optimization gfs2: Remove dirty buffer warning from gfs2_releasepage GFS2: use BIT() macro	2016-10-04 13:42:13 -07:00
Linus Torvalds	c35bcfd8e4	File locking related changes for v4.9 Merge tag 'locks-v4.9-1' of git://git.samba.org/jlayton/linux Pull file locking updates from Jeff Layton: "Only a single patch from Nikolay this cycle, with a small change to better handle /proc/locks in a containerized host" * tag 'locks-v4.9-1' of git://git.samba.org/jlayton/linux: locks: Filter /proc/locks output on proc pid ns	2016-10-04 13:36:19 -07:00
Jeff Layton	3f807e5ae5	NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic The caller of rpc_run_task also gets a reference that must be put. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-10-04 16:30:54 -04:00
Deepa Dinamani	2f86e0919a	fs: nfs: Make nfs boot time y2038 safe boot_time is represented as a struct timespec. struct timespec and CURRENT_TIME are not y2038 safe. Overall, the plan is to use timespec64 and ktime_t for all internal kernel representation of timestamps. CURRENT_TIME will also be removed. boot_time is used to construct the nfs client boot verifier. Use ktime_t to represent boot_time and ktime_get_real() for the boot_time value. Following Trond's request https://lkml.org/lkml/2016/6/9/22 , use ktime_t instead of converting to struct timespec64. Use higher and lower 32 bit parts of ktime_t for the boot verifier. Use the lower 32 bit part of ktime_t for the authsys_parms stamp field. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: linux-nfs@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-10-04 16:20:26 -04:00
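The split of a 64-bit timestamp into two 32-bit words described in the commit above can be sketched in userspace as follows. This is a model under stated assumptions, not the NFS client code: the function names `split_ktime()`, `boot_verifier()`, and `authsys_stamp()` are hypothetical, and the big-endian packing order is chosen only to make the example concrete.

```python
import struct

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def split_ktime(ns: int) -> tuple:
    """Return the (high, low) 32-bit halves of a 64-bit nanosecond value."""
    ns &= MASK64
    return (ns >> 32) & MASK32, ns & MASK32


def boot_verifier(ns: int) -> bytes:
    """Pack both halves into an 8-byte verifier (byte order is an assumption)."""
    high, low = split_ktime(ns)
    return struct.pack(">II", high, low)


def authsys_stamp(ns: int) -> int:
    """Use only the low 32 bits, as the commit message describes for the stamp."""
    return split_ktime(ns)[1]
```

Because the full 64-bit value is carried in the verifier, no information is lost when the clock passes the 32-bit seconds limit that makes `struct timespec` unsafe in 2038.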
Darrick J. Wong	17c12bcd30	xfs: when replaying bmap operations, don't let unlinked inodes get reaped Log recovery will iget an inode to replay BUI items and iput the inode when it's done. Unfortunately, if the inode was unlinked, the iput will see that i_nlink == 0 and decide to truncate & free the inode, which prevents us from replaying subsequent BUIs. We can't skip the BUIs because we have to replay all the redo items to ensure that atomic operations complete. Since unlinked inode recovery will reap the inode anyway, we can safely introduce a new inode flag to indicate that an inode is in this 'unlinked recovery' state and should not be auto-reaped in the drop_inode path. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 11:05:44 -07:00
Darrick J. Wong	9f3afb57d5	xfs: implement deferred bmbt map/unmap operations Implement deferred versions of the inode block map/unmap functions. These will be used in subsequent patches to make reflink operations atomic. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 11:05:44 -07:00
Darrick J. Wong	4847acf868	xfs: pass bmapi flags through to bmap_del_extent Pass BMAPI_ flags from bunmapi into bmap_del_extent and extend BMAPI_REMAP (which means "don't touch the allocator or the quota accounting") to apply to bunmapi as well. This will be used to implement the unmap operation, which will be used by swapext. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 11:05:44 -07:00
Darrick J. Wong	f65306ea52	xfs: map an inode's offset to an exact physical block Teach the bmap routine to know how to map a range of file blocks to a specific range of physical blocks, instead of simply allocating fresh blocks. This enables reflink to map a file to blocks that are already in use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 11:05:44 -07:00
Darrick J. Wong	77d61fe45e	xfs: log bmap intent items Provide a mechanism for higher levels to create BUI/BUD items, submit them to the log, and a stub function to deal with recovered BUI items. These parts will be connected to the rmapbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 11:05:44 -07:00
Darrick J. Wong	6413a01420	xfs: create bmbt update intent log items Create bmbt update intent/done log items to record redo information in the log. Because we roll transactions multiple times for reflink operations, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-04 11:05:43 -07:00
Linus Torvalds	e6dce825fb	TTY/Serial patches for 4.9-rc1 Here is the big TTY and Serial patch set for 4.9-rc1. It also includes some drivers/dma/ changes, as those were needed by some serial drivers, and they were all acked by the DMA maintainer. Also in here is the long-suffering ACPI SPCR patchset, which was passed around from maintainer to maintainer like a hot-potato. Seems I was the sucker^Wlucky one. All of those patches have been acked by the various subsystem maintainers as well. All of this has been in linux-next with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Merge tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty and serial updates from Greg KH: "Here is the big tty and serial patch set for 4.9-rc1. It also includes some drivers/dma/ changes, as those were needed by some serial drivers, and they were all acked by the DMA maintainer. Also in here is the long-suffering ACPI SPCR patchset, which was passed around from maintainer to maintainer like a hot-potato. Seems I was the sucker^Wlucky one. All of those patches have been acked by the various subsystem maintainers as well. 
All of this has been in linux-next with no reported issues" * tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (111 commits) Revert "serial: pl011: add console matching function" MAINTAINERS: update entry for atmel_serial driver serial: pl011: add console matching function ARM64: ACPI: enable ACPI_SPCR_TABLE ACPI: parse SPCR and enable matching console of/serial: move earlycon early_param handling to serial Revert "drivers/tty: Explicitly pass current to show_stack" tty: amba-pl011: Don't complain on -EPROBE_DEFER when no irq nios2: dts: 10m50: Add tx-threshold parameter serial: 8250: Set Altera 16550 TX FIFO Threshold serial: 8250: of: Load TX FIFO Threshold from DT Documentation: dt: serial: Add TX FIFO threshold parameter drivers/tty: Explicitly pass current to show_stack serial: imx: Fix DCD reading serial: stm32: mark symbols static where possible serial: xuartps: Add some register initialisation to cdns_early_console_setup() serial: xuartps: Removed unwanted checks while reading the error conditions serial: xuartps: Rewrite the interrupt handling logic serial: stm32: use mapbase instead of membase for DMA tty/serial: atmel: fix fractional baud rate computation ...	2016-10-03 20:11:49 -07:00
Linus Torvalds	9929780e86	Driver core patches for 4.9-rc1 Here are the "big" driver core patches for 4.9-rc1. Also in here are a number of debugfs fixes that cropped up due to the changes that happened in 4.8 for that filesystem. Overall, nothing major, just a few fixes and cleanups. All of these have been in linux-next with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Merge tag 'driver-core-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here are the "big" driver core patches for 4.9-rc1. Also in here are a number of debugfs fixes that cropped up due to the changes that happened in 4.8 for that filesystem. Overall, nothing major, just a few fixes and cleanups. All of these have been in linux-next with no reported issues" * tag 'driver-core-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (23 commits) drivers: dma-coherent: Move spinlock in dma_alloc_from_coherent() drivers: dma-coherent: Fix DMA coherent size for less than page MAINTAINERS: extend firmware_class maintainer list debugfs: propagate release() call result driver-core: platform: Catch errors from calls to irq_get_irq_data sysfs print name of undiscoverable attribute group carl9170: fix debugfs crashes b43legacy: fix debugfs crash b43: fix debugfs crash debugfs: introduce a public file_operations accessor device core: Remove deprecated create_singlethread_workqueue drivers/base dmam_declare_coherent_memory leaks platform: don't return 0 from platform_get_irq[_byname]() on error cpu: clean up register_cpu func dma-mapping: use vma_pages(). drivers: dma-coherent: use vma_pages(). 
attribute_container: Fix typo base: soc: make it explicitly non-modular drivers: base: dma-mapping: page align the size when unmap_kernel_range platform driver: fix use-after-free in platform_device_del() ...	2016-10-03 20:03:24 -07:00
Al Viro	d82718e348	fuse_dev_splice_read(): switch to add_to_pipe() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-03 20:40:56 -04:00
Al Viro	79fddc4efd	new helper: add_to_pipe() single-buffer analogue of splice_to_pipe(); vmsplice_to_pipe() switched to that, leaving splice_to_pipe() only for ->splice_read() instances (and that only until they are converted as well). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-03 20:40:55 -04:00
Al Viro	8924feff66	splice: lift pipe_lock out of splice_to_pipe() * splice_to_pipe() stops at pipe overflow and does not take pipe_lock * ->splice_read() instances do the same * vmsplice_to_pipe() and do_splice() (ultimate callers of splice_to_pipe()) arrange for waiting, looping, etc. themselves. That should make pipe_lock the outermost one. Unfortunately, existing rules for the amount passed by vmsplice_to_pipe() and do_splice() are quite ugly _and_ userland code can be easily broken by changing those. It's not even "no more than the maximal capacity of this pipe" - it's "once we'd fed pipe->nr_buffers pages into the pipe, leave instead of waiting". Considering how poorly these rules are documented, let's try "wait for some space to appear, unless given SPLICE_F_NONBLOCK, then push into pipe and if we run into overflow, we are done". Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-03 20:40:55 -04:00
Al Viro	db85a9eb2e	splice: switch get_iovec_page_array() to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-03 20:40:54 -04:00
Al Viro	e7c3c64624	splice_to_pipe(): don't open-code wakeup_pipe_readers() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-03 20:40:54 -04:00
Al Viro	4038acdb18	consistent treatment of EFAULT on O_DIRECT read/write Make local filesystems treat a fault as shortened IO, returning -EFAULT only if nothing had been transferred. That's how everything else (NFS, FUSE, ceph, Lustre) behaves. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-10-03 20:38:55 -04:00
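The rule stated in the commit above (a fault shortens the I/O, and -EFAULT is returned only if nothing was transferred) can be written down as a tiny userspace model. This is a sketch of the return-value convention only; `direct_io_result()` and its `fault_at` parameter are hypothetical names, not kernel interfaces.

```python
import errno
from typing import Optional


def direct_io_result(requested: int, fault_at: Optional[int]) -> int:
    """Model the return value of a transfer that may fault partway through.

    requested: number of bytes the caller asked for.
    fault_at:  byte offset at which the user buffer faults, or None if the
               whole buffer is accessible.
    """
    if fault_at is None or fault_at >= requested:
        return requested          # no fault inside the range: full transfer
    if fault_at > 0:
        return fault_at           # partial progress: report a short transfer
    return -errno.EFAULT          # nothing transferred: report the fault
```

A caller therefore sees a short count first and only gets -EFAULT when it retries at the faulting address, matching the behaviour the message attributes to NFS, FUSE, ceph, and Lustre.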
Linus Torvalds	8e4ef63867	Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 vdso updates from Ingo Molnar: "The main changes in this cycle centered around adding support for 32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry Safonov" * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64 x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE x86/signal: Add SA_{X32,IA32}_ABI sa_flags x86/ptrace: Down with test_thread_flag(TIF_IA32) x86/coredump: Use pr_reg size, rather that TIF_IA32 flag x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_* x86/vdso: Replace calculate_addr in map_vdso() with addr x86/vdso: Unmap vdso blob on vvar mapping failure	2016-10-03 17:29:01 -07:00
Linus Torvalds	1a4a2bc460	Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull low-level x86 updates from Ingo Molnar: "In this cycle this topic tree has become one of those 'super topics' that accumulated a lot of changes: - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on x86 - preceded by an array of changes. v4.8 saw preparatory changes in this area already - this is the rest of the work. Includes the thread stack caching performance optimization. (Andy Lutomirski) - switch_to() cleanups and all around enhancements. (Brian Gerst) - A large number of dumpstack infrastructure enhancements and an unwinder abstraction. The secret long term plan is safe(r) live patching plus maybe another attempt at debuginfo based unwinding - but all these current bits are standalone enhancements in a frame pointer based debug environment as well. (Josh Poimboeuf) - More __ro_after_init and const annotations. (Kees Cook) - Enable KASLR for the vmemmap memory region. (Thomas Garnier)" [ The virtually mapped stack changes are pretty fundamental, and not x86-specific per se, even if they are only used on x86 right now. 
] * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits) x86/asm: Get rid of __read_cr4_safe() thread_info: Use unsigned long for flags x86/alternatives: Add stack frame dependency to alternative_call_2() x86/dumpstack: Fix show_stack() task pointer regression x86/dumpstack: Remove dump_trace() and related callbacks x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder oprofile/x86: Convert x86_backtrace() to use the new unwinder x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder perf/x86: Convert perf_callchain_kernel() to use the new unwinder x86/unwind: Add new unwind interface and implementations x86/dumpstack: Remove NULL task pointer convention fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK lib/syscall: Pin the task stack in collect_syscall() x86/process: Pin the target stack in get_wchan() x86/dumpstack: Pin the target stack when dumping it kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function sched/core: Add try_get_task_stack() and put_task_stack() x86/entry/64: Fix a minor comment rebase error iommu/amd: Don't put completion-wait semaphore on stack ...	2016-10-03 16:13:28 -07:00
Linus Torvalds	00bcf5cdd6	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - rwsem micro-optimizations (Davidlohr Bueso) - Improve the implementation and optimize the performance of percpu-rwsems. (Peter Zijlstra.) - Convert all lglock users to better facilities such as percpu-rwsems or percpu-spinlocks and remove lglocks. (Peter Zijlstra) - Remove the ticket (spin)lock implementation. (Peter Zijlstra) - Korean translation of memory-barriers.txt and related fixes to the English document. (SeongJae Park) - misc fixes and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/cmpxchg, locking/atomics: Remove superfluous definitions x86, locking/spinlocks: Remove ticket (spin)lock implementation locking/lglock: Remove lglock implementation stop_machine: Remove stop_cpus_lock and lg_double_lock/unlock() fs/locks: Use percpu_down_read_preempt_disable() locking/percpu-rwsem: Add down_read_preempt_disable() fs/locks: Replace lg_local with a per-cpu spinlock fs/locks: Replace lg_global with a percpu-rwsem locking/percpu-rwsem: Add DEFINE_STATIC_PERCPU_RWSEMand percpu_rwsem_assert_held() locking/pv-qspinlock: Use cmpxchg_release() in __pv_queued_spin_unlock() locking/rwsem, x86: Drop a bogus cc clobber futex: Add some more function commentry locking/hung_task: Show all locks locking/rwsem: Scan the wait_list for readers only once locking/rwsem: Remove a few useless comments locking/rwsem: Return void in __rwsem_mark_wake() locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write() locking/Documentation: Add Korean translation locking/Documentation: Fix a typo of example result locking/Documentation: Fix wrong section reference ...	2016-10-03 12:15:00 -07:00
Mike Marshall	f60fbdbf41	Revert "orangefs: bump minimum userspace version" The features op did make it into OrangeFS 2.9.6 after all. This reverts commit `0c95ad7636`.	2016-10-03 15:07:36 -04:00
Linus Torvalds	de956b8f45	Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "Main changes in this cycle were: - Refactor the EFI memory map code into architecture neutral files and allow drivers to permanently reserve EFI boot services regions on x86, as well as ARM/arm64. (Matt Fleming) - Add ARM support for the EFI ESRT driver. (Ard Biesheuvel) - Make the EFI runtime services and efivar API interruptible by swapping spinlocks for semaphores. (Sylvain Chouleur) - Provide the EFI identity mapping for kexec which allows kexec to work on SGI/UV platforms with requiring the "noefi" kernel command line parameter. (Alex Thorlton) - Add debugfs node to dump EFI page tables on arm64. (Ard Biesheuvel) - Merge the EFI test driver being carried out of tree until now in the FWTS project. (Ivan Hu) - Expand the list of flags for classifying EFI regions as "RAM" on arm64 so we align with the UEFI spec. (Ard Biesheuvel) - Optimise out the EFI mixed mode if it's unsupported (CONFIG_X86_32) or disabled (CONFIG_EFI_MIXED=n) and switch the early EFI boot services function table for direct calls, alleviating us from having to maintain the custom function table. 
(Lukas Wunner) - Miscellaneous cleanups and fixes" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) x86/efi: Round EFI memmap reservations to EFI_PAGE_SIZE x86/efi: Allow invocation of arbitrary boot services x86/efi: Optimize away setup_gop32/64 if unused x86/efi: Use kmalloc_array() in efi_call_phys_prolog() efi/arm64: Treat regions with WT/WC set but WB cleared as memory efi: Add efi_test driver for exporting UEFI runtime service interfaces x86/efi: Defer efi_esrt_init until after memblock_x86_fill efi/arm64: Add debugfs node to dump UEFI runtime page tables x86/efi: Remove unused find_bits() function fs/efivarfs: Fix double kfree() in error path x86/efi: Map in physical addresses in efi_map_region_fixed lib/ucs2_string: Speed up ucs2_utf8size() firmware-gsmi: Delete an unnecessary check before the function call "dma_pool_destroy" x86/efi: Initialize status to ensure garbage is not returned on small size efi: Replace runtime services spinlock with semaphore efi: Don't use spinlocks for efi vars efi: Use a file local lock for efivars efi/arm*: esrt: Add missing call to efi_esrt_init() efi/esrt: Use memremap not ioremap to access ESRT table in memory x86/efi-bgrt: Use efi_mem_reserve() to avoid copying image data ...	2016-10-03 11:33:18 -07:00
David Sterba	0e6757859e	btrfs: tests: uninline member definitions in free_space_extent The recommended way is to put all members on separate lines. Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:15 +02:00
David Sterba	d2d9ac6aae	btrfs: tests: constify free space extent specs We don't change the given extent ranges, mark them const to catch accidental changes. Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:15 +02:00
Omar Sandoval	781e3bcf0e	Btrfs: expand free space tree sanity tests to catch endianness bug The free space tree format conversion functions were broken on big-endian systems, but the sanity tests didn't catch it because all of the operations were aligned to multiple words. This was meant to catch any bugs in the extent buffer code's handling of high memory, but it ended up hiding the endianness bug. Expand the tests to do both sector-aligned and page-aligned operations. Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:14 +02:00
Omar Sandoval	9426ce754f	Btrfs: fix extent buffer bitmap tests on big-endian systems The in-memory bitmap code manipulates words and is therefore sensitive to endianness, while the extent buffer bitmap code addresses bytes and is byte-order agnostic. Because the byte addressing of the extent buffer bitmaps is equivalent to a little-endian in-memory bitmap, the extent buffer bitmap tests fail on big-endian systems. `34b3e6c92a` ("Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system") worked around another endianness bug in the tests but missed this one because `ed9e4afdb0` ("Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE") disables this part of the test on ppc64. That change lost the original meaning of the test, however. We really want to test that an equivalent series of operations using the in-memory bitmap API and the extent buffer bitmap API produces equivalent results. To fix this, don't use memcmp_extent_buffer() or write_extent_buffer(); do everything bit-by-bit. Reported-by: Anatoly Pugachev <matorola@gmail.com> Tested-by: Anatoly Pugachev <matorola@gmail.com> Tested-by: Feifei Xu <xufeifei@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:14 +02:00
Omar Sandoval	6675df311d	Btrfs: catch invalid free space trees There are two separate issues that can lead to corrupted free space trees. 1. The free space tree bitmaps had an endianness issue on big-endian systems which is fixed by an earlier patch in this series. 2. btrfs-progs before v4.7.3 modified filesystems without updating the free space tree. To catch both of these issues at once, we need to force the free space tree to be rebuilt. To do so, add a FREE_SPACE_TREE_VALID compat_ro bit. If the bit isn't set, we know that it was either produced by a broken big-endian kernel or may have been corrupted by btrfs-progs. This also provides us with a way to add rudimentary read-write support for the free space tree to btrfs-progs: it can just clear this bit and have the kernel rebuild the free space tree. Cc: stable@vger.kernel.org # 4.5+ Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:14 +02:00
Omar Sandoval	f8d468a15c	Btrfs: fix mount -o clear_cache,space_cache=v2 We moved the code for creating the free space tree the first time that it's enabled, but didn't move the clearing code along with it. This breaks my (undocumented) intention that `mount -o clear_cache,space_cache=v2` would clear the free space tree and then recreate it. Fixes: `511711af91` ("btrfs: don't run delayed references while we are creating the free space tree") Cc: stable@vger.kernel.org # 4.5+ Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:14 +02:00
Omar Sandoval	2fe1d55134	Btrfs: fix free space tree bitmaps on big-endian systems In convert_free_space_to_{bitmaps,extents}(), we buffer the free space bitmaps in memory and copy them directly to/from the extent buffers with {read,write}_extent_buffer(). The extent buffer bitmap helpers use byte granularity, which is equivalent to a little-endian bitmap. This means that on big-endian systems, the in-memory bitmaps will be written to disk byte-swapped. To fix this, use byte-granularity for the bitmaps in memory. Fixes: `a5ed918285` ("Btrfs: implement the free space B-tree") Cc: stable@vger.kernel.org # 4.5+ Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-03 18:52:14 +02:00
Darrick J. Wong	350a27a6a6	xfs: introduce reflink utility functions These functions will be used by the other reflink functions to find the maximum length of a range of shared blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:25 -07:00
Darrick J. Wong	d0e853f360	xfs: reserve AG space for the refcount btree root Reduce the max AG usable space size so that we always have space for the refcount btree root. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:24 -07:00
Darrick J. Wong	a90c00f055	xfs: add refcount btree block detection to log recovery Identify refcountbt blocks in the log correctly so that we can validate them during log recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:23 -07:00
Darrick J. Wong	62aab20f08	xfs: adjust refcount when unmapping file blocks When we're unmapping blocks from a reflinked file, decrease the refcount of the affected blocks and free the extents that are no longer in use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:23 -07:00
Darrick J. Wong	33ba612920	xfs: connect refcount adjust functions to upper layers Plumb in the upper level interface to schedule and finish deferred refcount operations via the deferred ops mechanism. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:22 -07:00
Darrick J. Wong	3172725814	xfs: adjust refcount of an extent of blocks in refcount btree Provide functions to adjust the reference counts for an extent of physical blocks stored in the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:21 -07:00
Darrick J. Wong	f997ee2137	xfs: log refcount intent items Provide a mechanism for higher levels to create CUI/CUD items, submit them to the log, and a stub function to deal with recovered CUI items. These parts will be connected to the refcountbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:21 -07:00
Darrick J. Wong	baf4bcacb7	xfs: create refcount update intent log items Create refcount update intent/done log items to record redo information in the log. Because we need to roll transactions between updating the bmbt mapping and updating the reverse mapping, we also have to track the status of the metadata updates that will be recorded in the post-roll transactions, just in case we crash before committing the final transaction. This mechanism enables log recovery to finish what was already started. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:20 -07:00
Darrick J. Wong	bdf28630b7	xfs: add refcount btree operations Implement the generic btree operations required to manipulate refcount btree blocks. The implementation is similar to the bmapbt, though it will only allocate and free blocks from the AG. Since the refcount root and level fields are separate from the existing roots and levels array, they need a separate logging flag. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: fix logging of AGF refcount btree fields] Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:19 -07:00
Darrick J. Wong	f310bd2ecd	xfs: account for the refcount btree in the alloc/free log reservation Every time we allocate or free a data extent, we might need to split the refcount btree. Reserve some blocks in the transaction to handle this possibility. Even though the deferred refcount code can roll a transaction to avoid overloading the transaction, we can still exceed the reservation. Certain pathological workloads (1k blocks, no cowextsize hint, random directio writes), cause a perfect storm wherein a refcount adjustment of a large range of blocks causes full tree splits in two separate extents in two separate refcount tree blocks; allocating new refcount tree blocks causes rmap btree splits; and all the allocation activity causes the freespace btrees to split, blowing the reservation. (Reproduced by generic/167 over NFS atop XFS) Signed-off-by: Christoph Hellwig <hch@lst.de> [darrick.wong@oracle.com: add commit message] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2016-10-03 09:11:19 -07:00
Darrick J. Wong	ac4fef6938	xfs: add refcount btree support to growfs Modify the growfs code to initialize new refcount btree blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:18 -07:00
Darrick J. Wong	1946b91cee	xfs: define the on-disk refcount btree format Start constructing the refcount btree implementation by establishing the on-disk format and everything needed to read, write, and manipulate the refcount btree blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:18 -07:00
Darrick J. Wong	af30dfa144	xfs: refcount btree add more reserved blocks Since XFS reserves a small amount of space in each AG as the minimum free space needed for an operation, save some more space in case we touch the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:17 -07:00
Darrick J. Wong	46eeb521b9	xfs: introduce refcount btree definitions Add new per-AG refcount btree definitions to the per-AG structures. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:16 -07:00
Darrick J. Wong	c75c752d03	xfs: define tracepoints for refcount btree activities Define all the tracepoints we need to inspect the refcount btree runtime operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:15 -07:00
Darrick J. Wong	9cdafd8a76	xfs: return an error when an inline directory is too small If the size of an inline directory is so small that it doesn't even cover the required header size, return an error to userspace instead of ASSERTing and returning 0 like everything's ok. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2016-10-03 09:11:15 -07:00
Darrick J. Wong	71be6b4942	vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks Add a new fallocate mode flag that explicitly unshares blocks on filesystems that support such features. The new flag can only be used with an allocate-mode fallocate call. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2016-10-03 09:11:14 -07:00
Wei Yongjun	8cdcc07dde	ceph: use list_move instead of list_del/list_add Using list_move() instead of list_del() + list_add(). Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2016-10-03 16:13:50 +02:00
Yan, Zheng	fcff415c94	ceph: handle CEPH_SESSION_REJECT message Signed-off-by: Yan, Zheng <zyan@redhat.com>	2016-10-03 16:13:50 +02:00
Yan, Zheng	ce2728aaa8	ceph: avoid accessing / when mounting a subpath Accessing / causes failure if the client has caps that restrict the path. Signed-off-by: Yan, Zheng <zyan@redhat.com>	2016-10-03 16:13:50 +02:00
Yan, Zheng	db4a63aab4	ceph: fix mandatory flock check Signed-off-by: Yan, Zheng <zyan@redhat.com>	2016-10-03 16:13:49 +02:00
NeilBrown	e55f1a1871	ceph: remove warning when ceph_releasepage() is called on dirty page If O_DIRECT writes are racing with buffered writes, then the call to invalidate_inode_pages2_range() can call ceph_releasepage() on dirty pages. Most filesystems hold inode_lock() across O_DIRECT writes so they do not suffer this race, but cephfs deliberately drops the lock, and opens a window for the race. This race can be triggered with the generic/036 test from the xfstests test suite. It doesn't happen every time, but it does happen often. As the possibility is expected, remove the warning, and instead include the PageDirty() status in the debug message. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>	2016-10-03 16:13:49 +02:00
NeilBrown	5d7eb1a322	ceph: ignore error from invalidate_inode_pages2_range() in direct write This call can fail if there are dirty pages. The preceding call to filemap_write_and_wait_range() will normally remove dirty pages, but as inode_lock() is not held over calls to ceph_direct_read_write(), it could race with non-direct writes and pages could be dirtied immediately after filemap_write_and_wait_range() returns. If there are dirty pages, they will be removed by the subsequent call to truncate_inode_pages_range(), so having them here is not a problem. If the 'ret' value is left holding an error, then in the async IO case (aio_req is not NULL) the loop that would normally call ceph_osdc_start_request() will see the error in 'ret' and abort all requests. This doesn't seem like correct behaviour. So use separate 'ret2' instead of overloading 'ret'. Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>	2016-10-03 16:13:49 +02:00
Yan, Zheng	1afe478569	ceph: fix error handling of start_read() If start_page() fails to add a page to the page cache or fails to send the OSD request, it should call put_page() (instead of free_page()) for the relevant pages. Besides, start_page() needs to cancel the fscache readpage if it fails to send the OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reported-by: Zhi Zhang <zhang.david2011@gmail.com>	2016-10-03 16:13:49 +02:00
Miklos Szeredi	63401ccdb2	fuse: limit xattr returned size Don't let userspace filesystem give bogus values for the size of xattr and xattr list. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-03 11:06:05 +02:00
David S. Miller	b50afd203a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Three sets of overlapping changes. Nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>	2016-10-02 22:20:41 -04:00
Dave Chinner	155cd433b5	Merge branch 'xfs-4.9-log-recovery-fixes' into for-next	2016-10-03 09:56:28 +11:00
Dave Chinner	a1f45e668e	Merge branch 'iomap-4.9-dax' into for-next	2016-10-03 09:53:59 +11:00
Dave Chinner	a89b3f97bb	Merge branch 'xfs-4.9-delalloc-rework' into for-next	2016-10-03 09:52:51 +11:00
Dave Chinner	79ad576124	Merge branch 'xfs-4.9-reflink-prep' into for-next	2016-10-03 09:52:31 +11:00
Dave Chinner	b036b97050	Merge branch 'iomap-4.9-misc-fixes-1' into for-next	2016-10-03 09:52:11 +11:00
Christoph Hellwig	a447d7cd15	xfs: update atime before I/O in xfs_file_dio_aio_read After the call to __blkdev_direct_IO the final reference to the file might have been dropped by aio_complete already, and the call to file_accessed might cause a use after free. Instead update the access time before the I/O, similar to how we update the time stamps before writes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-and-tested-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-03 09:47:34 +11:00
Christoph Hellwig	d5bfccdf38	ext2: fix possible integer truncation in ext2_iomap_begin For 32-bit architectures we need to cast first_block to u64 before shifting it left. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-10-03 09:46:04 +11:00
Julia Lawall	ec037dfcc0	UBIFS: improve function-level documentation Fix various inconsistencies in the documentation associated with various functions. In the case of fs/ubifs/lprops.c, the second parameter of ubifs_get_lp_stats was renamed from st to lst in commit `84abf972cc` ("UBIFS: add re-mount debugging checks"). In the case of fs/ubifs/lpt_commit.c, the excess variables have never existed in the associated functions since the code was introduced into the kernel. The others appear to be straightforward typos. Issues detected using Coccinelle (http://coccinelle.lip6.fr/) Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-02 22:55:02 +02:00
Pascal Eberhard	74e9c700bc	ubifs: fix host xattr_len when changing xattr When an extended attribute is changed, xattr_len of host inode is recalculated. ui->data_len is updated before computation and result is wrong. This patch adds a temporary variable to fix computation. To reproduce the issue: ~# > a.txt ~# attr -s an-attr -V a-value a.txt ~# attr -s an-attr -V a-bit-bigger-value a.txt Now host inode xattr_len is wrong. Forcing dbg_check_filesystem() generates the following error: [ 130.620140] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" started, PID 565 [ 131.470790] UBIFS error (ubi0:2 pid 564): check_inodes: inode 646 has xattr size 240, but calculated size is 256 [ 131.481697] UBIFS (ubi0:2): dump of the inode 646 sitting in LEB 29:114688 [ 131.488953] magic 0x6101831 [ 131.492876] crc 0x9fce9091 [ 131.496836] node_type 0 (inode node) [ 131.501193] group_type 1 (in node group) [ 131.505788] sqnum 9278 [ 131.509191] len 160 [ 131.512549] key (646, inode) [ 131.516688] creat_sqnum 9270 [ 131.520133] size 0 [ 131.523264] nlink 1 [ 131.526398] atime 1053025857.0 [ 131.530574] mtime 1053025857.0 [ 131.534714] ctime 1053025906.0 [ 131.538849] uid 0 [ 131.542009] gid 0 [ 131.545140] mode 33188 [ 131.548636] flags 0x1 [ 131.551977] xattr_cnt 1 [ 131.555108] xattr_size 240 [ 131.558420] xattr_names 12 [ 131.561670] compr_type 0x1 [ 131.564983] data len 0 [ 131.568125] UBIFS error (ubi0:2 pid 564): dbg_check_filesystem: file-system check failed with error -22 [ 131.578074] CPU: 0 PID: 564 Comm: mount Not tainted 4.4.12-g3639bea54a #24 [ 131.585352] Hardware name: Generic AM33XX (Flattened Device Tree) [ 131.591918] [<c00151c0>] (unwind_backtrace) from [<c0012acc>] (show_stack+0x10/0x14) [ 131.600177] [<c0012acc>] (show_stack) from [<c01c950c>] (dbg_check_filesystem+0x464/0x4d0) [ 131.608934] [<c01c950c>] (dbg_check_filesystem) from [<c019f36c>] (ubifs_mount+0x14f8/0x2130) [ 131.617991] [<c019f36c>] (ubifs_mount) from [<c00d7088>] (mount_fs+0x14/0x98) [ 
131.625572] [<c00d7088>] (mount_fs) from [<c00ed674>] (vfs_kern_mount+0x4c/0xd4) [ 131.633435] [<c00ed674>] (vfs_kern_mount) from [<c00efb5c>] (do_mount+0x988/0xb50) [ 131.641471] [<c00efb5c>] (do_mount) from [<c00f004c>] (SyS_mount+0x74/0xa0) [ 131.648837] [<c00f004c>] (SyS_mount) from [<c000fe20>] (ret_fast_syscall+0x0/0x3c) [ 131.665315] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" stops Signed-off-by: Pascal Eberhard <pascal.eberhard@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-02 22:55:02 +02:00
Richard Weinberger	1e03953388	ubifs: Use move variable in ubifs_rename() ...to make the code more consistent since we use move already in other places. Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-02 22:55:02 +02:00
Richard Weinberger	9ec64962af	ubifs: Implement RENAME_EXCHANGE Adds RENAME_EXCHANGE to UBIFS. The operation itself is completely disjoint from a regular rename(), which is why we dispatch very early in ubifs_rename(). RENAME_EXCHANGE, used by the renameat2() system call, allows the caller to exchange two paths atomically. Both paths have to exist and have to be on the same filesystem. Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-02 22:55:02 +02:00
Richard Weinberger	9e0a1fff8d	ubifs: Implement RENAME_WHITEOUT Adds RENAME_WHITEOUT support to UBIFS; we implement it in the same way as ext4 and xfs do. For an overview of other ways to implement it please refer to commit `7dcf5c3e45` ("xfs: add RENAME_WHITEOUT support"). Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-02 22:55:02 +02:00
Richard Weinberger	474b93704f	ubifs: Implement O_TMPFILE This patch adds O_TMPFILE support to UBIFS. A temp file is a reference to an unlinked inode; a user holding the reference can use it. As soon as it is closed, all data vanishes. Signed-off-by: Richard Weinberger <richard@nod.at>	2016-10-02 22:55:02 +02:00
Miklos Szeredi	4680a7ee5d	fuse: remove duplicate cs->offset assignment Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:33 +02:00
Miklos Szeredi	acbe5fda1f	fuse: don't use fuse_ioctl_copy_user() helper The two invocations share little code. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:33 +02:00
Al Viro	3daa9c5165	fuse_ioctl_copy_user(): don't open-code copy_page_{to,from}_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:33 +02:00
Seth Forshee	703c73629f	fuse: Use generic xattr ops In preparation for posix acl support, rework fuse to use xattr handlers and the generic setxattr/getxattr/listxattr callbacks. Split the xattr code out into its own file, and promote symbols to module-global scope as needed. Functionally these changes have no impact, as fuse still uses a single handler for all xattrs which uses the old callbacks. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:32 +02:00
Miklos Szeredi	29433a2991	fuse: get rid of fc->flags Only two flags: "default_permissions" and "allow_other". All other flags are handled via bitfields. So convert these two as well. They don't change during the lifetime of the filesystem, so this is quite safe. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:32 +02:00
Miklos Szeredi	cb3ae6d25a	fuse: listxattr: verify xattr list Make sure userspace filesystem is returning a well formed list of xattr names (zero or more nonzero length, null terminated strings). [Michael Theall: only verify in the nonzero size case] Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org>	2016-10-01 07:32:32 +02:00
Miklos Szeredi	bcb6f6d2b9	fuse: use timespec64 And check for valid nsec value before passing into timespec64_to_jiffies(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:32 +02:00
Miklos Szeredi	f75fdf22b0	fuse: don't use ->d_time Store in memory pointed to by ->d_fsdata. Use ->d_init() to allocate the storage. Need to use RCU freeing because the data is used in RCU lookup mode. We could cast ->d_fsdata directly on 64bit archs, but I don't think this is worth the extra complexity. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:32 +02:00
Seth Forshee	60bcc88ad1	fuse: Add posix ACL support Add a new INIT flag, FUSE_POSIX_ACL, for negotiating ACL support with userspace. When it is set in the INIT response, ACL support will be enabled. ACL support also implies "default_permissions". When ACL support is enabled, the kernel will cache and have responsibility for enforcing ACLs. ACL xattrs will be passed to userspace, which is responsible for updating the ACLs in the filesystem, keeping the file mode in sync, and inheritance of default ACLs when new filesystem nodes are created. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:32 +02:00
Miklos Szeredi	5e940c1dd3	fuse: handle killpriv in userspace fs Only userspace filesystem can do the killing of suid/sgid without races. So introduce an INIT flag and negotiate support for this. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-10-01 07:32:32 +02:00
Miklos Szeredi	a09f99edde	fuse: fix killing s[ug]id in setattr Fuse allowed VFS to set mode in setattr in order to clear suid/sgid on chown and truncate, and (since writeback_cache) write. The problem with this is that it'll potentially restore a stale mode. The proper fix would be to let the filesystems do the suid/sgid clearing on the relevant operations. Possibly some are already doing it but there's no way we can detect this. So fix this by refreshing and recalculating the mode. Do this only if ATTR_KILL_S[UG]ID is set to not destroy performance for writes. This is still racy but the size of the window is reduced. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org>	2016-10-01 07:32:32 +02:00
Miklos Szeredi	5e2b8828ff	fuse: invalidate dir dentry after chmod Without "default_permissions" the userspace filesystem's lookup operation needs to perform the check for search permission on the directory. If directory does not allow search for everyone (this is quite rare) then userspace filesystem has to set entry timeout to zero to make sure permissions are always performed. Changing the mode bits of the directory should also invalidate the (previously cached) dentry to make sure the next lookup will have a chance of updating the timeout, if needed. Reported-by: Jean-Pierre André <jean-pierre.andre@wanadoo.fr> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org>	2016-10-01 07:32:32 +02:00
Jaegeuk Kim	e4c5d8489a	f2fs: introduce update_ckpt_flags to clean up This patch adds update_ckpt_flags() to clean up the flow. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:55:24 -07:00
Chao Yu	6ca56ca429	f2fs: don't submit irrelevant page When we call ->writepages, there are two cases: a. we didn't write out any dirty pages, since they were written back by another thread concurrently. b. we wrote out dirty pages and have already submitted the bio to the block layer. In these cases, we don't need to do additional bio flushing, which may split the bio in the cache into smaller ones. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:39 -07:00
Chao Yu	3f5f4959b1	f2fs: fix to commit bio cache after flushing node pages In sync_node_pages, we don't check and commit the last merged pages in the private bio cache of f2fs. As these pages were tagged as writeback, anyone waiting for writeback of such a page will be blocked until the cache is committed by someone else. We need to commit the node type bio cache to avoid a potential deadlock or a long delay while waiting for writeback. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:38 -07:00
Tiezhu Yang	fc0065adb2	f2fs: introduce get_checkpoint_version for cleanup Nearly identical code is used to get the values of pre_version and cur_version in validate_checkpoint; this patch adds get_checkpoint_version to remove the redundant code. Signed-off-by: Tiezhu Yang <kernelpatch@126.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:37 -07:00
Sheng Yong	3fa565039e	f2fs: remove dead variable Signed-off-by: Sheng Yong <shengyong1@huawei.com> Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:37 -07:00
Chao Yu	7fd748df45	f2fs: remove redundant io plug Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:36 -07:00
Chao Yu	0f34802858	f2fs: support checkpoint error injection This patch adds support for checkpoint error injection in f2fs for testing fatal error tolerance. It is useful because f2fs can simulate an abnormal power-off by itself, instead of relying on applications calling the godown ioctl. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:35 -07:00
Chao Yu	2443b8b363	f2fs: fix to recover old fault injection config in ->remount_fs In ->remount_fs, we didn't recover the original fault injection config if we encountered an error; fix it. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:34 -07:00
Chao Yu	36dbd3287f	f2fs: do fault injection initialization in default_options Do fault injection initialization in default_options to stay consistent with how other default options are configured. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:33 -07:00
Yunlei He	9c094040c5	f2fs: remove redundant value definition This patch removes a redundant value definition in build_sit_entries. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:32 -07:00
Chao Yu	1ecc0c5c50	f2fs: support configuring fault injection per superblock Previously, we only supported global fault injection configuration, so configuring the type/rate of fault injection through sysfs or a mount option influenced every f2fs partition in use. This does not make sense: it is inconvenient if a developer wants to test separate partitions with different fault injection rates/types simultaneously, and it is not possible to enable fault injection on one partition while disabling it on another. From now on, we move the global fault injection configuration in the module into the per-superblock structure, so injection testing can be more flexible. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:31 -07:00
Chao Yu	d32853de50	f2fs: adjust display format of segment bit Just adjust segment bit info printed in procfs. Before: 1008 5\|0 \|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1009 3\|183\|0 0 61 20 20 0 0 21 80 c0 2 e4 e 54 0 21 21 17 a 44 d0 28 e4 50 40 30 8 0 2d 32 0 5 b0 80 1 43 2 8e f8 7b 2 25 93 bf e0 73 8e 9a 19 44 60 ff e4 cc e6 8e bf f9 ff 5 3d 31 3d 13 1010 3\|1 \|0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 After: 1008 5\|0 \| 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 1009 4\|434\| ff 7d ff bf d9 3f ff e7 ff bf d7 bf ff bb be ff fb df f7 fb fa bf fb fe bb df dd ff fe ef ff fe ef e2 27 bf ab bf fb df fd bd bf fb db fc ff ff 3f ff ff bf ff 5f db 3f fb fb bf fb bf 4f ff ef 1010 4\|422\| ff bb fe ff ef d7 ee ff ff fc bf ef 7d eb ec fd fb 3f 97 7f ef ff af ff db ff ff 69 bf ff f6 e7 ff fb f7 7b fb df be ff ff ef f3 fe ff ff df fe f7 fa ff b7 77 be fe fb a9 7f 87 a2 ac c7 ff 75 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:30 -07:00
Jaegeuk Kim	bb5dada7d2	f2fs: remove dirty inode pages in error path When getting EIO while handling orphan inodes, we can get some dirty node pages. Then, f2fs_write_node_pages() called by iput(node_inode) will try to flush node pages. But in this case, we should prevent that, since we will try again from the start. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:29 -07:00
Eric Biggers	ef68bf1197	f2fs: do not unnecessarily null-terminate encrypted symlink data Null-terminating the fscrypt_symlink_data on read is unnecessary because it is not string data --- it contains binary ciphertext. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:28 -07:00
Jaegeuk Kim	d41065e204	f2fs: handle errors during recover_orphan_inodes This patch fixes to handle EIO during recover_orphan_inode() given the below panic. F2FS-fs : inject IO error in f2fs_read_end_io+0xe6/0x100 [f2fs] ------------[ cut here ]------------ RIP: 0010:[<ffffffffc0b244e3>] [<ffffffffc0b244e3>] f2fs_evict_inode+0x433/0x470 [f2fs] RSP: 0018:ffff92f8b7fb7c30 EFLAGS: 00010246 RAX: ffff92fb88a13500 RBX: ffff92f890566ea0 RCX: 00000000fd3c255c RDX: 0000000000000001 RSI: ffff92fb88a13d90 RDI: ffff92fb8ee127e8 RBP: ffff92f8b7fb7c58 R08: 0000000000000001 R09: ffff92fb88a13d58 R10: 000000005a6a9373 R11: 0000000000000001 R12: 00000000fffffffb R13: ffff92fb8ee12000 R14: 00000000000034ca R15: ffff92fb8ee12620 FS: 00007f1fefd8e880(0000) GS:ffff92fb95600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc211d34cdb CR3: 000000012d43a000 CR4: 00000000001406e0 Stack: ffff92f890566ea0 ffff92f890567078 ffffffffc0b5a0c0 ffff92f890566f28 ffff92fb888b2000 ffff92f8b7fb7c80 ffffffffbc27ff55 ffff92f890566ea0 ffff92fb8bf10000 ffffffffc0b5a0c0 ffff92f8b7fb7cb0 ffffffffbc28090d Call Trace: [<ffffffffbc27ff55>] evict+0xc5/0x1a0 [<ffffffffbc28090d>] iput+0x1ad/0x2c0 [<ffffffffc0b3304c>] recover_orphan_inodes+0x10c/0x2e0 [f2fs] [<ffffffffc0b2e0f4>] f2fs_fill_super+0x884/0x1150 [f2fs] [<ffffffffbc2644ac>] mount_bdev+0x18c/0x1c0 [<ffffffffc0b2d870>] ? f2fs_commit_super+0x100/0x100 [f2fs] [<ffffffffc0b2a755>] f2fs_mount+0x15/0x20 [f2fs] [<ffffffffbc264e49>] mount_fs+0x39/0x170 [<ffffffffbc28555b>] vfs_kern_mount+0x6b/0x160 [<ffffffffbc2881df>] do_mount+0x1cf/0xd00 [<ffffffffbc287f2c>] ? copy_mount_options+0xac/0x170 [<ffffffffbc289003>] SyS_mount+0x83/0xd0 [<ffffffffbc8ee880>] entry_SYSCALL_64_fastpath+0x23/0xc1 Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:27 -07:00
Jaegeuk Kim	646e759a4d	f2fs: avoid gc in cp_error case Otherwise, we can hit f2fs_bug_on(sbi, !PageUptodate(sum_page)); Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:26 -07:00
Jaegeuk Kim	f6fe2be3c6	f2fs: should put_page for summary page We should call put_page for preloaded summary pages in do_garbage_collect. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:25 -07:00
Jaegeuk Kim	2956e450fa	f2fs: assign return value in f2fs_gc This patch adds a return value of write_checkpoint for f2fs_gc. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:24 -07:00
Weichao Guo	5b7a487cf3	f2fs: add customized migrate_page callback This patch improves the migration of dirty pages and allows migrating atomic written pages that F2FS uses in Page Cache. Instead of the fallback releasing page path, it provides better performance for memory compaction, CMA and other users of memory page migrating. For dirty pages, there is no need to write back first when migrating. For an atomic written page before committing, we can migrate the page and update the related 'inmem_pages' list at the same time. Signed-off-by: Weichao Guo <guoweichao@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix some coding style] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:23 -07:00
Chao Yu	aaec2b1d18	f2fs: introduce cp_lock to protect updating of ckpt_flags This patch introduces a spinlock to protect updates of the ckpt_flags field in struct f2fs_checkpoint, avoiding incorrect updates under a race condition. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: add __is_set_ckpt_flags likewise __set_ckpt_flags] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 17:34:20 -07:00
Eric Ren	c33f0785bf	ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock() The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally. In this testcase, we create a 2*CLUSTER_SIZE file and mmap() on it; there are 2 processes repeatedly performing the following operations respectively: one is doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a', 1), while the other is doing ftruncate(fd, 2*CLUSTER_SIZE) and then ftruncate(fd, CLUSTER_SIZE) again and again. This is the backtrace when the deadlock happens: __wait_on_bit_lock+0x50/0xa0 __lock_page+0xb7/0xc0 ocfs2_write_begin_nolock+0x163f/0x1790 [ocfs2] ocfs2_page_mkwrite+0x1c7/0x2a0 [ocfs2] do_page_mkwrite+0x66/0xc0 handle_mm_fault+0x685/0x1350 __do_page_fault+0x1d8/0x4d0 trace_do_page_fault+0x37/0xf0 do_async_page_fault+0x19/0x70 async_page_fault+0x28/0x30 In ocfs2_write_begin_nolock(), we first grab the pages and then allocate disk space for this write; ocfs2_try_to_free_truncate_log() will be called if -ENOSPC is returned; if we're lucky to get enough clusters, which is usually the case, we start over again. But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we will deadlock when trying to grab the target page again. Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write(). Another deadlock will happen in __do_page_mkwrite() if ocfs2_page_mkwrite() returns non-VM_FAULT_LOCKED, and along with a locked target page. These two errors fail on the same path, so fix them by unlocking the target page manually before ocfs2_free_write_ctxt(). Jan Kara helped me clear up the JBD2 part, and suggested the hint for the root cause. Changes since v1: 1. Also take the ENOMEM error case into consideration. Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: He Gang <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-30 15:26:52 -07:00
Eric W. Biederman	069d5ac9ae	autofs: Fix automounts by using current_real_cred()->uid Seth Forshee reports that in 4.8-rcN some automounts are failing because the uid requesting the automount changed. The relevant call path is: follow_automount() ->d_automount autofs4_d_automount autofs4_mount_wait autofs4_wait In autofs4_wait wq_uid and wq_gid are set to current_uid() and current_gid() respectively. With follow_automount now overriding creds, the uid that we export to userspace changes and that breaks existing setups. To remove the regression set wq_uid and wq_gid from current_real_cred()->uid and current_real_cred()->gid respectively. This restores the current behavior as current->real_cred is identical to current->cred except when override creds are used. Cc: stable@vger.kernel.org Fixes: `aeaa4a79ff` ("fs: Call d_automount with the filesystems creds") Reported-by: Seth Forshee <seth.forshee@canonical.com> Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-30 12:48:01 -05:00
Eric W. Biederman	d29216842a	mnt: Add a per mount namespace limit on the number of mounts CAI Qian <caiqian@redhat.com> pointed out that the semantics of shared subtrees make it possible to create an exponentially increasing number of mounts in a mount namespace. mkdir /tmp/1 /tmp/2 mount --make-rshared / for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done will create 2^20 or 1048576 mounts, which is a practical problem as some people have managed to hit this by accident. As such CVE-2016-6213 was assigned. Ian Kent <raven@themaw.net> described the situation for autofs users as follows: > The number of mounts for direct mount maps is usually not very large because of > the way they are implemented, large direct mount maps can have performance > problems. There can be anywhere from a few (likely case a few hundred) to less > than 10000, plus mounts that have been triggered and not yet expired. > > Indirect mounts have one autofs mount at the root plus the number of mounts that > have been triggered and not yet expired. > > The number of autofs indirect map entries can range from a few to the common > case of several thousand and in rare cases up to between 30000 and 50000. I've > not heard of people with maps larger than 50000 entries. > > The larger the number of map entries the greater the possibility for a large > number of active mounts so it's not hard to expect cases of a 1000 or somewhat > more active mounts. So I am setting the default number of mounts allowed per mount namespace at 100,000. This is more than enough for any use case I know of, but small enough to quickly stop an exponential increase in mounts. Which should be perfect to catch misconfigurations and malfunctioning programs. For anyone who needs a higher limit this can be changed by writing to the new /proc/sys/fs/mount-max sysctl. Tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-30 12:46:48 -05:00
Chao Yu	fadb2fb8af	f2fs: fix to avoid race condition when updating sbi flag Make updates of the sbi flag atomic by using {test,set,clear}_bit; otherwise, under concurrency, the flag could be updated incorrectly. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 10:05:50 -07:00
Jaegeuk Kim	9e1e6df412	f2fs: put directory inodes before checkpoint in roll-forward recovery Before checkpoint, we'd better drop any inodes. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 10:05:49 -07:00
Jaegeuk Kim	a468f0ef51	f2fs: use crc and cp version to determine roll-forward recovery Previously, we used cp_version only to detect recoverable dnodes. In order to avoid same garbage cp_version, we needed to truncate the next dnode during checkpoint, resulting in additional discard or data write. If we can distinguish this by using crc in addition to cp_version, we can remove this overhead. There is backward compatibility concern where it changes node_footer layout. So, this patch introduces a new checkpoint flag, CP_CRC_RECOVERY_FLAG, to detect new layout. New layout will be activated only when this flag is set. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-30 10:05:46 -07:00
Thomas Gleixner	d7e25c66c9	Merge branch 'x86/urgent' into x86/asm Get the cr4 fixes so we can apply the final cleanup	2016-09-30 12:38:28 +02:00
Ingo Molnar	0b429e18c2	Merge branch 'linus' into locking/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-09-30 10:54:46 +02:00
Eric Engestrom	18017479ca	ext4: remove unused variable Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>	2016-09-30 02:14:56 -04:00
Eric Whitney	3c816ded78	ext4: use journal inode to determine journal overhead When a file system contains an internal journal that has not been loaded, use the journal inode's i_size field to determine its contribution to the file system's overhead. (The journal's j_maxlen field is normally used to determine its size, but it's unavailable when the journal has not been loaded.) Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 02:08:49 -04:00
Eric Whitney	c6cb7e776a	ext4: create function to read journal inode Factor out the code used in ext4_get_journal() to read a valid journal inode from storage, enabling its reuse in other functions. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 02:05:09 -04:00
Jan Kara	9b623df614	ext4: unmap metadata when zeroing blocks When zeroing blocks for DAX allocations, we also have to unmap aliases in the block device mappings. Otherwise writeback can overwrite zeros with stale data from block device page cache. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2016-09-30 02:02:29 -04:00
Jan Kara	51e8137b82	ext4: remove plugging from ext4_file_write_iter() do_blockdev_direct_IO() takes care of properly plugging direct IO so there's no need to plug again inside ext4_file_write_iter(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 01:57:41 -04:00
Jan Kara	4b0524aae0	ext4: allow unlocked direct IO when pages are cached Currently we do not allow unlocked (meaning without inode_lock) direct IO when the file has any pages cached. This check is not needed anymore as we keep inode lock until ext4_direct_IO_write() and thus can happily writeback and evict any pages conflicting with current direct IO write. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 01:55:32 -04:00
Richard Weinberger	9a200d075e	ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY ...otherwise a user can enable encryption for certain files even when the filesystem is unable to support it. Such a case would be a filesystem created by mkfs.ext4's default settings, 1KiB block size. Ext4 supports encryption only when block size is equal to PAGE_SIZE. But this constraint is only checked when the encryption feature flag is set. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 01:49:55 -04:00
Eric Biggers	55be3145d1	fscrypto: use standard macros to compute length of fname ciphertext Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 01:46:18 -04:00
Eric Biggers	cc91542ac8	ext4: do not unnecessarily null-terminate encrypted symlink data Null-terminating the fscrypt_symlink_data on read is unnecessary because it is not string data --- it contains binary ciphertext. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 01:44:17 -04:00
gmail	e81d44778d	ext4: release bh in make_indexed_dir The commit `6050d47adc`: "ext4: bail out from make_indexed_dir() on first error" could end up leaking bh2 in the error path. [ Also avoid renaming bh2 to bh, which just confuses things --tytso ] Cc: stable@vger.kernel.org Signed-off-by: yangsheng <yngsion@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 01:33:37 -04:00
Jan Kara	16c5468859	ext4: Allow parallel DIO reads We can easily support parallel direct IO reads. We only have to make sure we cannot expose uninitialized data by reading allocated block to which data was not written yet, or which was already truncated. That is easily achieved by holding inode_lock in shared mode - that excludes all writes, truncates, hole punches. We also have to guard against page writeback allocating blocks for delay-allocated pages - that race is handled by the fact that we writeback all the pages in the affected range and the lock protects us from new pages being created there. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-30 01:03:17 -04:00
Olga Kornievskaia	a865880e20	Retry operation on EREMOTEIO on an interrupted slot If an operation got interrupted, then since we don't know if the server processed it or not, we keep the seq#. Upon reuse of the slot and seq#, if we get a reply from the cache (i.e. EREMOTEIO) then we need to retry the operation after bumping the seq#. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-29 12:31:48 -04:00
Martin Brandenburg	b78b11985a	Merge branch 'misc' into for-next Pull in an OrangeFS branch containing miscellaneous improvements. - clean up debugfs globals - remove dead code in sysfs - reorganize duplicated sysfs attribute structs - consolidate sysfs show and store functions - remove duplicated sysfs_ops structures - describe organization of sysfs - make devreq_mutex static - g_orangefs_stats -> orangefs_stats for consistency - rename most remaining global variables	2016-09-28 14:50:46 -04:00
Al Viro	dbbab32574	cifs: get rid of unused arguments of CIFSSMBWrite() they used to be used, but... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:54:53 -04:00
Andreas Gruenbacher	2211d5ba5c	posix_acl: xattr representation cleanups Remove the unnecessary typedefs and the zero-length a_entries array in struct posix_acl_xattr_header. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:52:00 -04:00
Rasmus Villemoes	de04e76935	fs/aio.c: eliminate redundant loads in put_aio_ring_file Using a local variable we can prevent gcc from reloading aio_ring_file->f_inode->i_mapping twice, eliminating 2x2 dependent loads. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:45:46 -04:00
Rasmus Villemoes	be218aa2e3	fs/internal.h: add const to ns_dentry_operations declaration The actual definition in fs/nsfs.c is already const. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:45:46 -04:00
Arnd Bergmann	9dcfcda576	compat: remove compat_printk() After `7e8e385aaf` ("x86/compat: Remove sys32_vm86_warning"), this function has become unused, so we can remove it as well. Link: http://lkml.kernel.org/r/20160617142903.3070388-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2016-09-27 21:20:53 -04:00
Deepa Dinamani	c2050a454c	fs: Replace current_fs_time() with current_time() current_fs_time() uses struct super_block* as an argument. As per Linus's suggestion, this is changed to take struct inode* as a parameter instead. This is because the function is primarily meant for vfs inode timestamps. Also the function was renamed as per Arnd's suggestion. Change all calls to current_fs_time() to use the new current_time() function instead. current_fs_time() will be deleted. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:06:22 -04:00
Deepa Dinamani	02027d42c3	fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps CURRENT_TIME_SEC is not y2038 safe. current_time() will be transitioned to use 64 bit time along with vfs in a separate patch. There is no plan to transition CURRENT_TIME_SEC to use y2038 safe time interfaces. current_time() will also be extended to use superblock range checking parameters when range checking is introduced. This works because alloc_super() fills in the s_time_gran in super block to NSEC_PER_SEC. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:06:22 -04:00
Deepa Dinamani	078cd8279e	fs: Replace CURRENT_TIME with current_time() for inode timestamps CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_time() instead. CURRENT_TIME is also not y2038 safe. This is also in preparation for the patch that transitions vfs timestamps to use 64 bit time and hence make them y2038 safe. As part of the effort current_time() will be extended to do range checks. Hence, it is necessary for all file system timestamps to use current_time(). Also, current_time() will be transitioned along with vfs to be y2038 safe. Note that whenever a single call to current_time() is used to change timestamps in different inodes, it is because they share the same time granularity. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Felipe Balbi <balbi@kernel.org> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:06:21 -04:00
Deepa Dinamani	2554c72edb	fs: proc: Delete inode time initializations in proc_alloc_inode() proc uses new_inode_pseudo() to allocate a new inode. This in turn calls the proc_inode_alloc() callback. But, at this point, inode is still not initialized with the super_block pointer which only happens just before alloc_inode() returns after the call to inode_init_always(). Also, the inode times are initialized again after the call to new_inode_pseudo() in proc_inode_alloc(). The assignment in proc_alloc_inode() is redundant and also doesn't work after the current_time() api is changed to take struct inode* instead of struct super_block*. This bug was reported after current_time() was used to assign times in proc_alloc_inode(). Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot] Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:06:20 -04:00
Deepa Dinamani	3cd886666f	vfs: Add current_time() api current_fs_time() is used for inode timestamps. Change the signature of the function to take inode pointer instead of superblock as per Linus's suggestion. Also, move the api under vfs as per the discussion on the thread: https://lkml.org/lkml/2016/6/9/36 . As per Arnd's suggestion on the thread, changing the function name. current_fs_time() will be deleted after all the references to it are replaced by current_time(). There was a bug reported by kbuild test bot with the change as some of the calls to current_time() were made before the super_block was initialized. Catch these accidental assignments as timespec_trunc() does for wrong granularities. This allows for the function to work right even in these circumstances. But, adds a warning to make the user aware of the bug. A coccinelle script was used to identify all the current .alloc_inode super_block callbacks that updated inode timestamps. proc filesystem was the only one that was modifying inode times as part of this callback. The series includes a patch to fix that. Note that timespec_trunc() will also be moved to fs/inode.c in a separate patch when this will need to be revamped for bounds checking purposes. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 21:06:20 -04:00
Eric Biggers	0026ba4008	fs/buffer.c: make __getblk_slow() static __getblk_slow() was exported to modules in commit `3b5e6454aa` ("fs/buffer.c: support buffer cache allocations with gfp modifiers"). This seems to have been a mistake, as no users were introduced nor was the function declared in a header. Change it back to 'static'. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 18:47:38 -04:00
Alexey Dobriyan	771187d61b	proc: unsigned file descriptors Make struct proc_inode::fd unsigned. This allows better code generation on x86_64 (less sign extensions). Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 18:47:38 -04:00
Alexey Dobriyan	9b80a184ea	fs/file: more unsigned file descriptors Propagate unsignedness for grand total of 149 bytes: $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149) function old new delta set_close_on_exec 99 98 -1 put_files_struct 201 200 -1 get_close_on_exec 59 58 -1 do_prlimit 498 497 -1 do_execveat_common.isra 1662 1661 -1 __close_fd 178 173 -5 do_dup2 219 204 -15 seq_show 685 660 -25 __alloc_fd 384 357 -27 dup_fd 718 646 -72 It mostly comes from converting "unsigned int" to "long" for bit operations. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 18:47:38 -04:00
Shawn Lin	85e7340f21	fs: compat: remove redundant check of nr_segs nr_segs should never be less than zero as its type is unsigned long, so let's remove this check. Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 18:47:38 -04:00
David Howells	a818101d7b	cachefiles: Fix attempt to read i_blocks after deleting file [ver #2 ] A NULL-pointer dereference happens in cachefiles_mark_object_inactive() when it tries to read i_blocks so that it can tell the cachefilesd daemon how much space it's making available. The problem is that cachefiles_drop_object() calls cachefiles_mark_object_inactive() after calling cachefiles_delete_object() because the object being marked active staves off attempts to (re-)use the file at that filename until after it has been deleted. This means that d_inode is NULL by the time we come to try to access it. To fix the problem, have the caller of cachefiles_mark_object_inactive() supply the number of blocks freed up. Without this, the following oops may occur: BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 IP: [<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles] ... CPU: 11 PID: 527 Comm: kworker/u64:4 Tainted: G I ------------ 3.10.0-470.el7.x86_64 #1 Hardware name: Hewlett-Packard HP Z600 Workstation/0B54h, BIOS 786G4 v03.19 03/11/2011 Workqueue: fscache_object fscache_object_work_func [fscache] task: ffff880035edaf10 ti: ffff8800b77c0000 task.ti: ffff8800b77c0000 RIP: 0010:[<ffffffffa06c5cc1>] cachefiles_mark_object_inactive+0x61/0xb0 [cachefiles] RSP: 0018:ffff8800b77c3d70 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8800bf6cc400 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffff880090ffc710 RDI: ffff8800bf761ef8 RBP: ffff8800b77c3d88 R08: 2000000000000000 R09: 0090ffc710000000 R10: ff51005d2ff1c400 R11: 0000000000000000 R12: ffff880090ffc600 R13: ffff8800bf6cc520 R14: ffff8800bf6cc400 R15: ffff8800bf6cc498 FS: 0000000000000000(0000) GS:ffff8800bb8c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000098 CR3: 00000000019ba000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffff880090ffc600 ffff8800bf6cc400 ffff8800867df140 ffff8800b77c3db0 ffffffffa06c48cb ffff880090ffc600 ffff880090ffc180 ffff880090ffc658 ffff8800b77c3df0 ffffffffa085d846 ffff8800a96b8150 ffff880090ffc600 Call Trace: [<ffffffffa06c48cb>] cachefiles_drop_object+0x6b/0xf0 [cachefiles] [<ffffffffa085d846>] fscache_drop_object+0xd6/0x1e0 [fscache] [<ffffffffa085d615>] fscache_object_work_func+0xa5/0x200 [fscache] [<ffffffff810a605b>] process_one_work+0x17b/0x470 [<ffffffff810a6e96>] worker_thread+0x126/0x410 [<ffffffff810a6d70>] ? rescuer_thread+0x460/0x460 [<ffffffff810ae64f>] kthread+0xcf/0xe0 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140 [<ffffffff81695418>] ret_from_fork+0x58/0x90 [<ffffffff810ae580>] ? kthread_create_on_node+0x140/0x140 The oopsing code shows: callq 0xffffffff810af6a0 <wake_up_bit> mov 0xf8(%r12),%rax mov 0x30(%rax),%rax mov 0x98(%rax),%rax <---- oops here lock add %rax,0x130(%rbx) where this is: d_backing_inode(object->dentry)->i_blocks Fixes: `a5b3a80b89` (CacheFiles: Provide read-and-reset release counters for cachefilesd) Reported-by: Jianhong Yin <jiyin@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 18:31:29 -04:00
Al Viro	fc56b9838a	cifs: don't use memcpy() to copy struct iov_iter it's not 70s anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 18:13:04 -04:00
Al Viro	4bce9f6ee8	get rid of separate multipage fault-in primitives * the only remaining callers of "short" fault-ins are just as happy with generic variants (both in lib/iov_iter.c); switch them to multipage variants, kill the "short" ones * rename the multipage variants to now available plain ones. * get rid of compat macro defining iov_iter_fault_in_multipage_readable by expanding it in its only user. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2016-09-27 18:12:24 -04:00
Trond Myklebust	bfc505ded0	pNFS: Fix atime updates on pNFS clients Fix the code so that we always mark the atime as invalid in nfs4_read_done(). Currently, the expectation appears to be that the pNFS drivers should always do this, with the result that most of them don't. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:36 -04:00
Trond Myklebust	8a64c4ef10	NFSv4.1: Even if the stateid is OK, we may need to recover the open modes TEST_STATEID only tells you that you have a valid open stateid. It doesn't tell the client anything about whether or not it holds the required share locks. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> [Anna: Wrap nfs_open_stateid_recover_openmode in CONFIG_NFS_V4_1 checks] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:31 -04:00
Trond Myklebust	7ebeb7fe74	NFSv4: If recovery failed for a specific open stateid, then don't retry Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:27 -04:00
Trond Myklebust	76e8a1bd14	NFSv4: Fix retry issues with nfs41_test/free_stateid _nfs41_free_stateid() needs to be cached by the session, but nfs41_test_stateid() may return NFS4ERR_RETRY_UNCACHED_REP (in which case we should just retry). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:23 -04:00
Trond Myklebust	304020fe48	NFSv4: Open state recovery must account for file permission changes If the file permissions change on the server, then we may not be able to recover open state. If so, we need to ensure that we mark the file descriptor appropriately. Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:19 -04:00
Trond Myklebust	67dd483026	NFSv4: Mark the lock and open stateids as invalid after freeing them Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:15 -04:00
Trond Myklebust	b134fc4a53	NFSv4: Don't test open_stateid unless it is set We need to test the NFS_OPEN_STATE flag for whether or not the open_stateid is valid. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:11 -04:00
Trond Myklebust	272289a3df	NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid If we're not yet sure that all state has expired or been revoked, we should try to do a minimal recovery on just the one stateid. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:07 -04:00
Trond Myklebust	7f04883146	NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation Don't rely on nfs_inode_detach_delegation() succeeding. That can race... Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:04 -04:00
Trond Myklebust	1393d9612b	NFSv4: Fix a race when updating an open_stateid If we're replacing an old stateid which has a different 'other' field, then we probably need to free the old stateid. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:35:00 -04:00
Trond Myklebust	b1a318de9b	NFSv4: Fix a race in nfs_inode_reclaim_delegation() If we race with a delegreturn before taking the spin lock, we currently end up dropping the delegation stateid. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:54 -04:00
Trond Myklebust	9c27869d3f	NFSv4: Pass the stateid to the exception handler in nfs4_read/write_done_cb The actual stateid used in the READ or WRITE can represent a delegation, a lock or a stateid, so it is useful to pass it as an argument to the exception handler when an expired/revoked response is received from the server. It also ensures that we don't re-label the state as needing recovery if that has already occurred. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:50 -04:00
Trond Myklebust	26f474432a	NFSv4.1: nfs4_layoutget_handle_exception handle revoked state Handle revoked open/lock/delegation stateids when LAYOUTGET tells us the state was revoked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:46 -04:00
Trond Myklebust	d7f3e4bfe7	NFSv4: nfs4_handle_setlk_error() handle expiration as revoke case If the server tells us our stateid has expired, then handle that as if it was revoked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:42 -04:00
Trond Myklebust	404ea3569a	NFSv4: nfs4_handle_delegation_recall_error() handle expiration as revoke case If the server tells us our stateid has expired, then handle that as if it was revoked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:38 -04:00
Trond Myklebust	6c2d8f8d30	NFSv4: nfs_inode_find_state_and_recover() should check all stateids Modify the helper nfs_inode_find_state_and_recover() so that it can check all open/lock/delegation state trackers on that inode for whether or not they are affected by a revoked stateid error. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:35 -04:00
Trond Myklebust	059b43e974	NFSv4: Ensure we don't re-test revoked and freed stateids This fixes a potential infinite loop in nfs_reap_expired_delegations. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:31 -04:00
Trond Myklebust	26d36301bd	NFSv4.1: Ensure we call FREE_STATEID if needed on close/delegreturn/locku If a server returns NFS4ERR_ADMIN_REVOKED, NFS4ERR_DELEG_REVOKED or NFS4ERR_EXPIRED on a call to close, open_downgrade, delegreturn, or locku, we should call FREE_STATEID before attempting to recover. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:27 -04:00
Trond Myklebust	f0b0bf8826	NFSv4.1: FREE_STATEID can be asynchronous Nothing should need to be serialised with FREE_STATEID on the client, so let's make the RPC call always asynchronous. Also constify the stateid argument. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:23 -04:00
Trond Myklebust	c5896fc862	NFSv4.1: Ensure we always run TEST/FREE_STATEID on locks Right now, we're only running TEST/FREE_STATEID on the locks if the open stateid recovery succeeds. The protocol requires us to always do so. The fix would be to move the call to TEST/FREE_STATEID and do it before we attempt open recovery. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:12 -04:00
Trond Myklebust	f7a62adad0	NFSv4.1: Allow revoked stateids to skip the call to TEST_STATEID In some cases (e.g. when the SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED sequence flag is set) we may already know that the stateid was revoked and that the only valid operation we can call is FREE_STATEID. In those cases, allow the stateid to carry the information in the type field, so that we skip the redundant call to TEST_STATEID. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:34:01 -04:00
Trond Myklebust	63d63cbf5e	NFSv4.1: Don't recheck delegations that have already been checked Ensure we don't spam the server with test_stateid() calls for delegations that have already been checked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:33:55 -04:00
Trond Myklebust	bb3d1a3b24	NFSv4.1: Deal with server reboots during delegation expiration recovery Ensure that if the server reboots while we're testing and recovering from revoked delegations, we exit to allow the state manager to handle matters. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:33:49 -04:00
Trond Myklebust	45870d6909	NFSv4.1: Test delegation stateids when server declares "some state revoked" According to RFC5661, if any of the SEQUENCE status bits SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, SEQ4_STATUS_ADMIN_STATE_REVOKED, or SEQ4_STATUS_RECALLABLE_STATE_REVOKED are set, then we need to use TEST_STATEID to figure out which stateids have been revoked, so we can acknowledge the loss of state using FREE_STATEID. While we already do this for open and lock state, we have not been doing so for all the delegations. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:33:44 -04:00
Trond Myklebust	41020b671a	NFSv4.x: Allow callers of nfs_remove_bad_delegation() to specify a stateid Allow the callers of nfs_remove_bad_delegation() to specify the stateid that needs to be marked as bad. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:33:37 -04:00
Trond Myklebust	4586f6e283	NFSv4.1: Add a helper function to deal with expired stateids In NFSv4.1 and newer, if the server decides to revoke some or all of the protocol state, the client is required to iterate through all the stateids that it holds and call TEST_STATEID to determine which stateids still correspond to valid state, and then call FREE_STATEID on the others. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:33:21 -04:00
Trond Myklebust	43912bbbae	NFSv4.1: Allow test_stateid to handle session errors without waiting If the server crashes while we're testing stateids for validity, then we want to initiate session recovery. Usually, we will be calling from a state manager thread, though, so we don't really want to wait. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:32:59 -04:00
Trond Myklebust	4c8e544746	NFSv4.1: Don't check delegations that are already marked as revoked If the delegation has been marked as revoked, we don't have to test it, because we should already have called FREE_STATEID on it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Olek Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:32:41 -04:00
Trond Myklebust	aa05c87f23	NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid We must not allow the use of delegations that have been revoked or are being returned. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Fixes: `869f9dfa4d` ("NFSv4: Fix races between nfs_remove_bad_delegation()...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.19+ Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:32:31 -04:00
Trond Myklebust	b3f9e72390	NFSv4: Don't report revoked delegations as valid in nfs_have_delegation() If the delegation is revoked, then it can't be used for caching. Fixes: `869f9dfa4d` ("NFSv4: Fix races between nfs_remove_bad_delegation()...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v3.19+ Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:32:12 -04:00
Trond Myklebust	7dc72d5f7a	NFS: Fix inode corruption in nfs_prime_dcache() Due to inode number reuse in filesystems, we can end up corrupting the inode on our client if we apply the file attributes without ensuring that the filehandle matches. Typical symptoms include spurious "mode changed" reports in the syslog. We still do want to ensure that we don't invalidate the dentry if the inode number matches, but we don't have a filehandle. Fixes: `fa9233699c` ("NFS: Don't require a filehandle to refresh...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.0+ Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:31:52 -04:00
Trond Myklebust	0a014a44a5	NFSv4.1: Don't deadlock the state manager on the SEQUENCE status flags As described in RFC5661, section 18.46, some of the status flags exist in order to tell the client when it needs to acknowledge the existence of revoked state on the server and/or to recover state. Those flags will then remain set until the recovery procedure is done. In order to avoid looping, the client therefore needs to ignore those particular flags while recovering. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-27 14:31:27 -04:00
Jan Kara	225c5161b1	ext2: Unmap metadata when zeroing blocks When zeroing blocks for DAX allocations, we also have to unmap aliases in the block device mappings. Otherwise writeback can overwrite zeros with stale data from block device page cache. Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-27 18:16:55 +02:00
Eric Engestrom	a1a9e5d298	debugfs: propagate release() call result The result was being ignored and 0 was always returned. Return the actual result instead. Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-27 12:45:57 +02:00
Johannes Thumshirn	78618d395b	sysfs print name of undiscoverable attribute group Print the name of an undiscoverable attribute group and not the pointer's address. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-27 12:24:29 +02:00
Miklos Szeredi	2773bf00ae	fs: rename "rename2" i_op to "rename" Generated patch: sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2` sed -i "s/\brename2\b/rename/g" `git grep -wl rename2` Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-09-27 11:03:58 +02:00
Miklos Szeredi	18fc84dafa	vfs: remove unused i_op->rename No in-tree uses remain. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-09-27 11:03:58 +02:00
Miklos Szeredi	1cd66c93ba	fs: make remaining filesystems use .rename2 This is trivial to do: - add flags argument to foo_rename() - check if flags is zero - assign foo_rename() to .rename2 instead of .rename This doesn't mean it's impossible to support RENAME_NOREPLACE for these filesystems, but it is not trivial, like for local filesystems. RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible for a file to be created on one host while it is overwritten by rename on another host). Filesystems converted: 9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs. After this, we can get rid of the duplicate interfaces for rename. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Howells <dhowells@redhat.com> [AFS] Acked-by: Mike Marshall <hubcap@omnibond.com> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Mark Fasheh <mfasheh@suse.com>	2016-09-27 11:03:58 +02:00
Miklos Szeredi	e0e0be8a83	libfs: support RENAME_NOREPLACE in simple_rename() This is trivial to do: - add flags argument to simple_rename() - check if flags doesn't have any other than RENAME_NOREPLACE - assign simple_rename() to .rename2 instead of .rename Filesystems converted: hugetlbfs, ramfs, bpf. Debugfs uses simple_rename() to implement debugfs_rename(), which is for debugfs instances to rename files internally, not for userspace filesystem access. For this case pass zero flags to simple_rename(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexei Starovoitov <ast@kernel.org>	2016-09-27 11:03:57 +02:00
Miklos Szeredi	f03b8ad8d3	fs: support RENAME_NOREPLACE for local filesystems This is trivial to do: - add flags argument to foo_rename() - check if flags doesn't have any other than RENAME_NOREPLACE - assign foo_rename() to .rename2 instead of .rename Filesystems converted: affs, bfs, exofs, ext2, hfs, hfsplus, jffs2, jfs, logfs, minix, msdos, nilfs2, omfs, reiserfs, sysvfs, ubifs, udf, ufs, vfat. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: Boaz Harrosh <ooo@electrozaur.com> Acked-by: Richard Weinberger <richard@nod.at> Acked-by: Bob Copeland <me@bobcopeland.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Christoph Hellwig <hch@infradead.org>	2016-09-27 11:03:57 +02:00
Miklos Szeredi	9a232de499	ncpfs: fix unused variable warning Without CONFIG_NCPFS_NLS the following warning is seen: fs/ncpfs/dir.c: In function 'ncp_hash_dentry': fs/ncpfs/dir.c:136:23: warning: unused variable 'sb' [-Wunused-variable] struct super_block *sb = dentry->d_sb; Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-09-27 11:03:57 +02:00
J. Bruce Fields	7d22fc11c7	nfsd4: setclientid_confirm with unmatched verifier should fail A setclientid_confirm with (clientid, verifier) both matching an existing confirmed record is assumed to be a replay, but if the verifier doesn't match, it shouldn't be. This would be a very rare case, except that clients following https://tools.ietf.org/html/rfc7931#section-5.8 may depend on the failure. Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-26 15:20:38 -04:00
J. Bruce Fields	ebd7c72c63	nfsd: randomize SETCLIENTID reply to help distinguish servers NFSv4.1 has built-in trunking support that allows a client to determine whether two connections to two different IP addresses are actually to the same server. NFSv4.0 does not, but RFC 7931 attempts to provide clients a means to do this, basically by performing a SETCLIENTID to one address and confirming it with a SETCLIENTID_CONFIRM to the other. Linux clients since `05f4c350ee` "NFS: Discover NFSv4 server trunking when mounting" implement a variation on this suggestion. It is possible that other clients do too. This depends on the clientid and verifier not being accepted by an unrelated server. Since both are 64-bit values, that would be very unlikely if they were random numbers. But they aren't: knfsd generates the 64-bit clientid by concatenating the 32-bit boot time (in seconds) and a counter. This makes collisions between clientids generated by the same server extremely unlikely. But collisions are very likely between clientids generated by servers that boot at the same time, and it's quite common for multiple servers to boot at the same time. The verifier is a concatenation of the SETCLIENTID time (in seconds) and a counter, so again collisions between different servers are likely if multiple SETCLIENTIDs are done at the same time, which is a common case. Therefore recent NFSv4.0 clients may decide two different servers are really the same, and mount a filesystem from the wrong server. Fortunately the Linux client, since `55b9df93dd` "nfsv4/v4.1: Verify the client owner id during trunking detection", only does this when given the non-default "migration" mount option. The fault is really with RFC 7931, and needs a client fix, but in the meantime we can mitigate the chance of these collisions by randomizing the starting value of the counters used to generate clientids and verifiers. 
Reported-by: Frank Sorenson <fsorenso@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-26 15:20:38 -04:00
Jeff Layton	19e4c3477f	nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies If we are using v4.1+, then we can send notification when contended locks become free. Inform the client of that fact. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-26 15:20:37 -04:00
Jeff Layton	7919d0a27f	nfsd: add a LRU list for blocked locks It's possible for a client to call in on a lock that is blocked for a long time, but discontinue polling for it. A malicious client could even set a lock on a file, and then spam the server with failing lock requests from different lockowners that pile up in a DoS attack. Add the blocked lock structures to a per-net namespace LRU when hashing them, and timestamp them. If the lock request is not revisited after a lease period, we'll drop it under the assumption that the client is no longer interested. This also gives us a mechanism to clean up these objects at server shutdown time as well. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-26 15:20:36 -04:00
Jeff Layton	76d348fadf	nfsd: have nfsd4_lock use blocking locks for v4.1+ locks Create a new per-lockowner+per-inode structure that contains a file_lock. Have nfsd4_lock add this structure to the lockowner's list prior to setting the lock. Then call the vfs and request a blocking lock (by setting FL_SLEEP). If we get anything besides FILE_LOCK_DEFERRED back, then we dequeue the block structure and free it. When the next lock request comes in, we'll look for an existing block for the same filehandle and dequeue and reuse it if there is one. When the lock comes free (a'la an lm_notify call), we dequeue it from the lockowner's list and kick off a CB_NOTIFY_LOCK callback to inform the client that it should retry the lock request. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-26 15:20:36 -04:00
Jeff Layton	a188620ebd	nfsd: plumb in a CB_NOTIFY_LOCK operation Add the encoding/decoding for CB_NOTIFY_LOCK operations. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-26 15:20:35 -04:00
Andreas Gruenbacher	332f51d7db	gfs2: Initialize atime of I_NEW inodes Fix for commit `719ee344`: initialize atime of I_NEW inodes to 0 so that the timestamps read from disk will always be more recent than the initial timestamp, and the atime in the I_NEW inode will be set correctly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2016-09-26 13:24:34 -05:00
Andreas Gruenbacher	d7c436cd60	gfs2: Update file times after grabbing glock In gfs2_page_mkwrite, grab the inode glock in EX mode before calling file_update_time: grabbing the lock may result in a call to gfs2_dinode_in, which will reset the file times to their on-disk state. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2016-09-26 13:20:19 -05:00
Vasily Averin	1eca45f8a8	NFSD: fix corruption in notifier registration By design, a notifier can be registered only once; however, nfsd registers the same inetaddr notifiers per net-namespace. When this happens, it corrupts the list of notifiers: as a result, some notifiers may not be called on the proper event, traversal of the list can loop forever, and a second unregister can access already freed memory. Cc: stable@vger.kernel.org fixes: `36684996` ("nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-26 14:17:45 -04:00
Liu Bo	196e02490c	Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf When we're not able to get enough space through splitting leaf, we'd create a new sibling leaf instead, and it's possible that we return a zero-nritem sibling leaf and mark it dirty before it's in a consistent state. With CONFIG_BTRFS_FS_CHECK_INTEGRITY=y, the integrity check of check_leaf will report panic due to this zero-nritem non-root leaf. This removes the unnecessary btrfs_mark_buffer_dirty. Reported-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:50:44 +02:00
Josef Bacik	4867268c57	Btrfs: don't BUG() during drop snapshot Really there's lots of things that can go wrong here, kill all the BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO instead. Signed-off-by: Josef Bacik <jbacik@fb.com> [ switched to btrfs_err, errors go to common label ] Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Arnd Bergmann	2fd57fcb16	btrfs: fix btrfs_no_printk stub helper The addition of btrfs_no_printk() caused a build failure when CONFIG_PRINTK is disabled: fs/btrfs/send.c: In function 'send_rename': fs/btrfs/ctree.h:3367:2: error: implicit declaration of function 'btrfs_no_printk' [-Werror=implicit-function-declaration] This moves the helper outside of that #ifdef so it is always defined, and changes the existing #ifdef to refer to that helper as well for consistency. Fixes: 47c57058ff2c ("btrfs: btrfs_debug should consume fs_info when DEBUG is not defined") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Liu Bo	851cd173f0	Btrfs: memset to avoid stale content in btree leaf This is an additional patch to "Btrfs: memset to avoid stale content in btree node block". This uses memset to initialize the unused space in a leaf to avoid potential stale content, which may be incurred by pushing items between sibling leaves. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues	0f5053eb90	btrfs: parent_start initialization cleanup Code cleanup. parent_start is initialized multiple times when it is not necessary to do so. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues	6cea66e544	btrfs: Remove already completed TODO comment Fixes: `7cf5b97650` ("btrfs: qgroup: Cleanup old inaccurate facilities") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues	dd12d5b804	btrfs: Do not reassign count in btrfs_run_delayed_refs Code cleanup. count is already (unsigned long)-1. That is the reason run_all was set. Do not reassign it to (unsigned long)-1. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Anand Jain	0ccd05285e	btrfs: fix a possible umount deadlock btrfs_show_devname() is using the device_list_mutex, sometimes a call to blkdev_put() leads vfs calling into this func. So call blkdev_put() outside of device_list_mutex, as of now. [ 983.284212] ====================================================== [ 983.290401] [ INFO: possible circular locking dependency detected ] [ 983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted [ 983.302081] ------------------------------------------------------- [ 983.308357] umount/21720 is trying to acquire lock: [ 983.313243] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.321264] [ 983.321264] but task is already holding lock: [ 983.327101] (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs] [ 983.337839] [ 983.337839] which lock already depends on the new lock. [ 983.337839] [ 983.346024] [ 983.346024] the existing dependency chain (in reverse order) is: [ 983.353512] -> #4 (&fs_devs->device_list_mutex){+.+...}: [ 983.359096] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.365143] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.371521] [<ffffffffc02d8116>] btrfs_show_devname+0x36/0x1f0 [btrfs] [ 983.378710] [<ffffffff9129523e>] show_vfsmnt+0x4e/0x150 [ 983.384593] [<ffffffff9126ffc7>] m_show+0x17/0x20 [ 983.389957] [<ffffffff91276405>] seq_read+0x2b5/0x3b0 [ 983.395669] [<ffffffff9124c808>] __vfs_read+0x28/0x100 [ 983.401464] [<ffffffff9124eb3b>] vfs_read+0xab/0x150 [ 983.407080] [<ffffffff9124ec32>] SyS_read+0x52/0xb0 [ 983.412609] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.419617] -> #3 (namespace_sem){++++++}: [ 983.424024] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.430074] [<ffffffff918239e9>] down_write+0x49/0x80 [ 983.435785] [<ffffffff91272457>] lock_mount+0x67/0x1c0 [ 983.441582] [<ffffffff91272ab2>] do_add_mount+0x32/0xf0 [ 983.447458] [<ffffffff9127363a>] finish_automount+0x5a/0xc0 [ 
983.453682] [<ffffffff91259513>] follow_managed+0x1b3/0x2a0 [ 983.459912] [<ffffffff9125b750>] lookup_fast+0x300/0x350 [ 983.465875] [<ffffffff9125d6e7>] path_openat+0x3a7/0xaa0 [ 983.471846] [<ffffffff9125ef75>] do_filp_open+0x85/0xe0 [ 983.477731] [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0 [ 983.483702] [<ffffffff9124c4de>] SyS_open+0x1e/0x20 [ 983.489240] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.496254] -> #2 (&sb->s_type->i_mutex_key#3){+.+.+.}: [ 983.501798] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.507855] [<ffffffff918239e9>] down_write+0x49/0x80 [ 983.513558] [<ffffffff91366237>] start_creating+0x87/0x100 [ 983.519703] [<ffffffff91366647>] debugfs_create_dir+0x17/0x100 [ 983.526195] [<ffffffff911df153>] bdi_register+0x93/0x210 [ 983.532165] [<ffffffff911df313>] bdi_register_owner+0x43/0x70 [ 983.538570] [<ffffffff914080fb>] device_add_disk+0x1fb/0x450 [ 983.544888] [<ffffffff91580226>] loop_add+0x1e6/0x290 [ 983.550596] [<ffffffff91fec358>] loop_init+0x10b/0x14f [ 983.556394] [<ffffffff91002207>] do_one_initcall+0xa7/0x180 [ 983.562618] [<ffffffff91f932e0>] kernel_init_freeable+0x1cc/0x266 [ 983.569370] [<ffffffff918174be>] kernel_init+0xe/0x100 [ 983.575166] [<ffffffff9182620f>] ret_from_fork+0x1f/0x40 [ 983.581131] -> #1 (loop_index_mutex){+.+.+.}: [ 983.585801] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.591858] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.598256] [<ffffffff9157ed3f>] lo_open+0x1f/0x60 [ 983.603704] [<ffffffff9128eec3>] __blkdev_get+0x123/0x400 [ 983.609757] [<ffffffff9128f4ea>] blkdev_get+0x34a/0x350 [ 983.615639] [<ffffffff9128f554>] blkdev_open+0x64/0x80 [ 983.621428] [<ffffffff9124aff6>] do_dentry_open+0x1c6/0x2d0 [ 983.627651] [<ffffffff9124c029>] vfs_open+0x69/0x80 [ 983.633181] [<ffffffff9125db74>] path_openat+0x834/0xaa0 [ 983.639152] [<ffffffff9125ef75>] do_filp_open+0x85/0xe0 [ 983.645035] [<ffffffff9124c41c>] do_sys_open+0x14c/0x1f0 [ 983.650999] [<ffffffff9124c4de>] 
SyS_open+0x1e/0x20 [ 983.656535] [<ffffffff91825fc0>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 983.663541] -> #0 (&bdev->bd_mutex){+.+.+.}: [ 983.668107] [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0 [ 983.674510] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.680561] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.686967] [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.692761] [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs] [ 983.699699] [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs] [ 983.707178] [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs] [ 983.714380] [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs] [ 983.721061] [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs] [ 983.727908] [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100 [ 983.734744] [<ffffffff91250f56>] kill_anon_super+0x16/0x30 [ 983.740888] [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs] [ 983.747909] [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80 [ 983.754745] [<ffffffff912515fd>] deactivate_super+0x5d/0x70 [ 983.760977] [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80 [ 983.766773] [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20 [ 983.772738] [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0 [ 983.778708] [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4 [ 983.785373] [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0 [ 983.792212] [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1 [ 983.799225] [ 983.799225] other info that might help us debug this: [ 983.799225] [ 983.807291] Chain exists of: &bdev->bd_mutex --> namespace_sem --> &fs_devs->device_list_mutex [ 983.816521] Possible unsafe locking scenario: [ 983.816521] [ 983.822489] CPU0 CPU1 [ 983.827043] ---- ---- [ 983.831599] lock(&fs_devs->device_list_mutex); [ 983.836289] lock(namespace_sem); [ 983.842268] lock(&fs_devs->device_list_mutex); [ 983.849478] lock(&bdev->bd_mutex); [ 983.853127] [ 983.853127] * DEADLOCK * [ 983.853127] [ 983.859113] 3 locks held by 
umount/21720: [ 983.863145] #0: (&type->s_umount_key#35){++++..}, at: [<ffffffff912515f5>] deactivate_super+0x55/0x70 [ 983.872713] #1: (uuid_mutex){+.+.+.}, at: [<ffffffffc033d8d3>] btrfs_close_devices+0x23/0xa0 [btrfs] [ 983.882206] #2: (&fs_devs->device_list_mutex){+.+...}, at: [<ffffffffc033d6f6>] __btrfs_close_devices+0x46/0x200 [btrfs] [ 983.893422] [ 983.893422] stack backtrace: [ 983.897824] CPU: 6 PID: 21720 Comm: umount Not tainted 4.8.0-rc5-ceph-00023-g1b39cec2 #1 [ 983.905958] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015 [ 983.913492] 0000000000000000 ffff8c8a53c17a38 ffffffff91429521 ffffffff9260f4f0 [ 983.921018] ffffffff92642760 ffff8c8a53c17a88 ffffffff911b2b04 0000000000000050 [ 983.928542] ffffffff9237d620 ffff8c8a5294aee0 ffff8c8a5294aeb8 ffff8c8a5294aee0 [ 983.936072] Call Trace: [ 983.938545] [<ffffffff91429521>] dump_stack+0x85/0xc4 [ 983.943715] [<ffffffff911b2b04>] print_circular_bug+0x1fb/0x20c [ 983.949748] [<ffffffff910def43>] __lock_acquire+0x1003/0x17b0 [ 983.955613] [<ffffffff910dfd0c>] lock_acquire+0x1bc/0x1f0 [ 983.961123] [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150 [ 983.966550] [<ffffffff91823125>] mutex_lock_nested+0x65/0x350 [ 983.972407] [<ffffffff9128ec51>] ? blkdev_put+0x31/0x150 [ 983.977832] [<ffffffff9128ec51>] blkdev_put+0x31/0x150 [ 983.983101] [<ffffffffc033481f>] btrfs_close_bdev+0x4f/0x60 [btrfs] [ 983.989500] [<ffffffffc033d77b>] __btrfs_close_devices+0xcb/0x200 [btrfs] [ 983.996415] [<ffffffffc033d8db>] btrfs_close_devices+0x2b/0xa0 [btrfs] [ 984.003068] [<ffffffffc03081c5>] close_ctree+0x265/0x340 [btrfs] [ 984.009189] [<ffffffff9126cc5e>] ? 
evict_inodes+0x15e/0x170 [ 984.014881] [<ffffffffc02d7959>] btrfs_put_super+0x19/0x20 [btrfs] [ 984.021176] [<ffffffff91250e2f>] generic_shutdown_super+0x6f/0x100 [ 984.027476] [<ffffffff91250f56>] kill_anon_super+0x16/0x30 [ 984.033082] [<ffffffffc02da97e>] btrfs_kill_super+0x1e/0x130 [btrfs] [ 984.039548] [<ffffffff91250fe9>] deactivate_locked_super+0x49/0x80 [ 984.045839] [<ffffffff912515fd>] deactivate_super+0x5d/0x70 [ 984.051525] [<ffffffff91270a1c>] cleanup_mnt+0x5c/0x80 [ 984.056774] [<ffffffff91270a92>] __cleanup_mnt+0x12/0x20 [ 984.062201] [<ffffffff910aa2fe>] task_work_run+0x7e/0xc0 [ 984.067625] [<ffffffff91081b5a>] exit_to_usermode_loop+0x7e/0xb4 [ 984.073747] [<ffffffff910039eb>] syscall_return_slowpath+0xbb/0xd0 [ 984.080038] [<ffffffff9182605c>] entry_SYSCALL_64_fastpath+0xbf/0xc1 Reported-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
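The lock-ordering idea in the commit above (release block devices only after dropping `device_list_mutex`) can be illustrated with a small userspace sketch. This is not the actual btrfs patch: `struct device`, `put_handle()` and `close_devices()` are hypothetical names, and a pthread mutex stands in for the kernel mutex. The list is detached while the lock is held, and the call that may take other locks runs only after the unlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a device with an open block-device handle. */
struct device {
	struct device *next;
	int handle_open;
};

static pthread_mutex_t device_list_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for blkdev_put(): assumed to take other locks internally. */
static void put_handle(struct device *dev)
{
	dev->handle_open = 0;
}

/* Detach the list under the mutex, then release handles with it dropped. */
int close_devices(struct device **list)
{
	struct device *pending, *dev;
	int closed = 0;

	assert(list != NULL);

	pthread_mutex_lock(&device_list_mutex);
	pending = *list;	/* take ownership of every entry */
	*list = NULL;
	pthread_mutex_unlock(&device_list_mutex);

	/* No list lock is held here, so put_handle() cannot invert lock order. */
	for (dev = pending; dev != NULL; dev = dev->next) {
		put_handle(dev);
		closed++;
	}
	return closed;
}
```

Calling `close_devices(&list)` on a two-entry list returns 2 and leaves `list` empty; the point of the design is only that no handle is released while the list lock is held.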
Liu Bo	a958eab0ed	Btrfs: fix memory leak in do_walk_down The extent buffer 'next' needs to be free'd conditionally. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Jeff Mahoney	c01f5f96f5	btrfs: btrfs_debug should consume fs_info when DEBUG is not defined We can hit unused variable warnings when btrfs_debug and friends are just aliases for no_printk. This is due to the fs_info not getting consumed by the function call, which can happen if convenience variables are used. This patch adds a new btrfs_no_printk static inline that consumes the convenience variable and does nothing else. It silences the unused variable warning and has no impact on the generated code: $ size fs/btrfs/extent_io.o* text data bss dec hex filename 44072 152 32 44256 ace0 fs/btrfs/extent_io.o.btrfs_no_printk 44072 152 32 44256 ace0 fs/btrfs/extent_io.o.no_printk Fixes: `27a0dd61a5` (Btrfs: make btrfs_debug match pr_debug handling related to DEBUG) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
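A minimal userspace sketch of the technique described in the commit above, assuming GCC or Clang (it relies on the `##__VA_ARGS__` extension). The names `demo_no_printk`, `demo_debug`, `struct fs_info` and `count_items` are hypothetical and are not the kernel's definitions. The no-op inline still takes `fs_info` as an argument, so a convenience variable passed to the debug macro counts as used even when `DEBUG` is not defined.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-filesystem context. */
struct fs_info {
	const char *name;
};

/* No-op that still consumes its arguments; format checking is kept. */
static inline __attribute__((format(printf, 2, 3)))
void demo_no_printk(const struct fs_info *fs_info, const char *fmt, ...)
{
	(void)fs_info;
	(void)fmt;
}

#ifdef DEBUG
#define demo_debug(fs_info, fmt, ...) \
	fprintf(stderr, "%s: " fmt "\n", (fs_info)->name, ##__VA_ARGS__)
#else
#define demo_debug(fs_info, fmt, ...) \
	demo_no_printk(fs_info, fmt, ##__VA_ARGS__)
#endif

int count_items(const struct fs_info *owner, int n)
{
	/* Convenience variable that is only read by the debug call. */
	const struct fs_info *fs_info = owner;

	assert(owner != NULL);
	demo_debug(fs_info, "counting %d items", n);
	return n;
}
```

Because the no-op is an empty inline function, an optimizing compiler is expected to drop the call entirely, which is consistent with the identical `size` output quoted in the commit message.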
Jeff Mahoney	04ab956ee6	btrfs: convert send's verbose_printk to btrfs_debug This was basically an open-coded, less flexible dynamic printk. We can just use btrfs_debug instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Jeff Mahoney	ab8d0fc48d	btrfs: convert pr_* to btrfs_* where possible For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:04 +02:00
Jeff Mahoney	62e855771d	btrfs: convert printk(KERN_* to use pr_* calls This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Jeff Mahoney	5d163e0e68	btrfs: unsplit printed strings CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Jeff Mahoney	cea67ab92d	btrfs: clean the old superblocks before freeing the device btrfs_rm_device frees the block device but then re-opens it using the saved device name. A race exists between the close and the re-open that allows the block size to be changed. The result is getting stuck forever in the reclaim loop in __getblk_slow. This patch moves the superblock cleanup before closing the block device, which is also consistent with other callers. We also don't need a private copy of dev_name as the whole routine operates under the uuid_mutex. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Liu Bo	02794222c4	Btrfs: kill BUG_ON in run_delayed_tree_ref In a corrupted btrfs image, we can come across this BUG_ON and get an unresponsive system, but if we return errors instead, its caller can handle everything gracefully by aborting the current transaction. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Josef Bacik	6bdf131fac	Btrfs: don't leak reloc root nodes on error We don't track the reloc roots in any sort of normal way, so the only way the root/commit_root nodes get free'd is if the relocation finishes successfully and the reloc root is deleted. Fix this by free'ing them in free_reloc_roots. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Masahiro Yamada	e2c8990734	btrfs: squash lines for simple wrapper functions Remove unneeded variables and assignments. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:38 +02:00
Liu Bo	6b722c1747	Btrfs: improve check_node to avoid reading corrupted nodes We need to check items in a node to make sure that we're reading a valid one, otherwise we could get various crashes while processing delayed_refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:05:28 +02:00
Liu Bo	a42cbec9c6	Btrfs: add error handling for extent buffer in print tree Somehow we missed btrfs_print_tree the last time we updated error handling for read_extent_block(). This keeps us from getting a NULL pointer panic when btrfs_print_tree's read_extent_block() fails. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:04:01 +02:00
Liu Bo	a43f7f8206	Btrfs: remove BUG_ON in start_transaction Since we could get errors from the concurrent aborted transaction, the check of this BUG_ON in start_transaction is not true any more. Say, while flushing free space cache inode's dirty pages, btrfs_finish_ordered_io -> btrfs_join_transaction_nolock (the transaction has been aborted.) -> BUG_ON(type == TRANS_JOIN_NOLOCK); Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:04:01 +02:00
Liu Bo	3eb548ee3a	Btrfs: memset to avoid stale content in btree node block While updating the btree, we may push items between sibling nodes/leaves. For leaves, data sections start from the end of the block and grow backwards, while for nodes we only have key pairs, which are stored one by one from the start of the block. So we could try to push key pairs from one node to the next node to its right in the tree, and after that, we update the node's nritems to reflect the correct end while leaving the stale content in the node. One may intentionally corrupt the fs image and access the stale content by bumping the nritems, causing various crashes. This takes the in-memory @nritems as the correct one and memsets the unused part of a btree node. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:03:47 +02:00
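The idea of clearing the unused tail of a node can be sketched in plain C. The layout here is made up for illustration (`NODE_SIZE`, `HDR_SIZE` and `KEY_PTR_SIZE` are assumed constants, not btrfs's on-disk sizes), and `zero_unused_tail()` is a hypothetical helper: everything past the last valid key pair, as given by the trusted in-memory item count, is zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed toy layout: a small header followed by fixed-size key pairs. */
#define NODE_SIZE    64
#define HDR_SIZE     8
#define KEY_PTR_SIZE 8

struct node {
	uint8_t bytes[NODE_SIZE];
};

/*
 * Zero everything after the last valid key pair and return the number of
 * bytes cleared. Returns 0 without writing if nritems does not fit.
 */
size_t zero_unused_tail(struct node *n, unsigned int nritems)
{
	size_t used = HDR_SIZE + (size_t)nritems * KEY_PTR_SIZE;

	assert(n != NULL);
	if (used > NODE_SIZE)
		return 0;

	memset(n->bytes + used, 0, NODE_SIZE - used);
	return NODE_SIZE - used;
}
```

With this toy layout, three items occupy the first 32 bytes, so the remaining 32 bytes are cleared and a later, inflated item count can only expose zeroes rather than stale key pairs.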
Liu Bo	3561b9db70	Btrfs: return gracefully from balance if fs tree is corrupted When relocating tree blocks, we firstly get block information from back references in the extent tree, we then search fs tree to try to find all parents of a block. However, if fs tree is corrupted, eg. if there're some missing items, we could come across these WARN_ONs and BUG_ONs. This makes us print some error messages and return gracefully from balance. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Josef Bacik	9c8e63db1d	Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written No reason to bug on in here, fs corruption could easily cause these things to happen. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Josef Bacik	8436ea91a1	Btrfs: kill the start argument to read_extent_buffer_pages Nobody uses this, it makes no sense to do partial reads of extent buffers. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Josef Bacik	afcdd129e0	Btrfs: add a flags field to btrfs_fs_info We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Qu Wenruo	ba8b04c1d4	btrfs: extend btrfs_set_extent_delalloc and its friends to support in-band dedupe and subpage size patchset Extend btrfs_set_extent_delalloc() and extent_clear_unlock_delalloc() parameters for both in-band dedupe and subpage sector size patchset. This should reduce conflict of both patchset and the effort to rebase them. Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Jeff Mahoney	897a41b116	btrfs: add dynamic debug support We can re-use the dynamic debugging descriptor to make use of the dynamic debugging mechanism but still use our own printk interface. Defining the DEBUG macro works as it did before. When it's defined, all of the messages default to print. We can also enable all debug messages at boot or module-load time using the 'dyndbg' and 'btrfs.dyndbg' options. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Luis Henriques	2309e79650	btrfs: Fix warning "variable ‘gen’ set but not used" Variable 'gen' in reada_for_search() is not used since commit `58dc4ce432` ("btrfs: remove unused parameter from readahead_tree_block"). This patch simply removes this variable. Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Luis Henriques	1f079fa2f8	btrfs: Fix warning "variable ‘blocksize’ set but not used" Variable 'blocksize' in reada_walk_down() is not used since commit `d3e46fea1b` ("btrfs: sink blocksize parameter to readahead_tree_block"). This patch simply removes this variable. Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Naohiro Aota	5d8eb6fe51	btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs Currently, btrfs_relocate_chunk() removes the relocated BG by itself. But the work can be done by btrfs_delete_unused_bgs() (and that is better since it trims the BG). Let's dedupe the code. While btrfs_delete_unused_bgs() already hits the relocated BG, it skips the BG since the BG has the "ro" flag set (to keep the balancing BG intact). On the other hand, btrfs cannot drop the "ro" flag here, in order to prevent additional writes. So this patch makes use of the "removed" flag. btrfs_delete_unused_bgs() now detects the flag to distinguish whether a read-only BG is relocating or not. Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Liu Bo	49303381f1	Btrfs: bail out if block group has different mixed flag Currently we allow inconsistency in the mixed flag (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA). We'd get ENOSPC if a block group has the mixed flag and btrfs doesn't. If that happens, we have one space_info with the mixed flag and another space_info with only BTRFS_BLOCK_GROUP_METADATA, and global_block_rsv.space_info points to the latter one, but all bytes from the block_group contribute to the mixed space_info, thus all allocations will fail with ENOSPC. This adds a check for the above case. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> [ updated message ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Liu Bo	2571e73967	Btrfs: fix memory leak in reading btree blocks So we can read a btree block via readahead or intentional read, and we can end up with a memory leak when something happens as follows, 1) readahead starts to read block A but does not wait for read completion, 2) btree_readpage_end_io_hook finds that block A is corrupted, and it needs to clear all block A's pages' uptodate bit. 3) meanwhile an intentional read kicks in and checks block A's pages' uptodate to decide which page needs to be read. 4) when some pages have the uptodate bit during 3)'s check so 3) doesn't count them for eb->io_pages, but they are later cleared by 2) so we has to readpage on the page, we get the wrong eb->io_pages which results in a memory leak of this block. This fixes the problem by firstly getting all pages's locking and then checking pages' uptodate bit. t1(readahead) t2(readahead endio) t3(the following read) read_extent_buffer_pages end_bio_extent_readpage for pg in eb: for page 0,1,2 in eb: if pg is uptodate: btree_readpage_end_io_hook(pg) num_reads++ if uptodate: eb->io_pages = num_reads SetPageUptodate(pg) _______________ for pg in eb: for page 3 in eb: read_extent_buffer_pages if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb: __extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate: clear_extent_buffer_uptodate(eb) num_reads++ for pg in eb: eb->io_pages = num_reads ClearPageUptodate(page) _______________ for pg in eb: if pg is NOT uptodate: __extent_read_full_page(pg) So t3's eb->io_pages is not consistent with the number of pages it's reading, and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative number so that we're not able to free the eb. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Liu Bo	e46a28ca3d	Btrfs: remove BUG() in raid56 This BUG() has been triggered by a fuzz testing image, which contains an invalid chunk type, ie. a single stripe chunk has the raid6 type. Btrfs can handle this gracefully by returning -EIO, so besides using btrfs_warn to give us more debugging information rather than a single BUG(), we can return error properly. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Lu Fengqi	afce772e87	btrfs: fix check_shared for fiemap ioctl check_shared identified an extent as shared only when the root_id or object_id differed. However, if an extent is referenced at different offsets of the same file, it should also be identified as shared. In addition, check_shared's loop complexity is at least n^3, so an extent with too many references can even cause a soft hang. First, add all delayed_refs to the ref_tree and calculate unique_refs; if unique_refs is greater than one, return BACKREF_FOUND_SHARED. Then individually add the on-disk references (inline/keyed) to the ref_tree and recalculate unique_refs to check whether it is greater than one. Because SHARED is returned as soon as there are two references, the time complexity is close to constant. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
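The early-exit rule from the commit above (an extent is shared as soon as two distinct references exist, where the file offset also counts) can be sketched as follows. This is a simplification under stated assumptions: the commit uses a ref_tree, while this sketch scans a flat array, and `struct ref` and `is_shared()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference key: which root, which file, which file offset. */
struct ref {
	uint64_t root_id;
	uint64_t object_id;
	uint64_t offset;
};

/*
 * Return 1 as soon as a reference differs from the first one in any field,
 * since that means at least two unique references exist; otherwise 0.
 */
int is_shared(const struct ref *refs, size_t n)
{
	size_t i;

	assert(refs != NULL || n == 0);

	for (i = 1; i < n; i++) {
		if (refs[i].root_id != refs[0].root_id ||
		    refs[i].object_id != refs[0].object_id ||
		    refs[i].offset != refs[0].offset)
			return 1;	/* second unique reference found */
	}
	return 0;
}
```

Two references with the same root and inode but different offsets are reported as shared, which is the case the original check missed; identical duplicates are not.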
David Sterba	b0de6c4c81	btrfs: create example debugfs file only in debugging build Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Eric Sandeen	07f6a48043	btrfs: fix perms on demonstration debugfs interface btrfs provides a helpful demonstration of how to export a global variable via debugfs; however, it is unique among other debugfs files in that it is world-writable, which causes some concern to people who are not familiar with its purpose. Fix it so that it is only user-writable. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Liu Bo	c79a175175	Btrfs: fix memory leak of block group cache While processing delayed refs, we may update block group's statistics and attach it to cur_trans->dirty_bgs, and later writing dirty block groups will process the list, which happens during btrfs_commit_transaction(). For whatever reason, the transaction is aborted and dirty_bgs is not processed in cleanup_transaction(), we end up with memory leak of these dirty block group cache. Since btrfs_start_dirty_block_groups() doesn't make it go to the commit critical section, this also adds the cleanup work inside it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Brian Foster	5cd9cee98b	xfs: log recovery tracepoints to track current lsn and buffer submission Log recovery has particular rules around buffer submission along with tricky corner cases where independent transactions can share an LSN. As such, it can be difficult to follow when/why buffers are submitted during recovery. Add a couple tracepoints to post the current LSN of a record when a new record is being processed and when a buffer is being skipped due to LSN ordering. Also, update the recover item class to include the LSN of the current transaction for the item being processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-26 08:34:52 +10:00
Brian Foster	60a4a22251	xfs: update metadata LSN in buffers during log recovery Log recovery is currently broken for v5 superblocks in that it never updates the metadata LSN of buffers written out during recovery. The metadata LSN is recorded in various bits of metadata to provide recovery ordering criteria that prevents transient corruption states reported by buffer write verifiers. Without such ordering logic, buffer updates can be replayed out of order and lead to false positive transient corruption states. This is generally not a corruption vector on its own, but corruption detection shuts down the filesystem and ultimately prevents a mount if it occurs during log recovery. This requires an xfs_repair run that clears the log and potentially loses filesystem updates. This problem is avoided in most cases as metadata writes during normal filesystem operation update the metadata LSN appropriately. The problem with log recovery not updating metadata LSNs manifests if the system happens to crash shortly after log recovery itself. In this scenario, it is possible for log recovery to complete all metadata I/O such that the filesystem is consistent. If a crash occurs after that point but before the log tail is pushed forward by subsequent operations, however, the next mount performs the same log recovery over again. If a buffer is updated multiple times in the dirty range of the log, an earlier update in the log might not be valid based on the current state of the associated buffer after all of the updates in the log had been replayed (before the previous crash). If a verifier happens to detect such a problem, the filesystem claims corruption and immediately shuts down. This commonly manifests in practice as directory block verifier failures such as the following, likely due to directory verifiers being particularly detailed in their checks as compared to most others: ... 
Mounting V5 Filesystem XFS (dm-0): Starting recovery (logdev: internal) XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line ... of \ file fs/xfs/libxfs/xfs_dir2_data.c. Caller xfs_dir3_data_verify ... ... Update log recovery to update the metadata LSN of recovered buffers. Since metadata LSNs are already updated by write verifier functions via attached log items, attach a dummy log item to the buffer during validation and explicitly set the LSN of the current transaction. This ensures that the metadata LSN of a buffer is updated based on whether the recovery I/O actually completes, and if so, that subsequent recovery attempts identify that the buffer is already up to date with respect to the current transaction. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-26 08:34:27 +10:00
Brian Foster	040c52c0aa	xfs: don't warn on buffers not being recovered due to LSN The log recovery buffer validation function is invoked in cases where a buffer update may be skipped due to LSN ordering. If the validation function happens to come across directory conversion situations (e.g., a dir3 block to data conversion), it may warn about seeing a buffer log format of one type and a buffer with a magic number of another. This warning is not valid as the buffer update is ultimately skipped. This is indicated by a current_lsn of NULLCOMMITLSN provided by the caller. As such, update xlog_recover_validate_buf_type() to only warn in such cases when a buffer update is expected. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-26 08:32:50 +10:00
Brian Foster	22db9af248	xfs: pass current lsn to log recovery buffer validation The current LSN must be available to the buffer validation function to provide the ability to update the metadata LSN of the buffer. Pass the current_lsn value down to xlog_recover_validate_buf_type() in preparation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-26 08:32:07 +10:00
Brian Foster	12818d24db	xfs: rework log recovery to submit buffers on LSN boundaries The fix to log recovery to update the metadata LSN in recovered buffers introduces the requirement that a buffer is submitted only once per current LSN. Log recovery currently submits buffers on transaction boundaries. This is not sufficient as the abstraction between log records and transactions allows for various scenarios where multiple transactions can share the same current LSN. If independent transactions share an LSN and both modify the same buffer, log recovery can incorrectly skip updates and leave the filesystem in an inconsistent state. In preparation for proper metadata LSN updates during log recovery, update log recovery to submit buffers for write on LSN change boundaries rather than transaction boundaries. Explicitly track the current LSN in a new struct xlog field to handle the various corner cases of when the current LSN may or may not change. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-26 08:22:16 +10:00
Dave Chinner	ddeb14f4fb	xfs: quiesce the filesystem after recovery on readonly mount Recently we've had a number of reports where log recovery on a v5 filesystem has reported corruptions that looked to be caused by recovery being re-run over the top of already-recovered metadata. This has uncovered a bug in recovery (fixed elsewhere) but the vector that caused this was largely unknown. A kdump test started tripping over this problem - the system would be crashed, the kdump kernel and environment would boot and dump the kernel core image, and then the system would reboot. After reboot, the root filesystem was triggering log recovery and corruptions were being detected. The metadumps indicated the above log recovery issue. What is happening is that the kdump kernel and environment is mounting the root device read-only to find the binaries needed to do its work. The result of this is that it is running log recovery. However, because there were unlinked files and EFIs to be processed by recovery, the completion of phase 1 of log recovery could not mark the log clean. And because it's a read-only mount, the unmount process does not write records to the log to mark it clean, either. Hence on the next mount of the filesystem, log recovery was run again across all the metadata that had already been recovered and this is what triggered corruption warnings. To avoid this problem, we need to ensure that a read-only mount always updates the log when it completes the second phase of recovery. We already handle this sort of issue with rw->ro remount transitions, so the solution is as simple as quiescing the filesystem at the appropriate time during the mount process. This results in the log being marked clean so the mount behaviour recorded in the logs on repeated RO mounts will change (i.e. log recovery will no longer be run on every mount until a RW mount is done). This is a user visible change in behaviour, but it is harmless. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-26 08:21:44 +10:00
Dave Chinner	292378edcb	xfs: remote attribute blocks aren't really userdata When adding a new remote attribute, we write the attribute to the new extent before the allocation transaction is committed. This means we cannot reuse busy extents as that violates crash consistency semantics. Hence we currently treat remote attribute extent allocation like userdata because it has the same overwrite ordering constraints as userdata. Unfortunately, this also allows the allocator to incorrectly apply extent size hints to the remote attribute extent allocation. This results in interesting failures, such as transaction block reservation overruns and in-memory inode attribute fork corruption. To fix this, we need to separate the busy extent reuse configuration from the userdata configuration. This changes the definition of XFS_BMAPI_METADATA slightly - it now means that allocation is metadata and reuse of busy extents is acceptable due to the metadata ordering semantics of the journal. If this flag is not set, it means the allocation has unordered data writeback, and hence busy extent reuse is not allowed. It no longer implies the allocation is for user data, just that the data write will not be strictly ordered. This matches the semantics for both user data and remote attribute block allocation. As such, this patch changes the "userdata" field to a "datatype" field, and adds a "no busy reuse" flag to the field. When we detect an unordered data extent allocation, we immediately set the no reuse flag. We then set the "user data" flags based on the inode fork we are allocating the extent to. Hence we only set userdata flags on data fork allocations now and consider attribute fork remote extents to be an unordered metadata extent. The result is that remote attribute extents now have the expected allocation semantics, and the data fork allocation behaviour is completely unchanged. It should be noted that there may be other ways to fix this (e.g. 
use ordered metadata buffers for the remote attribute extent data write) but they are more invasive and difficult to validate both from a design and implementation POV. Hence this patch takes the simple, obvious route to fixing the problem... Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-26 08:21:28 +10:00
Wolfram Sang	97beb3ae02	fs: compat_ioctl: add pretimeout functions for watchdogs Watchdog core now handles those ioctls centrally, so we want 64 bit support, too. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Acked-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>	2016-09-24 09:27:18 +02:00
Linus Torvalds	b22734a550	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Josef fixed a problem when quotas are enabled with his latest ENOSPC rework, and Jeff added more checks into the subvol ioctls to avoid tripping up lookup_one_len" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: ensure that file descriptor used with subvol ioctls is a dir Btrfs: handle quota reserve failure properly	2016-09-23 13:39:37 -07:00
Linus Torvalds	e47f2e50ea	Merge tag 'configfs-for-4.8-2' of git://git.infradead.org/users/hch/configfs Pull configfs fix from Christoph Hellwig: "One more trivial fix for the binary attribute code from Phil Turnbull" * tag 'configfs-for-4.8-2' of git://git.infradead.org/users/hch/configfs: configfs: Return -EFBIG from configfs_write_bin_file.	2016-09-23 09:45:15 -07:00
Jeff Layton	bec782b4fc	nfsd: fix dprintk in nfsd4_encode_getdeviceinfo nfserr is big-endian, so we should convert it to host-endian before printing it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-23 10:18:52 -04:00
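The byte-order point in the nfsd dprintk fix above can be sketched in userspace C: a status code held in network (big-endian) order should be converted with `ntohl()` before it is formatted. The helper name `format_status()` and the `"err %u"` format are assumptions made for illustration, not the kernel's code.

```c
#include <arpa/inet.h>	/* htonl(), ntohl() */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the kernel patch): format a status code
 * stored in network byte order as a host-order decimal string.
 * Returns what snprintf() returns: the length of the formatted text.
 */
static int format_status(char *buf, size_t len, uint32_t status_be)
{
	assert(buf != NULL && len > 0);

	/* Convert big-endian wire order to host order before printing. */
	return snprintf(buf, len, "err %u", (unsigned int)ntohl(status_be));
}
```

Printing `status_be` directly would show a byte-swapped number on little-endian hosts, which is the kind of misleading log output the commit describes.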
Daniel Wagner	2a446a5d99	NFS: cache_lib: use complete() instead of complete_all() There is only one waiter for the completion, therefore there is no need to use complete_all(). Let's make that clear by using complete() instead of complete_all(). The generic caching code from sunrpc is calling revisit() only once. The usage pattern of the completion is: waiter context waker context do_cache_lookup_wait() nfs_cache_defer_req_alloc() init_completion() do_cache_lookup() nfs_cache_wait_for_upcall() wait_for_completion_timeout() nfs_dns_cache_revisit() complete() nfs_cache_defer_req_put() Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-23 09:40:12 -04:00
Daniel Wagner	024de8f1ad	NFS: direct: use complete() instead of complete_all() There is only one waiter for the completion, therefore there is no need to use complete_all(). Let's make that clear by using complete() instead of complete_all(). nfs_file_direct_write() or nfs_file_direct_read() allocates a request object via nfs_direct_req_alloc(), which initializes the completion. The request object then is freed later in the exit path. Between the initialization and the release either nfs_direct_write_schedule_iovec() or nfs_direct_read_schedule_iovec() is called, which will asynchronously process the request. The calling function waits via nfs_direct_wait() till the async work has been done. Thus there is only one waiter on the completion. nfs_direct_pgio_init() and nfs_direct_read_completion() are passed via function pointers to nfs pageio. The first function does reference counting (get_dreq() and put_dreq()) which ensures that nfs_direct_read_completion() and nfs_direct_read_schedule_iovec() only call the completion path once. The usage pattern of the completion is: waiter context waker context nfs_file_direct_write() dreq = nfs_direct_req_alloc() init_completion() nfs_direct_write_schedule_iovec() nfs_direct_wait() wait_for_completion_killable() nfs_direct_write_schedule_work() nfs_direct_complete() complete() nfs_file_direct_read() dreq = nfs_direct_req_alloc() init_completion() nfs_direct_read_schedule_iovec() nfs_direct_wait() wait_for_completion_killable() nfs_direct_read_schedule_iovec() nfs_direct_complete() complete() nfs_direct_read_completion() nfs_direct_complete() complete() Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-23 09:14:16 -04:00
David S. Miller	d6989d4bbe	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2016-09-23 06:46:57 -04:00
Eric W. Biederman	e98d413703	devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts In 99.99% of the cases only root in a user namespace can mount /dev/pts and in those cases the owner of /dev/pts/ptmx will remain root.root In the oddball case where someone else has CAP_SYS_ADMIN this code modifies the /dev/pts mount code to use current_fsuid and current_fsgid as the values to use when creating the /dev/ptmx inode. As is done when any other file is created. This is a code simplification, and it allows running without a root user entirely. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-23 11:31:31 +02:00
Eric W. Biederman	6bd1d8758d	devpts: Remove sync_filesystems devpts does not and never will have anything to sync so don't bother calling sync_filesystems on remount. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-23 11:31:31 +02:00
Eric W. Biederman	40b320e1c7	devpts: Make devpts_kill_sb safe if fsi is NULL Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-23 11:31:31 +02:00
Eric W. Biederman	c1b241f0c1	devpts: Simplify devpts_mount by using mount_nodev Now that all of the work of setting up a superblock has been moved to devpts_fill_super simplify devpts_mount by calling mount_nodev instead of rolling mount_nodev by hand. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-23 11:31:31 +02:00
Eric W. Biederman	180d904442	devpts: Move the creation of /dev/pts/ptmx into fill_super The code makes more sense here and things are just clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-23 11:31:31 +02:00
Eric W. Biederman	dee87d4736	devpts: Move parse_mount_options into fill_super Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-23 11:31:31 +02:00
Eric W. Biederman	213b067ce3	nsfs: Simplify __ns_get_path Move mntget from the very beginning of __ns_get_path to the success path of __ns_get_path, and remove the mntget calls. This removes the possibility that there will be a mntget/mntput pair if __ns_get_path has to retry, and generally simplifies the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 20:06:20 -05:00
Eric W. Biederman	7872559664	Merge branch 'nsfs-ioctls' into HEAD From: Andrey Vagin <avagin@openvz.org> Each namespace has an owning user namespace and now there is no way to discover these relationships. Pid and user namespaces are hierarchical. There is no way to discover parent-child relationships either. Why might we want to know the relationships between namespaces? One use would be visualization, in order to understand the running system. Another would be to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? One more use case (which is usually called abnormal) is checkpoint/restart. In CRIU we are going to dump and restore nested namespaces. There [1] was a discussion about which interface to choose to determine relationships between namespaces. Eric suggested adding two ioctl-s [2]: > Grumble, Grumble. I think this may actually a case for creating ioctls > for these two cases. Now that random nsfs file descriptors are bind > mountable the original reason for using proc files is not as pressing. > > One ioctl for the user namespace that owns a file descriptor. > One ioctl for the parent namespace of a namespace file descriptor. Here is an implementation of these ioctl-s. $ man man7/namespaces.7 ... Since Linux 4.X, the following ioctl(2) calls are supported for namespace file descriptors. The correct syntax is: fd = ioctl(ns_fd, ioctl_type); where ioctl_type is one of the following: NS_GET_USERNS Returns a file descriptor that refers to an owning user namespace. NS_GET_PARENT Returns a file descriptor that refers to a parent namespace. This ioctl(2) can be used for pid and user namespaces. For user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning. In addition to generic ioctl(2) errors, the following specific ones can occur: EINVAL NS_GET_PARENT was called for a nonhierarchical namespace. EPERM The requested namespace is outside of the current namespace scope. [1] https://lkml.org/lkml/2016/7/6/158 [2] https://lkml.org/lkml/2016/7/9/101 Changes for v2: * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing outside of the init namespace, so we can return EPERM in this case too. > The fewer special cases the easier the code is to get > correct, and the easier it is to read. // Eric Changes for v3: * rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: "W. Trevor King" <wking@tremily.us> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@canonical.com>	2016-09-22 20:00:36 -05:00
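The `fd = ioctl(ns_fd, ioctl_type)` calling convention quoted in the merge message above can be sketched from userspace as follows. The request codes repeat the values in the kernel's `<linux/nsfs.h>` UAPI header so the sketch also builds against older headers; the wrapper name `ns_related_fd()` is hypothetical.

```c
#include <assert.h>
#include <sys/ioctl.h>

/*
 * Request codes as defined in <linux/nsfs.h>; repeated here only so the
 * sketch compiles where that header is not installed.
 */
#ifndef NS_GET_USERNS
#define NSIO		0xb7
#define NS_GET_USERNS	_IO(NSIO, 0x1)
#define NS_GET_PARENT	_IO(NSIO, 0x2)
#endif

/*
 * Hypothetical wrapper: returns a new file descriptor for the namespace
 * related to ns_fd (its owning user namespace or its parent, depending
 * on request), or -1 with errno set by ioctl(2).
 */
static int ns_related_fd(int ns_fd, unsigned long request)
{
	assert(request == NS_GET_USERNS || request == NS_GET_PARENT);

	return ioctl(ns_fd, request);
}
```

A caller would typically obtain `ns_fd` with `open("/proc/self/ns/pid", O_RDONLY)` and close the returned descriptor when done. Per the text above, `EPERM` indicates the requested namespace is outside the caller's scope and `EINVAL` indicates `NS_GET_PARENT` on a nonhierarchical namespace.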
Andrey Vagin	a7306ed8d9	nsfs: add ioctl to get a parent namespace Pid and user namespaces are hierarchical. There is no way to discover parent-child relationships. In the future we will use this interface to dump and restore nested namespaces. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2016-09-22 19:59:41 -05:00
Andrey Vagin	6786741dbf	nsfs: add ioctl to get an owning user namespace for ns file descriptor Each namespace has an owning user namespace and now there is no way to discover these relationships. Understanding namespace relationships makes it possible to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? After a long discussion, Eric W. Biederman proposed using ioctl-s for this purpose. The NS_GET_USERNS ioctl returns a file descriptor to an owning user namespace. It returns EPERM if a target namespace is outside of a current user namespace. v2: rename parent to relative v3: Add a missing mntput when returning -EAGAIN --EWB Acked-by: Serge Hallyn <serge@hallyn.com> Link: https://lkml.org/lkml/2016/7/6/158 Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2016-09-22 19:59:40 -05:00
Andrey Vagin	bcac25a58b	kernel: add a helper to get an owning user namespace for a namespace Return -EPERM if an owning user namespace is outside of a process's current user namespace. v2: In the first version ns_get_owner returned ENOENT for init_user_ns. This special case was removed from this version. There is nothing outside of init_user_ns, so we can return EPERM. v3: rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2016-09-22 19:59:39 -05:00
Trond Myklebust	78d04af499	NFS: nfs_prime_dcache must validate the filename Before we try to stash it in the dcache, we need to at least check that the filename passed to us by the server is non-empty and doesn't contain any illegal '\0' or '/' characters. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 17:02:03 -04:00
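The check described in the nfs_prime_dcache commit above (name must be non-empty and free of '\0' and '/') can be sketched as a small standalone predicate. This is an illustration under assumptions, not the kernel's implementation; the function name `name_is_valid()` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical predicate: accept a server-supplied name of len bytes
 * only if it is non-empty and contains neither a NUL byte nor a '/'.
 */
static bool name_is_valid(const char *name, size_t len)
{
	assert(name != NULL || len == 0);

	if (len == 0)
		return false;

	/* memchr() scans exactly len bytes, so embedded NULs are caught. */
	if (memchr(name, '\0', len) != NULL || memchr(name, '/', len) != NULL)
		return false;

	return true;
}
```

The length is passed explicitly because names taken from the wire are not guaranteed to be NUL-terminated, so `strlen()`-style checks would miss an embedded '\0'.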
Jeff Layton	a1d617d8f1	nfs: allow blocking locks to be awoken by lock callbacks Add a waitqueue head to the client structure. Have clients set a wait on that queue prior to requesting a lock from the server. If the lock is blocked, then we can use that to wait for wakeups. Note that we do need to do this "manually" since we need to set the wait on the waitqueue prior to requesting the lock, but requesting a lock can involve activities that can block. However, only do that for NFSv4.1 locks, either by compiling out all of the waitqueue handling when CONFIG_NFS_V4_1 is disabled, or skipping all of it at runtime if we're dealing with v4.0, or v4.1 servers that don't send lock callbacks. Note too that even when we expect to get a lock callback, RFC5661 section 20.11.4 is pretty clear that we still need to poll for them, so we do still sleep on a timeout. We do however always poll at the longest interval in that case. Signed-off-by: Jeff Layton <jlayton@redhat.com> [Anna: nfs4_retry_setlk() "status" should default to -ERESTARTSYS] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 15:54:27 -04:00
Yunlei He	5d4c0af41f	f2fs: preallocate blocks for encrypted file This patch allows preallocating data blocks for buffered aio writes in encrypted files. Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: fix to avoid BUG_ON] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-22 11:43:08 -07:00
Chao Yu	5bc994a043	f2fs: show dirty inode number This patch enables showing dirty inode number in procfs. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-22 11:43:07 -07:00
Chao Yu	8b038c70df	f2fs: support IO error injection This patch adds support for IO error injection for testing the IO error tolerance of f2fs. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-22 11:43:06 -07:00
Chao Yu	866969668a	f2fs: fix to return error number of read_all_xattrs correctly We treat all errors in read_all_xattrs as a no-memory error, which hides the real reason for the failure. Fix it by returning the correct errno in order to reflect the real cause. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-22 11:43:05 -07:00
Chao Yu	ebfa732217	f2fs: make f2fs_filetype_table static There is no more user of f2fs_filetype_table outside of dir.c, make it static. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-09-22 11:43:04 -07:00
Eric W. Biederman	93f0a88bd4	devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts In 99.99% of the cases only root in a user namespace can mount /dev/pts and in those cases the owner of /dev/pts/ptmx will remain root.root In the oddball case where someone else has CAP_SYS_ADMIN this code modifies the /dev/pts mount code to use current_fsuid and current_fsgid as the values to use when creating the /dev/ptmx inode. As is done when any other file is created. This is a code simplification, and it allows running without a root user entirely. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:32:26 -05:00
Eric W. Biederman	985e5d856c	devpts: Remove sync_filesystems devpts does not and never will have anything to sync so don't bother calling sync_filesystems on remount. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:32:20 -05:00
Eric W. Biederman	0d126a7ff7	devpts: Make devpts_kill_sb safe if fsi is NULL Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:32:16 -05:00
Eric W. Biederman	ec0a9ba6f2	devpts: Simplify devpts_mount by using mount_nodev Now that all of the work of setting up a superblock has been moved to devpts_fill_super simplify devpts_mount by calling mount_nodev instead of rolling mount_nodev by hand. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:32:12 -05:00
Eric W. Biederman	7dd17f7134	devpts: Move the creation of /dev/pts/ptmx into fill_super The code makes more sense here and things are just clearer. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:32:08 -05:00
Eric W. Biederman	208904793a	devpts: Move parse_mount_options into fill_super Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:31:58 -05:00
Eric W. Biederman	df75e7748b	userns: When the per user per user namespace limit is reached return ENOSPC The current error codes returned when the per user per user namespace limit is hit (EINVAL, EUSERS, and ENFILE) are wrong. I asked for advice on linux-api and it was made clear that those were the wrong error codes, but a correct error code was not suggested. The best general error code I have found for hitting a resource limit is ENOSPC. It is not perfect but as it is unambiguous it will serve until someone comes up with a better error code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-09-22 13:25:56 -05:00
Jeff Layton	d2f3a7f918	nfs: move nfs4 lock retry attempt loop to a separate function This also consolidates the waiting logic into a single function, instead of having it spread across two like it is now. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 13:56:04 -04:00
Jeff Layton	1ea67dbd98	nfs: move nfs4_set_lock_state call into caller We need to have this info set up before adding the waiter to the waitqueue, so move this out of the _nfs4_proc_setlk and into the caller. That's more efficient anyway since we don't need to do this more than once if we end up waiting on the lock. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 13:56:04 -04:00
Jeff Layton	db783688d4	nfs: add handling for CB_NOTIFY_LOCK in client For now, the callback doesn't do anything. Support for that will be added in later patches. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 13:56:04 -04:00
Jeff Layton	a8ce377a5d	nfs: track whether server sets MAY_NOTIFY_LOCK flag We want to handle the two cases differently, such that we poll more aggressively when we don't expect a callback. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 13:56:04 -04:00
Jeff Layton	66f570ab73	nfs: use safe, interruptible sleeps when waiting to retry LOCK We actually want to use TASK_INTERRUPTIBLE sleeps when we're in the process of polling for a NFSv4 lock. If there is a signal pending when the task wakes up, then we'll be returning an error anyway. So, we might as well wake up immediately for non-fatal signals as well. That allows us to return to userland more quickly in that case, but won't change the error that userland sees. Also, there is no need to use the *_unsafe sleep variants here, as no vfs-layer locks should be held at this point. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 13:56:04 -04:00
Jeff Layton	75575ddf29	nfs: eliminate pointless and confusing do_vfs_lock wrappers Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 13:56:04 -04:00
Jeff Layton	b60475c940	nfs: the length argument to read_buf should be unsigned Since it gets passed through to xdr_inline_decode, we might as well have read_buf expect what it expects -- a size_t. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-22 13:56:04 -04:00
Ross Zwisler	cca32b7eeb	ext4: allow DAX writeback for hole punch Currently when doing a DAX hole punch with ext4 we fail to do a writeback. This is because the logic around filemap_write_and_wait_range() in ext4_punch_hole() only looks for dirty page cache pages in the radix tree, not for dirty DAX exceptional entries. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-22 11:49:38 -04:00
Jan Kara	e03a9976af	jbd2: fix lockdep annotation in add_transaction_credits() Thomas has reported a lockdep splat hitting in add_transaction_credits(). The problem is that that function calls jbd2_might_wait_for_commit() while holding j_state_lock which is wrong (we do not really wait for transaction commit while holding that lock). Fix the problem by moving jbd2_might_wait_for_commit() into places where we are ready to wait for transaction commit and thus j_state_lock is unlocked. Cc: stable@vger.kernel.org Fixes: `1eaa566d36` Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2016-09-22 11:44:06 -04:00
Peter Zijlstra	87709e28dc	fs/locks: Use percpu_down_read_preempt_disable() Avoid spurious preemption. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: der.herr@hofr.at Cc: paulmck@linux.vnet.ibm.com Cc: riel@redhat.com Cc: tj@kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-09-22 15:25:54 +02:00
Peter Zijlstra	7c3f654d8e	fs/locks: Replace lg_local with a per-cpu spinlock As Oleg suggested, replace file_lock_list with a structure containing the hlist head and a spinlock. This completely removes the lglock from fs/locks. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: der.herr@hofr.at Cc: paulmck@linux.vnet.ibm.com Cc: riel@redhat.com Cc: tj@kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-09-22 15:25:53 +02:00
Peter Zijlstra	aba3766073	fs/locks: Replace lg_global with a percpu-rwsem Replace the global part of the lglock with a percpu-rwsem. Since fcl_lock is a spinlock and itself nests under i_lock, which too is a spinlock we cannot acquire sleeping locks at locks_{insert,remove}_global_locks(). We can however wrap all fcl_lock acquisitions with percpu_down_read such that all invocations of locks_{insert,remove}_global_locks() have that read lock held. This allows us to replace the lg_global part of the lglock with the write side of the rwsem. In the absence of writers, percpu_{down,up}_read() are free of atomic instructions. This further avoids the very long preempt-disable regions caused by lglock on larger machines. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: der.herr@hofr.at Cc: paulmck@linux.vnet.ibm.com Cc: riel@redhat.com Cc: tj@kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-09-22 15:25:53 +02:00
Jan Kara	030b533c4f	fs: Avoid premature clearing of capabilities Currently, notify_change() clears capabilities or IMA attributes by calling security_inode_killpriv() before calling into ->setattr. Thus it happens before any other permission checks in inode_change_ok() and user is thus allowed to trigger clearing of capabilities or IMA attributes for any file he can look up e.g. by calling chown for that file. This is unexpected and can lead to user DoSing a system. Fix the problem by calling security_inode_killpriv() at the end of inode_change_ok() instead of from notify_change(). At that moment we are sure user has permissions to do the requested change. References: CVE-2015-1350 Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-22 10:56:19 +02:00
Jan Kara	31051c85b5	fs: Give dentry to inode_change_ok() instead of inode inode_change_ok() will be responsible for clearing capabilities and IMA extended attributes and as such will need dentry. Give it as an argument to inode_change_ok() instead of an inode. Also rename inode_change_ok() to setattr_prepare() to better reflect that it also makes some modifications in addition to checks. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-22 10:56:19 +02:00
Jan Kara	6249033076	fuse: Propagate dentry down to inode_change_ok() To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. Propagate it down to fuse_do_setattr(). Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-22 10:56:19 +02:00
Jan Kara	fd5472ed44	ceph: Propagate dentry down to inode_change_ok() To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. ceph_setattr() has the dentry easily available but __ceph_setattr() is also called from ceph_set_acl() where dentry is not easily available. Luckily that call path does not need inode_change_ok() to be called anyway. So reorganize functions a bit so that inode_change_ok() is called only from paths where dentry is available. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-22 10:56:19 +02:00
Jan Kara	69bca80744	xfs: Propagate dentry down to inode_change_ok() To avoid clearing of capabilities or security related extended attributes too early, inode_change_ok() will need to take dentry instead of inode. Propagate dentry down to functions calling inode_change_ok(). This is rather straightforward except for xfs_set_mode() function which does not have dentry easily available. Luckily that function does not call inode_change_ok() anyway so we just have to do a little dance with function prototypes. Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-22 10:56:19 +02:00
Jan Kara	073931017b	posix_acl: Clear SGID bit when setting file permissions When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2016-09-22 10:55:32 +02:00
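The rule stated in the posix_acl commit above (clear the setgid bit when the caller is neither in the owning group nor holds CAP_FSETID) can be sketched as a pure function on the mode bits. The function name and the two boolean parameters are assumptions standing in for the kernel's group-membership and capability checks; this is not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>	/* S_ISGID */
#include <sys/types.h>	/* mode_t */

/*
 * Hypothetical helper: compute the mode to store when permissions are
 * changed by setting an ACL. in_owning_group and has_fsetid stand in
 * for the kernel's group-membership and CAP_FSETID checks.
 */
static mode_t mode_after_acl_set(mode_t mode, bool in_owning_group,
				 bool has_fsetid)
{
	/* Only permission and set-id/sticky bits are expected here. */
	assert((mode & ~(mode_t)07777) == 0);

	/* Same rule chmod(2) applies: drop setgid for unprivileged callers. */
	if (!in_owning_group && !has_fsetid)
		mode &= ~(mode_t)S_ISGID;

	return mode;
}
```

Applying the same rule on the ACL path is what closes the bypass the commit message describes, since both routes end up changing the file's permission bits.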
Jeff Mahoney	325c50e3ce	btrfs: ensure that file descriptor used with subvol ioctls is a dir If the subvol/snapshot create/destroy ioctls are passed a regular file with execute permissions set, we'll eventually Oops while trying to do inode->i_op->lookup via lookup_one_len. This patch ensures that the file descriptor refers to a directory. Fixes: `cb8e70901d` (Btrfs: Fix subvolume creation locking rules) Fixes: `76dda93c6a` (Btrfs: add snapshot/subvolume destroy ioctl) Cc: <stable@vger.kernel.org> #v2.6.29+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-09-21 17:22:16 -07:00
Josef Bacik	1e5ec2e709	Btrfs: handle quota reserve failure properly btrfs/022 was spitting a warning for the case that we exceed the quota. If we fail to make our quota reservation we need to clean up our data space reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-09-21 17:22:16 -07:00
Chao Yu	e0d735c1cc	gfs2: fix to detect failure of register_shrinker register_shrinker can fail since commit `1d3d4437ea` ("vmscan: per-node deferred work"); we should detect its failure, otherwise we may fail to register the shrinker even though the gfs2 module has been initialized successfully. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2016-09-21 12:09:40 -05:00
Martin Brandenburg	0c95ad7636	orangefs: bump minimum userspace version OrangeFS 2.9.6 was released without support for the features op. Thus OrangeFS 2.9.7 will be required to use it. Signed-off-by: Martin Brandenburg <martin@omnibond.com>	2016-09-21 12:37:23 -04:00
Richard Weinberger	6a45b3628c	ovl: Fix info leak in ovl_lookup_temp() The function uses the memory address of a struct dentry as unique id. While the address-based directory entry is only visible to root it is IMHO still worth fixing since the temporary name does not have to be a kernel address. It can be any unique number. Replace it by an atomic integer which is allowed to wrap around. Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v3.18+ Fixes: `e9be9d5e76` ("overlay filesystem")	2016-09-21 16:37:07 +02:00
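The idea in the ovl_lookup_temp() fix above, deriving a unique temporary name from a wrapping atomic counter instead of a kernel address, can be sketched in userspace with C11 atomics. The `"#%x"` name format, the counter variable, and the function name are assumptions for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Process-wide counter; unsigned arithmetic wraps around harmlessly. */
static atomic_uint temp_id;

/*
 * Hypothetical helper: write a unique temporary name into buf and
 * return its length as reported by snprintf(). No pointer value is
 * exposed in the name.
 */
static int make_temp_name(char *buf, size_t len)
{
	unsigned int id;

	assert(buf != NULL && len > 0);

	/* atomic_fetch_add() returns the old value; add 1 for the new id. */
	id = atomic_fetch_add(&temp_id, 1) + 1;

	return snprintf(buf, len, "#%x", id);
}
```

Because the increment is atomic, concurrent callers each receive a distinct id until the counter wraps, which the commit message explicitly allows.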
Christian Lamparter	86f0e06767	debugfs: introduce a public file_operations accessor This patch introduces an accessor which can be used by the users of debugfs (drivers, fs, ...) to get the original file_operations struct. It also removes the REAL_FOPS_DEREF macro in file.c and converts the code to use the public version. Previously, REAL_FOPS_DEREF was only available within the file.c of debugfs. But having a public getter available for debugfs users is important as some drivers (carl9170 and b43) use the pointer of the original file_operations in conjunction with container_of() within their debugfs implementations. Reviewed-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Christian Lamparter <chunkeey@gmail.com> Cc: stable <stable@vger.kernel.org> # 4.7+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2016-09-21 12:13:31 +02:00
Jiri Olsa	df04abfd18	fs/proc/kcore.c: Add bounce buffer for ktext data We hit hardened usercopy feature check for kernel text access by reading kcore file: usercopy: kernel memory exposure attempt detected from ffffffff8179a01f (<kernel text>) (4065 bytes) kernel BUG at mm/usercopy.c:75! Bypassing this check for kcore by adding bounce buffer for ktext data. Reported-by: Steve Best <sbest@redhat.com> Fixes: `f5509cc18d` ("mm: Hardened usercopy") Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-20 13:32:49 -07:00
Jiri Olsa	f5beeb1851	fs/proc/kcore.c: Make bounce buffer global for read Next patch adds bounce buffer for ktext area, so it's convenient to have single bounce buffer for both vmalloc/module and ktext cases. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-20 13:32:49 -07:00
Ingo Molnar	41a66072c3	Merge branch 'efi/urgent' into efi/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-09-20 16:58:59 +02:00
Chao Yu	f844cd0d76	nfs: cover ->migratepage with CONFIG_MIGRATION It will be cleaner to use CONFIG_MIGRATION to cover nfs' private .migratepage in nfs_file_aops, as we do in other parts of the nfs operations. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-20 09:29:39 -04:00
Ingo Molnar	b2c16e1efd	Merge branch 'linus' into x86/asm, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-09-20 08:29:21 +02:00
Junxiao Bi	63b52c4936	Revert "ocfs2: bump up o2cb network protocol version" This reverts commit `38b52efd21` ("ocfs2: bump up o2cb network protocol version"). This commit made rolling upgrade fail. When one node is upgraded to new version with this commit, the remaining nodes will fail to establish connections to it, then the application like VMs on the remaining nodes can't be live migrated to the upgraded one. This will cause an outage. Since negotiate hb timeout behavior didn't change without this commit, so revert it. Fixes: `38b52efd21` ("ocfs2: bump up o2cb network protocol version") Link: http://lkml.kernel.org/r/1471396924-10375-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Ashish Samant	d21c353d5e	ocfs2: fix start offset to ocfs2_zero_range_for_truncate() If we punch a hole on a reflink such that following conditions are met: 1. start offset is on a cluster boundary 2. end offset is not on a cluster boundary 3. (end offset is somewhere in another extent) or (hole range > MAX_CONTIG_BYTES(1MB)), we don't COW the first cluster starting at the start offset. But in this case, we were wrongly passing this cluster to ocfs2_zero_range_for_truncate() to zero out. This will modify the cluster in place and zero it in the source too. Fix this by skipping this cluster in such a scenario. To reproduce: 1. Create a random file of say 10 MB xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile 2. Reflink it reflink -f 10MBfile reflnktest 3. Punch a hole starting at a cluster boundary with a range greater than 1MB. You can also use a range that will put the end offset in another extent. fallocate -p -o 0 -l `1048615` reflnktest 4. sync 5. Check the first cluster in the source file. (It will be zeroed out). dd if=10MBfile iflag=direct bs=<cluster size> count=1 \| hexdump -C Link: http://lkml.kernel.org/r/1470957147-14185-1-git-send-email-ashish.samant@oracle.com Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reported-by: Saar Maoz <saar.maoz@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Eric Ren <zren@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Joseph Qi	3bb8b653c8	ocfs2: fix double unlock in case retry after free truncate log If ocfs2_reserve_cluster_bitmap_bits() fails with ENOSPC, it will try to free truncate log and then retry. Since ocfs2_try_to_free_truncate_log will lock/unlock global bitmap inode, we have to unlock it before calling this function. But when retry reserve and it fails with no global bitmap inode lock taken, it will unlock again in error handling branch and BUG. This issue also exists if no need retry and then ocfs2_inode_lock fails. So fix it. Fixes: `2070ad1aeb` ("ocfs2: retry on ENOSPC if sufficient space in truncate log") Link: http://lkml.kernel.org/r/57D91939.6030809@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Jan Kara	96d41019e3	fanotify: fix list corruption in fanotify_get_response() fanotify_get_response() calls fsnotify_remove_event() when it finds that group is being released from fanotify_release() (bypass_perm is set). However the event it removes need not be only in the group's notification queue but it can have already moved to access_list (userspace read the event before closing the fanotify instance fd) which is protected by a different lock. Thus when fsnotify_remove_event() races with fanotify_release() operating on access_list, the list can get corrupted. Fix the problem by moving all the logic removing permission events from the lists to one place - fanotify_release(). Fixes: `5838d4442b` ("fanotify: fix double free of pending permission events") Link: http://lkml.kernel.org/r/1473797711-14111-3-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Jan Kara	12703dbfeb	fsnotify: add a way to stop queueing events on group shutdown Implement a function that can be called when a group is being shutdown to stop queueing new events to the group. Fanotify will use this. Fixes: `5838d4442b` ("fanotify: fix double free of pending permission events") Link: http://lkml.kernel.org/r/1473797711-14111-2-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Junxiao Bi	d5bf141893	ocfs2: fix trans extend while free cached blocks The root cause of this issue is the same as the one fixed by the last patch, but this time credits for allocator inode and group descriptor may not be consumed before trans extend. The following error was caught: WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]() Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 2037 Comm: rm Tainted: G W 4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 #2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 Call Trace: dump_stack+0x48/0x5c warn_slowpath_common+0x95/0xe0 warn_slowpath_null+0x1a/0x20 start_this_handle+0x4c3/0x510 [jbd2] jbd2__journal_restart+0x161/0x1b0 [jbd2] jbd2_journal_restart+0x13/0x20 [jbd2] ocfs2_extend_trans+0x74/0x220 [ocfs2] ocfs2_free_cached_blocks+0x16b/0x4e0 [ocfs2] ocfs2_run_deallocs+0x70/0x270 [ocfs2] ocfs2_commit_truncate+0x474/0x6f0 [ocfs2] ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2] ocfs2_evict_inode+0x28/0x60 [ocfs2] evict+0xab/0x1a0 iput_final+0xf6/0x190 iput+0xc8/0xe0 do_unlinkat+0x1b7/0x310 SyS_unlinkat+0x22/0x40 system_call_fastpath+0x12/0x71 ---[ end trace a62437cb060baa71 ]--- JBD2: rm wants too many credits (149 > 128) Link: http://lkml.kernel.org/r/1473674623-11810-2-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Junxiao Bi	2b0ad0085a	ocfs2: fix trans extend while flush truncate log Every time, ocfs2_extend_trans() included a credit for truncate log inode, but as that inode had been managed by jbd2 running transaction first time, it will not consume that credit until jbd2_journal_restart(). Since total credits to extend always included the un-consumed ones, there will be more and more un-consumed credit, and at last jbd2_journal_restart() will fail because the credit number exceeds half of the max transaction credit. The following error was caught when unlinking a large file with many extents: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]() Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 13626 Comm: unlink Tainted: G W 4.1.12-37.6.3.el6uek.x86_64 #2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 Call Trace: dump_stack+0x48/0x5c warn_slowpath_common+0x95/0xe0 warn_slowpath_null+0x1a/0x20 start_this_handle+0x4c3/0x510 [jbd2] jbd2__journal_restart+0x161/0x1b0 [jbd2] jbd2_journal_restart+0x13/0x20 [jbd2] ocfs2_extend_trans+0x74/0x220 [ocfs2] ocfs2_replay_truncate_records+0x93/0x360 [ocfs2] __ocfs2_flush_truncate_log+0x13e/0x3a0 [ocfs2] ocfs2_remove_btree_range+0x458/0x7f0 [ocfs2] ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2] ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2] ocfs2_wipe_inode+0x136/0x6a0 [ocfs2] ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2] ocfs2_evict_inode+0x28/0x60 [ocfs2] evict+0xab/0x1a0 iput_final+0xf6/0x190 iput+0xc8/0xe0 do_unlinkat+0x1b7/0x310 SyS_unlink+0x16/0x20 system_call_fastpath+0x12/0x71 ---[ end trace 28aa7410e69369cf ]--- JBD2: unlink wants too many credits (251 > 128) Link: http://lkml.kernel.org/r/1473674623-11810-1-git-send-email-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Kirill A. Shutemov	31b4beb473	ipc/shm: fix crash if CONFIG_SHMEM is not set Commit `c01d5b3007` ("shmem: get_unmapped_area align huge page") makes use of shm_get_unmapped_area() in shm_file_operations() unconditional to CONFIG_MMU. As Tony Battersby pointed out, this can lead to a NULL-pointer dereference on a machine with CONFIG_MMU=y and CONFIG_SHMEM=n. In this case ipc/shm is backed by ramfs, which doesn't provide f_op->get_unmapped_area for configurations with MMU. The solution is to provide a dummy f_op->get_unmapped_area for ramfs when CONFIG_MMU=y, which just calls current->mm->get_unmapped_area(). Fixes: `c01d5b3007` ("shmem: get_unmapped_area align huge page") Link: http://lkml.kernel.org/r/20160912102704.140442-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Tony Battersby <tonyb@cybernetics.com> Tested-by: Tony Battersby <tonyb@cybernetics.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [4.7.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Ian Kent	7cbdb4a286	autofs: use dentry flags to block walks during expire Somewhere along the way the autofs expire operation has changed to hold a spin lock over expired dentry selection. The autofs indirect mount expired dentry selection is complicated and quite lengthy so it isn't appropriate to hold a spin lock over the operation. Commit `47be61845c` ("fs/dcache.c: avoid soft-lockup in dput()") added a might_sleep() to dput() causing a WARN_ONCE() about this usage to be issued. But the spin lock doesn't need to be held over this check; the autofs dentry info flags are enough to block walks into dentries during the expire. I've left the direct mount expire as it is (for now) because it is much simpler and quicker than the indirect mount expire and adding spin lock releases and re-acquires would do nothing more than add overhead. Fixes: `47be61845c` ("fs/dcache.c: avoid soft-lockup in dput()") Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Reported-by: Takashi Iwai <tiwai@suse.de> Tested-by: Takashi Iwai <tiwai@suse.de> Cc: Takashi Iwai <tiwai@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:17 -07:00
Joseph Qi	e6f0c6e617	ocfs2/dlm: fix race between convert and migration Commit `ac7cf246df` ("ocfs2/dlm: fix race between convert and recovery") checks if lockres master has changed to identify whether new master has finished recovery or not. This will introduce a race that right after old master does umount ( means master will change), a new convert request comes. In this case, it will reset lockres state to DLM_RECOVERING and then retry convert, and then fail with lockres->l_action being set to OCFS2_AST_INVALID, which will cause inconsistent lock level between ocfs2 and dlm, and then finally BUG. Since dlm recovery will clear lock->convert_pending in dlm_move_lockres_to_recovery_list, we can use it to correctly identify the race case between convert and recovery. So fix it. Fixes: `ac7cf246df` ("ocfs2/dlm: fix race between convert and recovery") Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-09-19 15:36:16 -07:00
Jeff Layton	ca440c383a	pnfs: add a new mechanism to select a layout driver according to an ordered list Currently, the layout driver selection code always chooses the first one from the list. That's not really ideal however, as the server can send the list of layout types in any order that it likes. It's up to the client to select the best one for its needs. This patch adds an ordered list of preferred driver types and has the selection code sort the list of available layout drivers according to it. Any unrecognized layout type is sorted to the end of the list. For now, the order of preference is hardcoded, but it should be possible to make this configurable in the future. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:11:13 -04:00
Andy Adamson	04fa2c6bb5	NFS pnfs data server multipath session trunking Try all multipath addresses for a data server. The first address that successfully connects and creates a session is the DS mount address. All subsequent addresses are tested for session trunking and added as aliases. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:37 -04:00
Andy Adamson	ad0849a7ef	NFS test session trunking with exchange id Use an async exchange id call to test for session trunking. To conform with RFC 5661 section 18.35.4, the Non-Update on Existing Clientid case, save the exchange id verifier in cl_confirm and use it for the session trunking exchange id test. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	04ea1b3e6d	NFS add xprt switch addrs test to match client Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	ba84db96aa	NFS detect session trunking Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	e7b7cbf662	NFS refactor nfs4_check_serverowner_major_id For session trunking, to compare nfs41_exchange_id_res with existing nfs_client Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	8e548edb40	NFS refactor nfs4_match_clientids For session trunking, to compare nfs41_exchange_id_res with existing nfs_client. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Andy Adamson	8d89bd70bc	NFS setup async exchange_id Testing an rpc_xprt for session trunking should not delay application progress over already established transports. Setup exchange_id to be able to be an async call to test an rpc_xprt for session trunking use. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Trond Myklebust	5405fc44c3	NFSv4.x: Add kernel parameter to control the callback server Add support for the kernel parameter nfs.callback_nr_threads to set the number of threads that will be assigned to the callback channel. Add support for the kernel parameter nfs.max_session_cb_slots to set the maximum size of the callback channel slot table. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Trond Myklebust	bb6aeba736	NFSv4.x: Switch to using svc_set_num_threads() to manage the callback threads This will allow us to bump the number of callback threads at will. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Trond Myklebust	3b01c11ee8	NFSv4.x: Fix up the global tracking of the callback server Ensure that the nfs_callback_info[] array correctly tracks the struct svc_serv. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Trond Myklebust	d002526886	SUNRPC: Initialise struct svc_serv backchannel fields during __svc_create() Clean up. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Trond Myklebust	f4b52bb084	NFSv4.x: Set up struct svc_serv_ops for the callback channel In order to manage the threads using svc_set_num_threads, we need to fill in a few extra fields. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:36 -04:00
Jeff Layton	3132e49ece	pnfs: track multiple layout types in fsinfo structure The current NFSv4.1/pNFS client assumes that the MDS supports only one layout type. While that is true for most existing servers, this can change in the near future. For now, this patch just plumbs in the ability to track a list of layouts in the fsinfo structure. The existing behavior of the client is preserved, by having it just select the first entry in the list. Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Jeff Layton <jlayton@poochiereds.net> Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2016-09-19 13:08:35 -04:00
Vivek Goyal	8eac98b8be	ovl: during copy up, switch to mounter's creds early Now, we have the notion that copy up of a file is done with the creds of mounter of overlay filesystem (as opposed to task). Right now before we switch creds, we do some vfs_getattr() operations in the context of task and that itself can fail. We should do that getattr() using the creds of mounter instead. So this patch switches to mounter's creds early during copy up process so that even vfs_getattr() is done with mounter's creds. Do not call revert_creds() unless we have already called ovl_override_creds(). [Reported by Arnd Bergmann] Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-09-19 16:50:59 +02:00
Al Viro	5d3ddd84ea	udf: don't bother with full-page write optimisations in adinicb case ... it would get converted to regular if such had been attempted Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-19 10:47:01 +02:00
Christoph Hellwig	25f4e70291	ext2: use iomap to implement DAX Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:30:29 +10:00
Christoph Hellwig	6750ad7198	ext2: stop passing buffer_head to ext2_get_blocks Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:28:39 +10:00
Christoph Hellwig	6c31f495d1	xfs: use iomap to implement DAX Another user of buffer_heads bites the dust. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:28:38 +10:00
Christoph Hellwig	e372843a40	xfs: refactor xfs_setfilesize Rename the current function to __xfs_setfilesize and add a non-static wrapper that also takes care of creating the transaction. This new helper will be used by the new iomap-based DAX path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:26:41 +10:00
Christoph Hellwig	66642c5c1d	xfs: take the ilock shared if possible in xfs_file_iomap_begin We always just read the extent first, and will later lock exclusively after first dropping the lock in case we actually allocate blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:26:39 +10:00
Christoph Hellwig	17879e8f86	xfs: fix locking for DAX writes So far DAX writes inherited the locking from direct I/O writes, but the direct I/O model of using shared locks for writes is actually wrong for DAX. For direct I/O we're out of any standards and don't have to provide the POSIX-required exclusion between writers, but for DAX, which gets transparently enabled on applications without any knowledge of it, we can't simply drop the requirement. Even worse, this only happens for aligned writes and thus doesn't show up for many typical use cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:24:50 +10:00
Christoph Hellwig	a7d73fe6c5	dax: provide an iomap based fault handler Very similar to the existing dax_fault function, but instead of using the get_block callback we rely on the iomap_ops vector from iomap.c. That also avoids having to do two calls into the file system for write faults. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:24:50 +10:00
Christoph Hellwig	a254e56812	dax: provide an iomap based dax read/write path This is a much simpler implementation of the DAX read/write path that makes use of the iomap infrastructure. It does not try to mirror the direct I/O calling conventions and thus doesn't have to deal with i_dio_count or the end_io handler, but instead leaves locking and filesystem-specific I/O completion to the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:24:49 +10:00
Christoph Hellwig	b0d5e82fcf	dax: don't pass buffer_head to copy_user_dax This way we can use this helper for the iomap based DAX implementation as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:24:49 +10:00
Christoph Hellwig	1aaba0958e	dax: don't pass buffer_head to dax_insert_mapping This way we can use this helper for the iomap based DAX implementation as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:24:49 +10:00
Christoph Hellwig	befb503ca6	iomap: expose iomap_apply outside iomap.c This allows the DAX code to use it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:24:49 +10:00
Christoph Hellwig	ecd50729f7	iomap: add IOMAP_F_NEW flag Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:24:37 +10:00
Christoph Hellwig	51446f5ba4	xfs: rewrite and optimize the delalloc write path Currently xfs_iomap_write_delay does up to lookups in the inode extent tree, which is rather costly especially with the new iomap based write path and small write sizes. But it turns out that the low-level xfs_bmap_search_extents gives us all the information we need in the regular delalloc buffered write path: - it will return us an extent covering the block we are looking up if it exists. In that case we can simply return that extent to the caller and are done - it will tell us if we are beyond the last current allocated block with an eof return parameter. In that case we can create a delalloc reservation and use the also returned information about the last extent in the file as the hint to size our delalloc reservation. - it can tell us that we are writing into a hole, but that there is an extent beyond this hole. In this case we can create a delalloc reservation that covers the requested size (possibly capped to the next existing allocation). All that can be done in one single routine instead of bouncing up and down a few layers. This reduced the CPU overhead of the block mapping routines and also simplified the code a lot. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:10:21 +10:00
Christoph Hellwig	85a6e764ff	xfs: make xfs_inode_set_eofblocks_tag cheaper for the common case For long growing file writes we will usually already have the eofblocks tag set when adding more speculative preallocations. Add a flag in the inode to allow us to skip the fairly expensive AG-wide spinlocks and multiple radix tree operations in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:09:48 +10:00
Christoph Hellwig	f8e3a82575	xfs: factor out a helper to calculate the EOF alignment And drop the pointless mp argument to xfs_iomap_eof_align_last_fsb, while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:09:28 +10:00
Christoph Hellwig	e9c4973638	xfs: move xfs_bmbt_to_iomap up We'll need it earlier in the file soon, so move the unchanged function to the top of xfs_iomap.c Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 11:09:12 +10:00
Darrick J. Wong	3fd129b63f	xfs: set up per-AG free space reservations One unfortunate quirk of the reference count and reverse mapping btrees -- they can expand in size when blocks are written to other allocation groups if, say, one large extent becomes a lot of tiny extents. Since we don't want to start throwing errors in the middle of CoWing, we need to reserve some blocks to handle future expansion. The transaction block reservation counters aren't sufficient here because we have to have a reserve of blocks in every AG, not just somewhere in the filesystem. Therefore, create two per-AG block reservation pools. One feeds the AGFL so that rmapbt expansion always succeeds, and the other feeds all other metadata so that refcountbt expansion never fails. Use the count of how many reserved blocks we need to have on hand to create a virtual reservation in the AG. Through selective clamping of the maximum length of allocation requests and of the length of the longest free extent, we can make it look like there's less free space in the AG unless the reservation owner is asking for blocks. In other words, play some accounting tricks in-core to make sure that we always have blocks available. On the plus side, there's nothing to clean up if we crash, which is in contrast to the strategy that the rough draft used (actually removing extents from the freespace btrees). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 10:30:52 +10:00
Darrick J. Wong	385d655861	xfs: defer should allow ->finish_item to request a new transaction When xfs_defer_finish calls ->finish_item, it's possible that (refcount) won't be able to finish all the work in a single transaction. When this happens, the ->finish_item handler should shorten the log done item's list count, update the work item to reflect where work should continue, and return -EAGAIN so that defer_finish knows to retain the pending item on the pending list, roll the transaction, and restart processing where we left off. Plumb in the code and document how this mechanism is supposed to work. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2016-09-19 10:26:25 +10:00
Darrick J. Wong	c611cc0360	xfs: count the blocks in a btree Provide a helper method to count the number of blocks in a short form btree. The refcount and rmap btrees need to know the number of blocks already in use to set up their per-AG block reservations during mount. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 10:25:20 +10:00
Darrick J. Wong	4ed3f68792	xfs: create a standard btree size calculator code Create a helper to generate AG btree height calculator functions. This will be used (much) later when we get to the refcount btree. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 10:25:03 +10:00
Darrick J. Wong	a1d46cffaf	xfs: remove xfs_btree_bigkey Remove the xfs_btree_bigkey mess and simply make xfs_btree_key big enough to hold both keys in-core. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 10:24:36 +10:00
Darrick J. Wong	cd00158ce3	xfs: convert RUI log formats to use variable length arrays Use variable length array declarations for RUI log items, and replace the open coded sizeof formulae with a single function. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 10:24:27 +10:00
Darrick J. Wong	e43c460dcd	iomap: add a flag to report shared extents Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 10:13:02 +10:00
Christoph Hellwig	5f4e5752a8	fs: add iomap_file_dirty Originally-From: Christoph Hellwig <hch@lst.de> This function uses the iomap infrastructure to re-write all pages in a given range. This is useful for doing a copy-up of COW ranges, and might be useful for scrubbing in the future. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>	2016-09-19 10:12:45 +10:00
Linus Torvalds	4d2899d73c	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Small set of cifs fixes" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: Move check for prefix path to within cifs_get_root() Compare prepaths when comparing superblocks Fix memory leaks in cifs_do_mount()	2016-09-16 17:09:48 -07:00
Jeff Layton	89dfdc964b	nfsd: eliminate cb_minorversion field We already have that info in the client pointer. No need to pass around a copy. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-16 16:15:52 -04:00
Jeff Layton	1983a66f57	nfsd: don't set a FL_LAYOUT lease for flexfiles layouts We currently can hit a deadlock (of sorts) when trying to use flexfiles layouts with XFS. XFS will call break_layout when something wants to write to the file. In the case of the (super-simple) flexfiles layout driver in knfsd, the MDS and DS are the same machine. The client can get a layout and then issue a v3 write to do its I/O. XFS will then call xfs_break_layouts, which will cause a CB_LAYOUTRECALL to be issued to the client. The client however can't return the layout until the v3 WRITE completes, but XFS won't allow the write to proceed until the layout is returned. Christoph says: XFS only cares about block-like layouts where the client has direct access to the file blocks. I'd need to look how to propagate the flag into break_layout, but in principle we don't need to do any recalls on truncate ever for file and flexfile layouts. If we're never going to recall the layout, then we don't even need to set the lease at all. Just skip doing so on flexfiles layouts by adding a new flag to struct nfsd4_layout_ops and skipping the lease setting and removal when that flag is true. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2016-09-16 16:15:52 -04:00
Mike Galbraith	420902c9d0	reiserfs: Unlock superblock before calling reiserfs_quota_on_mount() If we hold the superblock lock while calling reiserfs_quota_on_mount(), we can deadlock our own worker - mount blocks kworker/3:2, sleeps forever more. crash> ps|grep UN 715 2 3 ffff880220734d30 UN 0.0 0 0 [kworker/3:2] 9369 9341 2 ffff88021ffb7560 UN 1.3 493404 123184 Xorg 9665 9664 3 ffff880225b92ab0 UN 0.0 47368 812 udisks-daemon 10635 10403 3 ffff880222f22c70 UN 0.0 14904 936 mount crash> bt ffff880220734d30 PID: 715 TASK: ffff880220734d30 CPU: 3 COMMAND: "kworker/3:2" #0 [ffff8802244c3c20] schedule at ffffffff8144584b #1 [ffff8802244c3cc8] __rt_mutex_slowlock at ffffffff814472b3 #2 [ffff8802244c3d28] rt_mutex_slowlock at ffffffff814473f5 #3 [ffff8802244c3dc8] reiserfs_write_lock at ffffffffa05f28fd [reiserfs] #4 [ffff8802244c3de8] flush_async_commits at ffffffffa05ec91d [reiserfs] #5 [ffff8802244c3e08] process_one_work at ffffffff81073726 #6 [ffff8802244c3e68] worker_thread at ffffffff81073eba #7 [ffff8802244c3ec8] kthread at ffffffff810782e0 #8 [ffff8802244c3f48] kernel_thread_helper at ffffffff81450064 crash> rd ffff8802244c3cc8 10 ffff8802244c3cc8: ffffffff814472b3 ffff880222f23250 .rD.....P2.".... ffff8802244c3cd8: 0000000000000000 0000000000000286 ................ ffff8802244c3ce8: ffff8802244c3d30 ffff880220734d80 0=L$.....Ms .... ffff8802244c3cf8: ffff880222e8f628 0000000000000000 (.."............ ffff8802244c3d08: 0000000000000000 0000000000000002 ................ 
crash> struct rt_mutex ffff880222e8f628 struct rt_mutex { wait_lock = { raw_lock = { slock = 65537 } }, wait_list = { node_list = { next = 0xffff8802244c3d48, prev = 0xffff8802244c3d48 } }, owner = 0xffff880222f22c71, save_state = 0 } crash> bt 0xffff880222f22c70 PID: 10635 TASK: ffff880222f22c70 CPU: 3 COMMAND: "mount" #0 [ffff8802216a9868] schedule at ffffffff8144584b #1 [ffff8802216a9910] schedule_timeout at ffffffff81446865 #2 [ffff8802216a99a0] wait_for_common at ffffffff81445f74 #3 [ffff8802216a9a30] flush_work at ffffffff810712d3 #4 [ffff8802216a9ab0] schedule_on_each_cpu at ffffffff81074463 #5 [ffff8802216a9ae0] invalidate_bdev at ffffffff81178aba #6 [ffff8802216a9af0] vfs_load_quota_inode at ffffffff811a3632 #7 [ffff8802216a9b50] dquot_quota_on_mount at ffffffff811a375c #8 [ffff8802216a9b80] finish_unfinished at ffffffffa05dd8b0 [reiserfs] #9 [ffff8802216a9cc0] reiserfs_fill_super at ffffffffa05de825 [reiserfs] RIP: 00007f7b9303997a RSP: 00007ffff443c7a8 RFLAGS: 00010202 RAX: 00000000000000a5 RBX: ffffffff8144ef12 RCX: 00007f7b932e9ee0 RDX: 00007f7b93d9a400 RSI: 00007f7b93d9a3e0 RDI: 00007f7b93d9a3c0 RBP: 00007f7b93d9a2c0 R8: 00007f7b93d9a550 R9: 0000000000000001 R10: ffffffffc0ed040e R11: 0000000000000202 R12: 000000000000040e R13: 0000000000000000 R14: 00000000c0ed040e R15: 00007ffff443ca20 ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Mike Galbraith <mgalbraith@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>	2016-09-16 17:20:59 +02:00
Miklos Szeredi	2b6bc7f48d	ovl: lookup: do getxattr with mounter's permission The getxattr() in ovl_is_opaquedir() was missed when converting all operations on underlying fs to be done under mounter's permission. This patch fixes this by moving the ovl_override_creds()/revert_creds() out from ovl_lookup_real() to ovl_lookup(). Also convert to using vfs_getxattr() instead of directly calling i_op->getxattr(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2016-09-16 14:12:11 +02:00
Miklos Szeredi	8b326c61de	ovl: copy_up_xattr(): use strnlen Be defensive about what underlying fs provides us in the returned xattr list buffer. strlen() may overrun the buffer, so use strnlen() and WARN if the contents are not properly null terminated. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org>	2016-09-16 14:12:11 +02:00
Phil Turnbull	42857cf512	configfs: Return -EFBIG from configfs_write_bin_file. The check for writing more than cb_max_size bytes does not 'goto out' so it is a no-op which allows users to vmalloc an arbitrary amount. Fixes: `03607ace80` ("configfs: implement binary attributes") Cc: stable@kernel.org Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2016-09-16 12:58:28 +02:00
Miklos Szeredi	814184fd40	vfat: don't use ->d_time Use d_fsdata instead, which is the same size. Introduce helpers to hide the typecasts. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>	2016-09-16 12:44:21 +02:00

47524 Commits